Starburst Warp Speed #

Starburst Enterprise includes Starburst Warp Speed. You can take advantage of the performance improvements by updating your cluster to suitable hardware and configuring the Starburst Warp Speed utility connector for any catalog accessing object storage with the Hive, Iceberg, or Delta Lake connector. A cluster deployment on Amazon Elastic Kubernetes Service (EKS), Microsoft Azure Kubernetes Service (AKS), or Google Kubernetes Engine (GKE) is required. For more information, see the installation and configuration instructions.

Overview #

Starburst Warp Speed automatically creates and maintains indexes and caches with your queried data, determined by the characteristics of the processed queries. The index and cache data is stored on local storage attached to each worker node in the cluster.

Default acceleration #

When a query accesses a column that is not accelerated, the system performs data and index materialization on the cluster to accelerate future access to the data in the column. This process of creating the indexes and caches is also called warmup. Warmup is performed individually by each worker based on the processed splits and uses the local high performance storage of the worker. Typically, these are SSD NVMe drives.

When new data is added to a table or the index and cache creation are in progress, the new portions of the table that are not accelerated are served from the object storage. After the asynchronous indexing and caching is complete, query processing accessing that data is accelerated, because the data is available directly in the cluster from the indexes and caches, and no longer has to be retrieved from the remote object storage.

This results in immediately improved performance for recently used datasets. In addition to the automatic default acceleration, advanced users can create specific warmup rules. The default acceleration has a lower priority than a user-created warmup rule.

Default acceleration is not performed for SELECT * FROM <table_name> queries that are commonly used to explore a table rather than to retrieve specific data.

Index and cache resiliency #

Starburst Warp Speed optionally provides index and cache resiliency. When a new index is created or data is cached, it is stored on the SSD NVMe (nonvolatile memory express) drives attached to each worker by default, and in addition on a dedicated, shared bucket in your object storage. When you scale the cluster, the indexes and data cache remain available in the shared storage.

The resiliency feature enables fast warmup when scaling up by adding worker nodes to a cluster. This also eliminates the need for keeping instances idling when a cluster is not in use. Instead, when a cluster needs to warm an index or data cache, it first accesses the designated bucket on the object storage to check if the indexes or cache are ready and loads them accordingly. If the indexes or cache are not available in the shared object storage or cannot be loaded for any reason, the data is warmed as usual.

There are three storage tiers accessed for queries:

  • Hot data and index: Uses SSD NVMe attached to the workers in a cluster to process queries and store hot data and a cache for optimal performance. This layer is enabled by default with Starburst Warp Speed.
  • Warm data and index: Indexes and cached data are stored in a designated object storage bucket. This layer is shared amongst all of your workers to ensure minimal resources are allocated to indexing when scaling up a cluster and adds resiliency. When a cluster is scaled down or eliminated, and nodes are shut down, the indexes remain available as warm data. The warm layer is disabled by default. Enabling index and cache resiliency makes the warm tier available for fast warmup of indexes and caches.
  • Cold data: Direct access of your object storage.

With this tiered approach, you can continue using your existing scaling and auto-scaling policies.

Storing the indexes and data caches can fill up your object storage. Setting a lifecycle policy to control the associated cost is highly recommended. Refer to the following lifecycle documentation for your cloud provider:

As a best practice, allocate at least the average available SSD storage of your cluster.

Acceleration types #

Starburst Warp Speed uses different types of acceleration to improve query processing performance:

These acceleration types are used automatically by default acceleration, and can also be configured manually with warmup rules defined with the REST API.

Data cache acceleration #

Data cache acceleration is the system that caches the raw data objects from the object storage directly on the high-performance storage attached to the workers in the cluster. The data from one or more objects is processed in the cluster as splits. The data from the splits and associated metadata are managed as a row group. These row groups are used to accelerate any queries that access the contained data. The row groups are stored in a proprietary columnar block caching format.

Use the WARM_UP_TYPE_DATA value in the warmUpType property to configure data cache acceleration for a specific column with the REST API.

Index acceleration #

Index acceleration uses the data in a specific column in a table to create an index. This index is added to the row group and used when queries access a column to filter rows. It accelerates queries that use predicates, joins, filters, and searches, and minimizes data scanning.

The index types (such as bitmap, tree, and others), are determined automatically by the column data types, and data patterns and characteristics.

Use the WARM_UP_TYPE_DATA value in the warmUpType property to configure index acceleration for a specific column with the REST API.

Text search acceleration #

Text search acceleration creates an index of the content of text columns using Apache Lucene. This index is used in query predicates. It accelerates queries that use predicates of filters and searches on text columns.

Starburst Warp Speed automatically enables text search acceleration, and maintains the indexes.

Text search acceleration uses Apache Lucene indexing to accelerate text analytics and provide fast text filters, particularly with LIKE predicates. The KeywordAnalyzer provides full support for LIKE semantics to search for the exact appearance of a value in a filtered column.

A use case is a search for a specific short string in a larger column, such as a description. For example, consider a table with a column named city and a value New York, United States. The index is case-sensitive. When indexing is applied to the column, the following query returns that record because the LIKE predicate is an exact match:

SELECT *
FROM tbl
WHERE city LIKE '%New York%'

The following queries do not return the results because the LIKE predicates are not an exact match. The first query is missing a space in the pattern:

SELECT *
FROM tbl
WHERE city LIKE '%NewYork%'

The second query uses lowercase:

SELECT *
FROM tbl
WHERE city LIKE '%new york%'

Text search acceleration indexing is recommended for:

  • Queries with LIKE predicates, prefix or suffix queries, or queries that use the starts_with functions.
  • Range queries on string columns. A common use is dates that are stored as strings that have range predicates. For example, date_string>='yyyy-mm-dd'.

Text search acceleration indexing supports the following data types:

  • CHAR
  • VARCHAR
  • CHAR ARRAY
  • VARCHAR ARRAY

Use the WARM_UP_TYPE_LUCENE value in the warmUpType property to configure text search acceleration for a specific column with the REST API.

Limitations:

  • The maximum supported string length is 33k characters.
  • Queries with nested expressions, such as starts_with(some_nested_method(col1), 'aaa'), are not accelerated.

Automated clean up #

When the available storage on the cluster is about to run out of storage space, index and cache elements are automatically deleted. As a user or administrator, you don’t need to manage index and cache allocation. When the storage capacity threshold is exceeded, the system deletes the following content until the clean up threshold is reached:

  • All expired content based on the TTL value.
  • Content with the lowest values on the priority property that were created as a result of the default acceleration.
  • Content related to custom warmup rules for indexing and caching.

After a clean up, new data is indexed and cached as needed based on the data access by the processed queries.

Configuration #

Starburst Warp Speed is supported on Kubernetes-based clusters on EKS, AKS, or GKE with specific configuration. It is suitable for catalogs using the Hive, Iceberg, or Delta Lake connector with minimal configuration changes for the catalog properties file.

Refer to our detailed installation and configuration instructions to get started.

Index and cache usage #

Now that you have configured , you can view acceleration details and other summary statistics on the Index and cache usage tab in the Starburst Enterprise web UI.

For more information, see the reference documentation.

Cluster management #

Starburst Warp Speed accommodates cluster expansion and contraction. Be aware of the following when scaling up or down:

  • When scaling a cluster horizontally (adding or removing worker nodes), Starburst Warp Speed continues operating, assuming that requirements are properly fulfilled. A cluster restart is not required when adding or removing nodes.

  • Scaling a cluster vertically to use larger nodes requires a cluster restart, which facilitates the replacement of all worker nodes to the larger node size.

  • After restarting the cluster, the default acceleration becomes active. New caches and indexes get created and populated based on the query workload. Any user-defined warmup rules are lost after restart, unless a database is configured for Starburst Warp Speed.

FAQ #

What happens in case data is not cached and indexed? Am I getting partial results?

No. In case a split can be served from SSD, it is served; but if not, Starburst Warp Speed gets the data for this split from the object storage to complete the query and sends back the results. Then the index and cache are created asynchronously, based on priority and available SSD storage, so that future queries can leverage the index and cache.

Is there a chance a user can get stale results?

No. Starburst Warp Speed uses a mapping between the generated splits and index and cache data on SSDs during query processing. If a split can be served from SSD, it is; but if not, gets the data for this split from the object storage and then asynchronously indexes and caches it as appropriate.

What is the caching and indexing speed?

Performance depends on many different factors. For example, indexing and caching the entire TPC-DS SF1000 dataset takes about 20 minutes on a cluster with two workers with the machine size r5d.8xlarge.