Smart indexing and caching #

Starburst Enterprise includes smart indexing and caching. You can take advantage of the performance improvements by updating your cluster to suitable hardware and configuring the Starburst smart indexing and caching connector for any catalog that accesses object storage with the Hive, Iceberg, or Delta Lake connector. A cluster deployment on Amazon Elastic Kubernetes Service (EKS) or Microsoft Azure Kubernetes Service (AKS) is required. For more information, see the installation and configuration instructions.

Overview #

Smart indexing and caching automatically creates and maintains indexes and caches for your queried data, based on the characteristics of the processed queries.

Default acceleration #

When a query accesses a column that is not accelerated, the system performs data and index materialization on the cluster to accelerate future access to the data in that column. This process of creating the indexes and caches is also called warmup. Warmup is performed individually by each worker, based on the splits it processes, and uses the worker's local high-performance storage, typically NVMe SSD drives.

When new data is added to the table, or while index and cache creation is still in progress, the portions of the table that are not yet accelerated are served from the object storage. After the asynchronous indexing and caching completes, query processing that accesses the data is accelerated, because the data is available directly in the cluster from the indexes and caches and no longer has to be retrieved from the remote object storage.

This results in immediately improved performance for recently used datasets. In addition to the automatic default acceleration, advanced users can create specific indexing and caching warmup rules. The default acceleration has a lower priority than a user-created warmup rule.

Default acceleration is not performed for SELECT * FROM <table_name> queries, which are commonly used to explore a table rather than to retrieve specific data.
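
For illustration, the following minimal sketch uses the Trino Python client to run one query that is a candidate for default acceleration and one that is not. The host, catalog, schema, and table names are hypothetical placeholders:

    # Hedged sketch using the Trino Python client (pip install trino).
    import trino

    conn = trino.dbapi.connect(
        host="coordinator.example.com",  # hypothetical coordinator host
        port=8080,
        user="analyst",
        catalog="hive",    # a catalog backed by the Hive connector
        schema="sales",    # hypothetical schema
    )
    cur = conn.cursor()

    # Accessing specific columns with a predicate: the accessed columns are
    # candidates for default acceleration, so similar queries speed up once
    # the asynchronous warmup completes.
    cur.execute(
        "SELECT order_id, total_price FROM orders "
        "WHERE order_date > DATE '2024-01-01'"
    )
    rows = cur.fetchall()

    # A bare SELECT * is treated as table exploration and does not trigger
    # default acceleration.
    cur.execute("SELECT * FROM orders")
    sample = cur.fetchmany(10)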

Acceleration types #

Smart indexing and caching uses different types of acceleration to improve query processing performance:

  • Data cache acceleration
  • Index acceleration
  • Text search acceleration

These acceleration types are used automatically by default acceleration, and can also be configured manually with warmup rules defined with the REST API.

Data cache acceleration #

Data cache acceleration caches the raw data objects from the object storage directly on the high-performance storage attached to the workers in the cluster. The data from one or more objects is processed in the cluster as splits. The data from the splits and associated metadata are managed as a row group in smart indexing and caching.

These row groups are used to accelerate any queries that access the contained data. The row groups are stored in a proprietary columnar block caching format.

Use the COL_WARM_UP_TYPE_DATA value in the colWarmUpType property to configure data cache acceleration for a specific column with the REST API.
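
As an illustration, a warmup rule might be created along the following lines with the requests library. This is a hedged sketch: the endpoint path, the authentication scheme, and the payload fields other than colWarmUpType are assumptions, and the REST API documentation defines the real contract:

    # Hedged sketch: endpoint path, credentials, and most field names are
    # assumptions; only colWarmUpType is documented above.
    import requests

    rule = {
        "schema": "sales",                         # hypothetical schema name
        "table": "orders",                         # hypothetical table name
        "column": "order_id",                      # hypothetical column name
        "colWarmUpType": "COL_WARM_UP_TYPE_DATA",  # data cache acceleration
        "priority": 5,                             # used by automated clean up
        "ttl": "PT720H",                           # assumed ISO 8601 duration
    }

    response = requests.post(
        "https://coordinator.example.com/api/warmup/rules",  # hypothetical path
        json=rule,
        auth=("admin", "secret"),  # placeholder credentials
        timeout=30,
    )
    response.raise_for_status()

The same payload shape would apply to the other acceleration types by substituting the COL_WARM_UP_TYPE_BASIC or COL_WARM_UP_TYPE_LUCENE values described below.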

Index acceleration #

Index acceleration uses the data in a specific column in a table to create an index. This index is added to the row group and used when queries access the column to filter rows. It accelerates queries that use predicates, joins, filters, and searches, and minimizes data scanning.

The index types (such as bitmap, tree, and others) are determined automatically by the column data types.

Use the COL_WARM_UP_TYPE_BASIC value in the colWarmUpType property to configure index acceleration for a specific column with the REST API.
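
For example, the following minimal sketch shows a query shape that index acceleration speeds up: both the join key and the filter predicate can be answered from row-group indexes. The connection details and the table and column names are hypothetical placeholders:

    # Hedged sketch using the Trino Python client; names are placeholders.
    import trino

    conn = trino.dbapi.connect(
        host="coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="sales",
    )
    cur = conn.cursor()

    # The filter on customer_id and the join key order_id can both be served
    # from indexes, minimizing the data scanned on each worker.
    cur.execute(
        "SELECT o.order_id, l.quantity "
        "FROM orders o JOIN lineitems l ON o.order_id = l.order_id "
        "WHERE o.customer_id = 4217"
    )
    rows = cur.fetchall()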

Text search acceleration #

Text search acceleration creates a tokenized index of the content of text columns that are used in query predicates. It accelerates queries that filter or search on text columns.

For example, a suitable use case is a search for a specific short string in a larger column such as a description.

The index uses the well-known Apache Lucene library.

Use the COL_WARM_UP_TYPE_LUCENE value in the colWarmUpType property to configure text search acceleration for a specific column with the REST API.
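
The following minimal sketch shows the description-search use case from above: a short string searched inside a larger text column. The connection details and the table and column names are hypothetical placeholders:

    # Hedged sketch using the Trino Python client; names are placeholders.
    import trino

    conn = trino.dbapi.connect(
        host="coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="retail",
    )
    cur = conn.cursor()

    # With text search acceleration on the description column, the substring
    # predicate is answered from the tokenized index instead of scanning the
    # full column data.
    cur.execute(
        "SELECT product_id FROM products "
        "WHERE description LIKE '%water resistant%'"
    )
    matches = cur.fetchall()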

Automated clean up #

When the cluster is about to run out of available storage space, index and cache elements are automatically deleted. As a user or administrator, you don't need to manage index and cache allocation. When the storage capacity threshold is exceeded, the system deletes the following content, in order, until the clean up threshold is reached:

  • All expired content based on the TTL value.
  • Content created as a result of default acceleration, starting with the lowest values of the priority property.
  • Content related to custom warmup rules for indexing and caching.

After a clean up, new data is indexed and cached as needed, based on the data accessed by the processed queries.
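
The deletion order above can be summarized in the following sketch. This is illustrative pseudocode of the documented behavior, not the actual implementation; the element attributes are hypothetical:

    # Illustrative pseudocode of the documented clean up order.
    def clean_up(elements, used_bytes, cleanup_threshold_bytes):
        # Deletion passes, in the documented order:
        # 1. all content whose TTL has expired,
        # 2. default-acceleration content, lowest priority first,
        # 3. content created by custom warmup rules.
        passes = [
            [e for e in elements if e.ttl_expired],
            sorted(
                (e for e in elements if e.from_default_acceleration),
                key=lambda e: e.priority,
            ),
            [e for e in elements if not e.from_default_acceleration],
        ]
        deleted = set()
        for candidates in passes:
            for e in candidates:
                if used_bytes <= cleanup_threshold_bytes:
                    return used_bytes
                if id(e) in deleted:
                    continue
                deleted.add(id(e))
                used_bytes -= e.size_bytes
        return used_bytes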

Configuration #

Smart indexing and caching is supported on Kubernetes-based clusters on EKS or AKS with specific configuration. It is suitable for catalogs using the Hive, Iceberg, or Delta Lake connector, and requires only minimal changes to the catalog properties file.

Refer to our detailed installation and configuration instructions to get started.

FAQ #

What happens if data is not cached and indexed? Do I get partial results?

No. If a split can be served from SSD, it is; if not, smart indexing and caching retrieves the data for that split from the object storage to complete the query and return correct results. The index and cache are then created asynchronously as appropriate, so that future queries can leverage them.

Is there a chance a user can get stale results?

No. During query processing, smart indexing and caching uses a mapping between the generated splits and the index and cache data on SSDs. If a split can be served from SSD, it is; if not, smart indexing and caching retrieves the data for that split from the object storage and then asynchronously indexes and caches it as appropriate.

What is the caching and indexing speed?

Performance depends on many factors. For example, indexing and caching the entire TPC-DS SF1000 dataset takes about 20 minutes on a cluster with two r5d.8xlarge workers.