Smart indexing and caching #

Starburst Enterprise includes Smart indexing and caching. You can take advantage of the performance improvements by updating your cluster to suitable hardware and configuring the Smart indexing and caching utility connector for any catalog accessing object storage with the Hive, Iceberg, or Delta Lake connector. A cluster deployment on Amazon Elastic Kubernetes Service (EKS), Microsoft Azure Kubernetes Service (AKS), or Google Kubernetes Engine (GKE) is required. For more information, see the installation and configuration instructions.

Overview #

Smart indexing and caching automatically creates and maintains indexes and caches with your queried data, determined by the characteristics of the processed queries.

Default acceleration #

When a query accesses a column that is not accelerated, the system performs data and index materialization on the cluster to accelerate future access to the data in the column. This process of creating the indexes and caches is also called warmup. Warmup is performed individually by each worker based on the processed splits and uses the local, high-performance storage of the worker, typically NVMe SSD drives.

When new data is added to the table, or while index and cache creation is still in progress, the portions of the table that are not yet accelerated are served from the object storage. After the asynchronous indexing and caching is complete, queries accessing that data are accelerated, because the data is available directly in the cluster from the indexes and caches and no longer has to be retrieved from the remote object storage.

This results in immediately improved performance for recently used datasets. In addition to the automatic default acceleration, advanced users can create specific warmup rules. The default acceleration has a lower priority than a user-created warmup rule.

Default acceleration is not performed for SELECT * FROM <table_name> queries that are commonly used to explore a table rather than to retrieve specific data.
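
For example, assuming a hypothetical table named orders (the table and column names are placeholders), the following query accesses specific columns with a predicate, so default acceleration warms up indexes and caches for the accessed data:

SELECT orderkey, totalprice
FROM orders
WHERE orderstatus = 'O'

In contrast, the following query only explores the table and does not trigger default acceleration:

SELECT *
FROM orders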

Acceleration types #

Smart indexing and caching uses different types of acceleration to improve query processing performance:

  • Data cache acceleration
  • Index acceleration
  • Text search acceleration

These acceleration types are used automatically by default acceleration, and can also be configured manually with warmup rules defined with the REST API.

Data cache acceleration #

Data cache acceleration caches the raw data objects from the object storage directly on the high-performance storage attached to the workers in the cluster. The data from one or more objects is processed in the cluster as splits. The data from the splits and the associated metadata are managed as a row group. These row groups are used to accelerate any queries that access the contained data, and are stored in a proprietary columnar block caching format.

Use the WARM_UP_TYPE_DATA value in the warmUpType property to configure data cache acceleration for a specific column with the REST API.

Index acceleration #

Index acceleration uses the data in a specific column in a table to create an index. This index is added to the row group and used when queries access a column to filter rows. It accelerates queries that use predicates, joins, filters, and searches, and minimizes data scanning.
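
For example, assuming a hypothetical table named orders with index acceleration on the custkey column (the names are placeholders), the index in each row group is used to filter the rows that match the predicate instead of scanning all of the data:

SELECT orderkey, totalprice
FROM orders
WHERE custkey = 12345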

The index types, such as bitmap, tree, and others, are determined automatically by the column data types.

Use the WARM_UP_TYPE_BASIC value in the warmUpType property to configure index acceleration for a specific column with the REST API.

Text search acceleration #

Text search acceleration creates a tokenized index of the content of text columns for use in query predicates. It accelerates queries that use predicates for filters and searches on text columns.

Smart indexing and caching automatically enables text search acceleration, and maintains the indexes.

Text search acceleration uses Apache Lucene indexing to accelerate text analytics and provide fast text filters, particularly with LIKE predicates. The KeywordAnalyzer provides full support for LIKE semantics to search for the exact appearance of a value in a filtered column.

A use case is a search for a specific short string in a larger column, such as a description. For example, consider a table with a column named city and a value of New York, United States. The index is case-sensitive. When indexing is applied to the column, the following query returns that record because the LIKE predicate is an exact match:

SELECT *
FROM tbl
WHERE city LIKE '%New York%'

The following queries do not return the record because the LIKE predicates are not an exact match. The first query is missing a space in the pattern:

SELECT *
FROM tbl
WHERE city LIKE '%NewYork%'

The second query uses lowercase:

SELECT *
FROM tbl
WHERE city LIKE '%new york%'

Text search acceleration indexing is recommended for:

  • Queries with LIKE predicates, prefix or suffix queries, or queries that use the starts_with function, as shown in the first example after this list.
  • Range queries on string columns. A common use is dates that are stored as strings and filtered with range predicates, such as date_string >= 'yyyy-mm-dd', as shown in the second example after this list.
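
Both examples assume a hypothetical table named events; the table and column names are placeholders. The first query applies a prefix search to a text column:

SELECT *
FROM events
WHERE starts_with(event_name, 'checkout')

The second query applies a range predicate to a date stored as a string:

SELECT *
FROM events
WHERE event_date >= '2023-01-01'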

Text search acceleration indexing supports the following data types:

  • CHAR
  • VARCHAR
  • CHAR ARRAY
  • VARCHAR ARRAY

Use the WARM_UP_TYPE_LUCENE value in the warmUpType property to configure text search acceleration for a specific column with the REST API.

Limitations:

  • The maximum supported string length is 50k characters.
  • Queries with nested expressions, such as starts_with(some_nested_method(col1), 'aaa'), are not accelerated.

Automated clean up #

When the available storage on the cluster is about to run out of space, index and cache elements are automatically deleted. As a user or administrator, you don’t need to manage index and cache allocation. When the storage capacity threshold is exceeded, the system deletes the following content until the clean up threshold is reached:

  • All expired content based on the TTL value.
  • Content created as a result of the default acceleration, starting with the lowest values of the priority property.
  • Content related to custom warmup rules for indexing and caching.

After a clean up, new data is indexed and cached as needed, based on the data accessed by the processed queries.

Configuration #

Smart indexing and caching is supported on Kubernetes-based clusters on EKS, AKS, or GKE with specific configuration. It is suitable for catalogs using the Hive, Iceberg, or Delta Lake connector, and requires only minimal configuration changes to the catalog properties file.

Refer to our detailed installation and configuration instructions to get started.

Cluster management #

Smart indexing and caching accommodates cluster expansion and contraction. Be aware of the following when scaling up or down:

  • When scaling a cluster horizontally (adding or removing worker nodes), Smart indexing and caching continues operating, assuming that requirements are properly fulfilled. A cluster restart is not required when adding or removing nodes.

  • Scaling a cluster vertically to use larger nodes requires a cluster restart, which replaces all worker nodes with nodes of the larger size.

  • After restarting the cluster, the default acceleration becomes active. New caches and indexes get created and populated based on the query workload. Any user-defined warmup rules are lost after restart, unless a database is configured for Smart indexing and caching.

FAQ #

What happens in case data is not cached and indexed? Am I getting partial results?

No. If a split can be served from the SSD, it is served; if not, Smart indexing and caching gets the data for this split from the object storage to complete the query and return the results. The index and cache are then created asynchronously as appropriate, so that future queries can leverage them.

Is there a chance a user can get stale results?

No. During query processing, Smart indexing and caching uses a mapping between the generated splits and the index and cache data on the SSDs. If a split can be served from the SSD, it is; if not, Smart indexing and caching gets the data for this split from the object storage and then asynchronously indexes and caches it as appropriate.

What is the caching and indexing speed?

Performance depends on many different factors. For example, indexing and caching the entire TPC-DS SF1000 dataset takes about 20 minutes on a cluster with two r5d.8xlarge workers.