Smart indexing and caching #
Starburst Enterprise includes Smart indexing and caching. You can take advantage of the performance improvements by updating your cluster to suitable hardware and configuring the Smart indexing and caching utility connector for any catalog accessing object storage with the Hive, Iceberg, or Delta Lake connector. A cluster deployment on Amazon Elastic Kubernetes Service (EKS), Microsoft Azure Kubernetes Service (AKS), or Google Kubernetes Engine (GKE) is required. For more information, see the installation and configuration instructions.
Smart indexing and caching automatically creates and maintains indexes and caches with your queried data, determined by the characteristics of the processed queries.
Default acceleration #
When a query accesses a column that is not accelerated, the system performs data and index materialization on the cluster to accelerate future access to the data in the column. This process of creating the indexes and caches is also called warmup. Warmup is performed individually by each worker based on the processed splits and uses the local high performance storage of the worker. Typically, these are SSD NVMe drives.
When new data is added to the table, or while index and cache creation is in progress, the portions of the table that are not yet accelerated are served from the object storage. After the asynchronous indexing and caching completes, query processing accessing that data is accelerated, because the data is available directly in the cluster from the indexes and caches and no longer has to be retrieved from the remote object storage.
This results in immediately improved performance for recently used datasets. In addition to the automatic default acceleration, advanced users can create specific warmup rules. The default acceleration has a lower priority than a user-created warmup rule.
Default acceleration is not performed for SELECT * FROM <table_name> queries, which are commonly used to explore a table rather than to retrieve specific data.
Acceleration types #
Smart indexing and caching uses different types of acceleration to improve query processing performance:
Data cache acceleration #
Data cache acceleration caches the raw data objects from the object storage directly on the high-performance storage attached to the workers in the cluster. The data from one or more objects is processed in the cluster as splits. The data from the splits and associated metadata are managed as a row group. These row groups are used to accelerate any queries that access the contained data, and are stored in a proprietary columnar block caching format.
Use the WARM_UP_TYPE_DATA value in the warmUpType property to configure data cache acceleration for a specific column with the REST API.
Index acceleration #
Index acceleration uses the data in a specific column in a table to create an index. This index is added to the row group and used when queries access a column to filter rows. It accelerates queries that use predicates, joins, filters, and searches, and minimizes data scanning.
The index types (such as bitmap, tree, and others), are determined automatically by the column data types.
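To illustrate why a column index minimizes data scanning, the following is a hypothetical sketch, not the actual Smart indexing and caching implementation: a simple min/max index per row group lets a filter skip entire row groups without reading their data.

```python
# Hypothetical sketch of index-based data skipping. The RowGroupIndex
# class and filter_equal function are illustrative assumptions, not
# the product's internal data structures.

class RowGroupIndex:
    """Min/max index over one column of a cached row group."""
    def __init__(self, values):
        self.values = values          # cached column data
        self.min = min(values)
        self.max = max(values)

    def might_contain(self, value):
        # The index answers "can this row group match?" without a scan.
        return self.min <= value <= self.max

def filter_equal(row_groups, value):
    """Scan only row groups whose index admits the predicate."""
    hits = []
    for rg in row_groups:
        if rg.might_contain(value):                          # cheap index check
            hits.extend(v for v in rg.values if v == value)  # expensive scan
    return hits

groups = [RowGroupIndex([1, 4, 7]), RowGroupIndex([10, 12, 15])]
print(filter_equal(groups, 12))   # the first row group is skipped entirely
```

Real indexes (bitmap, tree, and others) support richer predicates than equality, but the skipping principle is the same.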
Use the WARM_UP_TYPE_BASIC value in the warmUpType property to configure index acceleration for a specific column with the REST API.
Text search acceleration #
Text search acceleration creates a tokenized index of the content of text columns used in query predicates. It accelerates queries that filter or search on text columns.
Smart indexing and caching automatically enables text search acceleration, and maintains the indexes.
Text search acceleration uses Apache Lucene indexing to accelerate text analytics and provide fast text filters, such as LIKE predicates. The Lucene index provides full support for LIKE semantics to search for the exact appearance of a value in a filtered column.
A use case is a search for a specific short string in a larger column, such as a description. For example, consider a table with a column named city and a row containing the value New York, United States. The index is case-sensitive. When indexing is applied to the column, the following query returns that record because the LIKE predicate is an exact match:
SELECT * FROM tbl WHERE city LIKE '%New York%'
The following queries do not return the record because the patterns are not an exact match. The first query is missing a space in the pattern:
SELECT * FROM tbl WHERE city LIKE '%NewYork%'
The second query uses lowercase:
SELECT * FROM tbl WHERE city LIKE '%new york%'
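The exact-match, case-sensitive semantics of the three queries above can be mirrored with ordinary substring checks:

```python
# LIKE '%pattern%' on an indexed column is an exact, case-sensitive
# substring match; these checks mirror the three SQL queries above.
city = "New York, United States"

assert "New York" in city        # LIKE '%New York%' -> row returned
assert "NewYork" not in city     # missing space: no match
assert "new york" not in city    # lowercase: no match
```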
Text search acceleration indexing is recommended for:
- Queries with LIKE predicates, prefix or suffix queries, or queries that use the starts_with function.
- Range queries on string columns. A common use is dates that are stored as strings and filtered with range predicates.
Text search acceleration indexing supports text data types, such as VARCHAR.
Use the WARM_UP_TYPE_LUCENE value in the warmUpType property to configure text search acceleration for a specific column with the REST API.
- The maximum supported string length is 50,000 characters.
- Queries with nested expressions, such as starts_with(some_nested_method(col1), 'aaa'), are not accelerated.
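To show how the warmUpType property selects an acceleration type, here is a hypothetical payload builder. The WARM_UP_TYPE_DATA and WARM_UP_TYPE_LUCENE values come from this documentation, but the field names and overall payload shape are illustrative assumptions; consult the Smart indexing and caching REST API reference for the actual schema.

```python
# Hypothetical warmup-rule payload builder. The warmUpType values are
# documented; the "column" field and JSON layout are assumptions, not
# the verified REST schema.
import json

ALLOWED_WARMUP_TYPES = {"WARM_UP_TYPE_DATA", "WARM_UP_TYPE_LUCENE"}

def warmup_rule(column, warm_up_type):
    """Build a JSON payload selecting an acceleration type for a column."""
    if warm_up_type not in ALLOWED_WARMUP_TYPES:
        raise ValueError(f"unknown warmUpType: {warm_up_type}")
    return json.dumps({"column": column, "warmUpType": warm_up_type})

print(warmup_rule("city", "WARM_UP_TYPE_LUCENE"))
```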
Automated clean up #
When the cluster is about to run out of available storage space, index and cache elements are automatically deleted. As a user or administrator, you do not need to manage index and cache allocation. When the storage capacity threshold is exceeded, the system deletes the following content, in order, until the clean up threshold is reached:
- All expired content based on the TTL value.
- Content with the lowest values of the priority property that was created as a result of the default acceleration.
- Content related to custom warmup rules for indexing and caching.
After a clean up, new data is indexed and cached as needed based on the data access by the processed queries.
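The eviction order above can be sketched as a simple priority scheme. This is illustrative only; the element bookkeeping, thresholds, and field names are assumptions, not the actual implementation.

```python
# Illustrative sketch of the clean up order described above: expired
# content first, then low-priority default-acceleration content, then
# content from custom warmup rules. All field names are assumptions.
import time

def clean_up(elements, used, cleanup_target):
    """Delete elements in priority order until usage reaches the target.

    Each element is a dict with: size, expires_at (TTL deadline),
    priority, and from_default_acceleration (vs. a custom warmup rule).
    """
    now = time.time()
    order = sorted(
        elements,
        key=lambda e: (
            e["expires_at"] >= now,             # expired (False) sorts first
            not e["from_default_acceleration"], # default acceleration next
            e["priority"],                      # lowest priority first
        ),
    )
    kept = []
    for e in order:
        if used > cleanup_target:
            used -= e["size"]                   # delete this element
        else:
            kept.append(e)
    return kept, used
```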
Smart indexing and caching is supported on Kubernetes-based clusters on EKS, AKS, or GKE with specific configuration. It is suitable for catalogs using the Hive, Iceberg, or Delta Lake connector with minimal configuration changes for the catalog properties file.
Refer to our detailed installation and configuration instructions to get started.
Cluster management #
Smart indexing and caching accommodates cluster expansion and contraction. Be aware of the following when scaling up or down:
When scaling a cluster horizontally (adding or removing worker nodes), Smart indexing and caching continues operating, assuming that requirements are properly fulfilled. A cluster restart is not required when adding or removing nodes.
Scaling a cluster vertically to use larger nodes requires a cluster restart, which replaces all worker nodes with nodes of the larger size.
After restarting the cluster, the default acceleration becomes active. New caches and indexes get created and populated based on the query workload. Any user-defined warmup rules are lost after restart, unless a database is configured for Smart indexing and caching.
What happens if data is not cached and indexed? Do I get partial results?
No. If a split can be served from SSD, it is served from SSD; if not, Smart indexing and caching gets the data for this split from the object storage to complete the query and returns the results. The index and cache are then created asynchronously as appropriate, so that future queries can leverage them.
Is there a chance a user can get stale results?
No. Smart indexing and caching uses a mapping between the generated splits and the index and cache data on SSDs during query processing. If a split can be served from SSD, it is; if not, Smart indexing and caching gets the data for this split from the object storage and then asynchronously indexes and caches it as appropriate.
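The split-serving behavior described in these answers can be sketched as a simplified model. All names here are illustrative; the real system manages row groups and indexes, not a plain dictionary.

```python
# Simplified model of split serving: cached splits are served from
# local SSD; misses are read from object storage and then indexed and
# cached asynchronously. Names are illustrative, not the actual API.
import threading

ssd_cache = {}                       # split id -> cached row group data
warmup_threads = []                  # track async warmups (demo only)

def fetch_from_object_storage(split_id):
    return f"data-for-{split_id}"    # stand-in for a remote read

def warm_up(split_id, data):
    ssd_cache[split_id] = data       # build the index/cache for later queries

def serve_split(split_id):
    """Serve a split from SSD when cached, otherwise from object storage."""
    if split_id in ssd_cache:        # hit: served directly on the cluster
        return ssd_cache[split_id]
    data = fetch_from_object_storage(split_id)   # miss: complete, fresh data
    t = threading.Thread(target=warm_up, args=(split_id, data))
    t.start()                        # indexing/caching happens asynchronously
    warmup_threads.append(t)
    return data
```

Either path returns complete results; only the data source (SSD versus object storage) differs.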
What is the caching and indexing speed?
Performance depends on many factors. For example, indexing and caching the entire TPC-DS SF1000 dataset takes about 20 minutes on a cluster with two workers.