Starburst Galaxy includes Starburst Warp Speed. With Starburst Warp Speed you can use accelerated clusters to leverage smart indexing and caching. Starburst Warp Speed automatically creates and maintains these indexes and caches based on the characteristics of the processed queries. The index and cache data is stored on local storage attached to each worker node in the cluster. Because the data is available directly in the cluster and no longer must be retrieved from remote object storage, query processing is accelerated when accessing the same data.
You can take advantage of the performance improvements by updating your cluster to an accelerated cluster to access data in Amazon S3, Google Cloud Storage, Azure Data Lake Storage, or Apache Polaris catalogs.
The following sections detail how Starburst Warp Speed operates, and provides further information.
When you access data that has not yet been used in a query, the system performs data and index materialization on the cluster to accelerate future access to that data. This process of creating the indexes and caches is also called warmup. Warmup is performed individually by each worker based on the processed splits and uses the local, high-performance storage of the worker. Typically, these are NVMe SSD drives.
When new data is added to a table or when index and cache creation is in progress, new portions of the table that have not been warmed are served from the object storage. After indexing and caching is complete, query processing is accelerated. Data is accessed directly from the cluster’s indexes and cache instead of remote object storage.
This results in immediately improved performance for recently used datasets.
Starburst Warp Speed uses different types of acceleration to improve query processing performance, as described in the following sections.
Data cache acceleration is the system that caches the raw data objects from the object storage directly on the high-performance storage attached to the workers in the cluster. The data from one or more objects is processed in the cluster as splits. The data from the splits and associated metadata are managed as a row group. These row groups are used to accelerate any queries that access the contained data. The row groups are stored in a proprietary columnar block caching format.
Index acceleration uses the data in a specific column in a table to create an index. This index is added to the row group and used when queries access a column to filter rows. It accelerates queries that use predicates, joins, filters, and searches, and minimizes data scanning.
The index types (such as bitmap, tree, and others), are determined automatically by the column data types.
Starburst Warp Speed automatically enables text search acceleration. Text search
acceleration uses Apache Lucene to create an index
of the content of text columns. This index is used in query predicates and
accelerates queries that use predicates of filters and searches on text columns.
Lucene indexing accelerates text analytics and provides fast text filters,
particularly with LIKE
predicates. The
KeywordAnalyzer
provides full support for LIKE
semantics to search for the exact appearance of
a value in a filtered column.
A use case is a search for a specific short string in a larger column, such as a
description. For example, consider a table with a column named city
and a
value New York, United States
. The index is case-sensitive. When indexing is
applied to the column, the following query returns that record because the
LIKE
predicate is an exact match:
SELECT *
FROM tbl
WHERE city LIKE '%New York%'
The following queries do not return the results because the LIKE
predicates
are not an exact match. The first query is missing a space in the pattern:
SELECT *
FROM tbl
WHERE city LIKE '%NewYork%'
The second query uses lowercase:
SELECT *
FROM tbl
WHERE city LIKE '%new york%'
Text search acceleration indexing is recommended for:
LIKE
predicates, prefix or suffix queries, or queries that use
the starts_with functions.date_string>='yyyy-mm-dd'
.Text search acceleration indexing supports the following data types:
CHAR
VARCHAR
CHAR ARRAY
VARCHAR ARRAY
Limitations:
starts_with(some_nested_method(col1), 'aaa')
, are not accelerated.Starburst Warp Speed optionally provides fast warmup, formerly known as index and cache resiliency. When a new index is created or data is cached, it is stored on the NVMe SSDs attached to each worker node by default, and in addition on a dedicated, shared bucket in your object storage. When you scale the cluster, the indexes and data cache remain available in the shared storage.
The fast warmup feature provisions additional worker nodes to a cluster as it scales up. When a cluster needs to warm an index or data cache, it first accesses the designated object storage bucket to check whether the indexes or cache are ready, and loads them accordingly. If the indexes or cache are not available in the shared object storage or cannot be loaded for any reason, the data is warmed as usual.
The bucket used to back up the index and cache data must be hosted in the same region as the cluster.
To use the fast warmup feature to back up your index and data cache in Amazon S3, follow these steps:
After completing these steps, index and cache elements are created in the backup
location when you run a query. When you run the same query on a suspended
or
not enabled
cluster, warmup time is reduced.
To set a backup location in your object storage for index or data caches, enter a Bucket name and a Directory name within the bucket where the data is stored.
To use Starburst Warp Speed fast warmup, you must include read/write permissions to the backup location in object storage. The following shows a privilege example for read/write access to S3:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "s3ReadWrite",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketPolicy",
"s3:GetObject",
"s3:GetObjectAttributes",
"s3:PutObject",
"s3:DeleteObject",
"s3:AbortMultipartUpload"
],
"Resource": [
"arn:aws:s3:::<backup-location-bucket>/*",
"arn:aws:s3:::<backup-location-bucket>",
"arn:aws:s3:::<data-bucket>/*",
"arn:aws:s3:::<data-bucket>/*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"s3:ListAllMyBuckets",
"glue:*"
],
"Resource": "*"
}
]
}
After you have configured a backup location, click toggle_on Fast warmup to read and write index and cache elements in object storage. You can enable the feature when creating a new catalog or when editing a catalog on an existing accelerated cluster.
To control the cost associated with storing the index and cache data on your object storage, we recommend setting a lifecycle policy to delete expired elements in the backup location. Experiment with expiration settings based on your workload patterns.
You can view index and cache usage metrics about query performance and resource utilization in the General tab of the query details pane.
For more information, see Index and cache usage on the Query details pane.
When the available storage on the cluster is about to run out of storage space, index and cache elements are automatically deleted. As a user or administrator, you don’t need to manage index and cache allocation.
After a clean up, new data is indexed and cached as needed based on the data access by the processed queries.
Starburst Warp Speed supports all SQL statements and functions that Great Lakes connectivity supports. SQL support differs based on the table format you are using. For more information, see Great Lakes connectivity.
Starburst Warp Speed supports all data types, including structural data types.
All structural data types are accessible, but indexing is only applicable to
fields within ROW
data types.
For some functions, Starburst Warp Speed does not accelerate filtering operations on columns. For example, this filtering operation is not accelerated:
SELECT count(*)
FROM tbl
WHERE lower(company) = 'starburst'
Starburst Warp Speed indexing accelerates the following functions when used on the left or the right side of the predicate:
ceil(x)
with real
and double
data typesin_nan(x)
with real
and double
data typescast(x as type)
with double
cast to real
, or any type cast to
varchar
cast(x as type)
with double
and decimal
data typesday(d)
and day_of_month(d)
with date
and timestamp
data typesday_of_year(d)
and doy(y)
with date
and timestamp
data typesday_of_week(d)
and dow(d)
with date
and timestamp
data typesyear(d)
with date
and timestamp
data typesyear_of_week(d)
and yow(d)
with date
and timestamp
data typesweek(d)
and week_of_year(d)
with date
and timestamp
data typeslike
and not like
with varchar
data typecontains(arr_varchar, value)
with array of varchar
data typesThe maximum supported string length for any cached data type is 48000 characters.
There are many reasons to choose an Accelerated cluster. However, Accelerated clusters are not suitable for all use cases. Here are some things to consider:
What happens in case data is not cached and indexed? Am I getting partial results?
No. In case a split can be served from SSD, it is served; but if not, Starburst Warp Speed gets the data for this split from the object storage to complete the query and sends back the results. Then the index and cache are created asynchronously, based on priority and available SSD storage, so that future queries can leverage the index and cache.
Is there a chance a user can get stale results?
No. Starburst Warp Speed uses a mapping between the generated splits and index and cache data on SSDs during query processing. If a split can be served from SSD, it is; but if not, Starburst Warp Speed gets the data for this split from the object storage and then asynchronously indexes and caches it as appropriate.
Is the information on this page helpful?
Yes
No