Starburst Warp Speed #
Starburst Galaxy includes Starburst Warp Speed. With Starburst Warp Speed you can use accelerated clusters to leverage smart indexing and caching. Starburst Warp Speed automatically creates and maintains these indexes and caches based on the characteristics of the processed queries. The index and cache data is stored on local storage attached to each worker node in the cluster. Because the data is available directly in the cluster and no longer must be retrieved from remote object storage, query processing is accelerated when accessing the same data.
You can take advantage of the performance improvements by updating your cluster to an accelerated cluster to access data in Amazon S3 or Tabular catalogs.
The following sections detail how Starburst Warp Speed operates, and provides further information.
Default acceleration #
When a query accesses data that is not accelerated, the system performs data and index materialization on the cluster to accelerate future access to that data. This process of creating the indexes and caches is also called warmup. Warmup is performed individually by each worker based on the processed splits and uses the local high performance storage of the worker. Typically, these are SSD NVMe drives.
When new data is added to a table or the index and cache creation are in progress, the new portions of the table that are not accelerated are served from the object storage. After the asynchronous indexing and caching is complete, query processing is accelerated when accessing the same data, because the data is available directly in the cluster from the indexes and caches, and no longer has to be retrieved from the remote object storage.
This results in immediately improved performance for recently used datasets.
Default acceleration is not performed for SELECT * FROM <table_name>
queries
that are commonly used to explore a table rather than to retrieve specific
data.
Acceleration types #
Starburst Warp Speed uses different types of acceleration to improve query processing performance, as described in the following sections.
Data cache acceleration #
Data cache acceleration is the system that caches the raw data objects from the object storage directly on the high-performance storage attached to the workers in the cluster. The data from one or more objects is processed in the cluster as splits. The data from the splits and associated metadata are managed as a row group. These row groups are used to accelerate any queries that access the contained data. The row groups are stored in a proprietary columnar block caching format.
Index acceleration #
Index acceleration uses the data in a specific column in a table to create an index. This index is added to the row group and used when queries access a column to filter rows. It accelerates queries that use predicates, joins, filters, and searches, and minimizes data scanning.
The index types (such as bitmap, tree, and others), are determined automatically by the column data types.
Text search acceleration #
Starburst Warp Speed automatically enables text search acceleration. Text search
acceleration uses Apache Lucene to create an index
of the content of text columns. This index is used in query predicates and
accelerates queries that use predicates of filters and searches on text columns.
Lucene indexing accelerates text analytics and provides fast text filters,
particularly with LIKE
predicates. The
KeywordAnalyzer
provides full support for LIKE
semantics to search for the exact appearance of
a value in a filtered column.
A use case is a search for a specific short string in a larger column, such as a
description. For example, consider a table with a column named city
and a
value New York, United States
. The index is case-sensitive. When indexing is
applied to the column, the following query returns that record because the
LIKE
predicate is an exact match:
SELECT *
FROM tbl
WHERE city LIKE '%New York%'
The following queries do not return the results because the LIKE
predicates
are not an exact match. The first query is missing a space in the pattern:
SELECT *
FROM tbl
WHERE city LIKE '%NewYork%'
The second query uses lowercase:
SELECT *
FROM tbl
WHERE city LIKE '%new york%'
Text search acceleration indexing is recommended for:
- Queries with
LIKE
predicates, prefix or suffix queries, or queries that use the starts_with functions. - Range queries on string columns. A common use is dates that are stored as
strings that have range predicates. For example,
date_string>='yyyy-mm-dd'
.
Text search acceleration indexing supports the following data types:
CHAR
VARCHAR
CHAR ARRAY
VARCHAR ARRAY
Limitations:
- The maximum supported string length is 33k characters.
- Queries with nested expressions, such as
starts_with(some_nested_method(col1), 'aaa')
, are not accelerated.
Index and cache usage #
You can view index and cache usage metrics about query performance and resource utilization in the General tab of the query details pane.
For more information, see Index and cache usage on the Query details pane.
Automated clean up #
When the available storage on the cluster is about to run out of storage space, index and cache elements are automatically deleted. As a user or administrator, you don’t need to manage index and cache allocation.
After a clean up, new data is indexed and cached as needed based on the data access by the processed queries.
SQL support #
Starburst Warp Speed supports all SQL statements and functions that Great Lakes connectivity supports. SQL support differs based on the table format you are using. For more information, see Great Lakes connectivity.
Starburst Warp Speed supports all data types, including structural data types.
All structural data types are accessible, but indexing is only applicable to
fields within ROW
data types.
For some functions, Starburst Warp Speed does not accelerate filtering operations on columns. For example, this filtering operation is not accelerated:
SELECT count(*)
FROM tbl
WHERE lower(company) = 'starburst'
Starburst Warp Speed indexing accelerates the following functions when used on the left or the right side of the predicate:
ceil(x)
withreal
anddouble
data typesin_nan(x)
withreal
anddouble
data typescast(x as type)
withdouble
cast toreal
, or any type cast tovarchar
cast(x as type)
withdouble
anddecimal
data typesday(d)
andday_of_month(d)
withdate
andtimestamp
data typesday_of_year(d)
anddoy(y)
withdate
andtimestamp
data typesday_of_week(d)
anddow(d)
withdate
andtimestamp
data typesyear(d)
withdate
andtimestamp
data typesyear_of_week(d)
andyow(d)
withdate
andtimestamp
data typesweek(d)
andweek_of_year(d)
withdate
andtimestamp
data typeslike
andnot like
withvarchar
data typecontains(arr_varchar, value)
with array ofvarchar
data types
The maximum supported string length for any cached data type is 48000 characters.
When to choose Accelerated #
There are many reasons to choose an Accelerated cluster. However, Accelerated clusters are not suitable for all use cases. Here are some things to consider:
- Accelerated clusters are only available on AWS. They are not available on Azure or Google Cloud.
- Starburst Warp Speed only supports Auto suspend times of 1 hour and Never suspend.
- Starburst Warp Speed does not support cluster autoscaling.
- Clusters hosted on AWS EC2 instances using Graviton3 processors with SSD storage are only available in four AWS regions. For more information, see the AWS news blog.
FAQ #
What happens in case data is not cached and indexed? Am I getting partial results?
No. In case a split can be served from SSD, it is served; but if not, Starburst Warp Speed gets the data for this split from the object storage to complete the query and sends back the results. Then the index and cache are created asynchronously, based on priority and available SSD storage, so that future queries can leverage the index and cache.
Is there a chance a user can get stale results?
No. Starburst Warp Speed uses a mapping between the generated splits and index and cache data on SSDs during query processing. If a split can be served from SSD, it is; but if not, Starburst Warp Speed gets the data for this split from the object storage and then asynchronously indexes and caches it as appropriate.
Is the information on this page helpful?
Yes
No
Is the information on this page helpful?
Yes
No