Starburst Galaxy

  •  Get started

  •  Working with data

  •  Data engineering

  •  Developer tools

  •  Cluster administration

  •  Troubleshooting

  • Galaxy status

  •  Reference
  • Warp Speed-enabled clusters #

    Starburst Galaxy includes Starburst Warp Speed. With Starburst Warp Speed you can use accelerated clusters to leverage smart indexing and caching. Starburst Warp Speed automatically creates and maintains these indexes and caches based on the characteristics of the processed queries. The index and cache data is stored on local storage attached to each worker node in the cluster. Because the data is available directly in the cluster and no longer must be retrieved from remote object storage, query processing is accelerated when accessing the same data.

    You can take advantage of the performance improvements by updating your cluster to an accelerated cluster to access data in Amazon S3 or Tabular catalogs.

    The following sections detail how Starburst Warp Speed operates, and provides further information.

    Default acceleration #

    When a query accesses data that is not accelerated, the system performs data and index materialization on the cluster to accelerate future access to that data. This process of creating the indexes and caches is also called warmup. Warmup is performed individually by each worker based on the processed splits and uses the local high performance storage of the worker. Typically, these are NVMe SSD drives.

    When new data is added to a table or the index and cache creation are in progress, the new portions of the table that are not accelerated are served from the object storage. After the asynchronous indexing and caching is complete, query processing is accelerated when accessing the same data, because the data is available directly in the cluster from the indexes and caches, and no longer has to be retrieved from the remote object storage.

    This results in immediately improved performance for recently used datasets.

    Default acceleration is not performed for SELECT * FROM <table_name> queries that are commonly used to explore a table rather than to retrieve specific data.

    Acceleration types #

    Starburst Warp Speed uses different types of acceleration to improve query processing performance, as described in the following sections.

    Data cache acceleration #

    Data cache acceleration is the system that caches the raw data objects from the object storage directly on the high-performance storage attached to the workers in the cluster. The data from one or more objects is processed in the cluster as splits. The data from the splits and associated metadata are managed as a row group. These row groups are used to accelerate any queries that access the contained data. The row groups are stored in a proprietary columnar block caching format.

    Index acceleration #

    Index acceleration uses the data in a specific column in a table to create an index. This index is added to the row group and used when queries access a column to filter rows. It accelerates queries that use predicates, joins, filters, and searches, and minimizes data scanning.

    The index types (such as bitmap, tree, and others), are determined automatically by the column data types.

    Text search acceleration #

    Starburst Warp Speed automatically enables text search acceleration. Text search acceleration uses Apache Lucene to create an index of the content of text columns. This index is used in query predicates and accelerates queries that use predicates of filters and searches on text columns. Lucene indexing accelerates text analytics and provides fast text filters, particularly with LIKE predicates. The KeywordAnalyzer provides full support for LIKE semantics to search for the exact appearance of a value in a filtered column.

    A use case is a search for a specific short string in a larger column, such as a description. For example, consider a table with a column named city and a value New York, United States. The index is case-sensitive. When indexing is applied to the column, the following query returns that record because the LIKE predicate is an exact match:

    SELECT *
    FROM tbl
    WHERE city LIKE '%New York%'
    

    The following queries do not return the results because the LIKE predicates are not an exact match. The first query is missing a space in the pattern:

    SELECT *
    FROM tbl
    WHERE city LIKE '%NewYork%'
    

    The second query uses lowercase:

    SELECT *
    FROM tbl
    WHERE city LIKE '%new york%'
    

    Text search acceleration indexing is recommended for:

    • Queries with LIKE predicates, prefix or suffix queries, or queries that use the starts_with functions.
    • Range queries on string columns. A common use is dates that are stored as strings that have range predicates. For example, date_string>='yyyy-mm-dd'.

    Text search acceleration indexing supports the following data types:

    • CHAR
    • VARCHAR
    • CHAR ARRAY
    • VARCHAR ARRAY

    Limitations:

    • The maximum supported string length is 33k characters.
    • Queries with nested expressions, such as starts_with(some_nested_method(col1), 'aaa'), are not accelerated.

    Index and cache resiliency #

    Starburst Warp Speed optionally provides index and cache resiliency. When a new index is created or data is cached, it is stored on the NVMe (nonvolatile memory express) SSDs attached to each worker node by default, and in addition on a dedicated, shared bucket in your object storage. When you scale the cluster, the indexes and data cache remain available in the shared storage.

    The index and cache resiliency feature enables fast warmup when scaling up by adding worker nodes to a cluster. When a cluster needs to warm an index or data cache, it first accesses the designated object storage bucket to check whether the indexes or cache are ready, and loads them accordingly. If the indexes or cache are not available in the shared object storage or cannot be loaded for any reason, the data is warmed as usual.

    Index and cache resiliency is available for AWS clusters in the us-east-1 and us-west-2 regions. The bucket used to back up the index and cache must be hosted in the same region as the cluster.

    To use the resiliency feature to back up your index and data cache in Amazon S3, follow these steps:

    1. Incorporate a backup location into your defined AWS S3 catalog.
    2. Adjust the AWS privileges to allow writing and reading from the backup location.
    3. Enable the resiliency feature on the catalog. Click Index and cache resiliency to enable resiliency when creating a new catalog or editing an existing catalog on an accelerated cluster.
    4. Optionally set an S3 lifecycle policy to delete expired index or data cache elements in the backup location.

    After completing these steps, index and cache elements are created in the backup location when you run a query. When you run the same query on a suspended or not enabled cluster, warmup time is reduced.

    Set up a backup location #

    To set a backup location in your object storage for index or data caches, enter a Bucket name and a Directory name within the bucket where the data is stored.

    backup location

    Edit AWS privileges #

    To use Starburst Warp Speed index and cache resiliency, you must include read/write permissions to the backup location in object storage. The following shows a privilege example for read/write access to S3:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "s3ReadWrite",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketPolicy",
                    "s3:GetObject",
                    "s3:GetObjectAttributes",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:AbortMultipartUpload"
                ],
                "Resource": [
                    "arn:aws:s3:::<backup-location-bucket>/*",
                    "arn:aws:s3:::<backup-location-bucket>",
                    "arn:aws:s3:::<data-bucket>/*",
                    "arn:aws:s3:::<data-bucket>/*"
                ]
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                  "s3:ListAllMyBuckets",
                  "glue:*"
                ],
                "Resource": "*"
            }
        ]
    }
    

    Enable the feature #

    After you have configured a backup location, click Index and cache resiliency to read and write index and cache elements in object storage. You can enable the feature when creating a new catalog or when editing a catalog on an existing accelerated cluster.

    Lifecycle Policies #

    To control the cost associated with storing the index and cache data on your object storage, we recommend setting a lifecycle policy to delete expired elements in the backup location. Experiment with expiration settings based on your workload patterns.

    Index and cache usage #

    You can view index and cache usage metrics about query performance and resource utilization in the General tab of the query details pane.

    For more information, see Index and cache usage on the Query details pane.

    Automated clean up #

    When the available storage on the cluster is about to run out of storage space, index and cache elements are automatically deleted. As a user or administrator, you don’t need to manage index and cache allocation.

    After a clean up, new data is indexed and cached as needed based on the data access by the processed queries.

    SQL support #

    Starburst Warp Speed supports all SQL statements and functions that Great Lakes connectivity supports. SQL support differs based on the table format you are using. For more information, see Great Lakes connectivity.

    Starburst Warp Speed supports all data types, including structural data types. All structural data types are accessible, but indexing is only applicable to fields within ROW data types.

    For some functions, Starburst Warp Speed does not accelerate filtering operations on columns. For example, this filtering operation is not accelerated:

    SELECT count(*)
    FROM tbl
    WHERE lower(company) = 'starburst'
    

    Starburst Warp Speed indexing accelerates the following functions when used on the left or the right side of the predicate:

    • ceil(x) with real and double data types
    • in_nan(x) with real and double data types
    • cast(x as type) with double cast to real, or any type cast to varchar
    • cast(x as type) with double and decimal data types
    • day(d) and day_of_month(d) with date and timestamp data types
    • day_of_year(d) and doy(y) with date and timestamp data types
    • day_of_week(d) and dow(d) with date and timestamp data types
    • year(d) with date and timestamp data types
    • year_of_week(d) and yow(d) with date and timestamp data types
    • week(d) and week_of_year(d) with date and timestamp data types
    • like and not like with varchar data type
    • contains(arr_varchar, value) with array of varchar data types

    The maximum supported string length for any cached data type is 48000 characters.

    When to choose Accelerated #

    There are many reasons to choose an Accelerated cluster. However, Accelerated clusters are not suitable for all use cases. Here are some things to consider:

    • Accelerated clusters are only available on AWS. They are not available on Azure or Google Cloud.
    • Starburst Warp Speed only supports Auto suspend times of 1 hour and Never suspend.
    • Starburst Warp Speed does not support cluster autoscaling.
    • Clusters hosted on AWS EC2 instances using Graviton3 processors with SSD storage are only available in four AWS regions. For more information, see the AWS news blog.

    FAQ #

    What happens in case data is not cached and indexed? Am I getting partial results?

    No. In case a split can be served from SSD, it is served; but if not, Starburst Warp Speed gets the data for this split from the object storage to complete the query and sends back the results. Then the index and cache are created asynchronously, based on priority and available SSD storage, so that future queries can leverage the index and cache.

    Is there a chance a user can get stale results?

    No. Starburst Warp Speed uses a mapping between the generated splits and index and cache data on SSDs during query processing. If a split can be served from SSD, it is; but if not, Starburst Warp Speed gets the data for this split from the object storage and then asynchronously indexes and caches it as appropriate.