Starburst Delta Lake connector#

The Starburst Delta Lake connector is an extended version of the Delta Lake connector with identical configuration and usage.

Requirements#

To connect to Databricks Delta Lake, you need:

Extensions#

The connector includes all the functionality described in the Delta Lake connector as well as the features and integrations detailed in the following section:

Unity catalog#

The connector supports reading from managed (internal) and unmanaged (external) Delta Lake tables when using the Databricks Unity Catalog as a metastore on AWS or Azure.

Note

The Databricks Unity Catalog metastore is available for Delta Lake as a public preview. Reading from views is not supported when using Databricks Unity Catalog as a metastore. Contact Starburst Support with questions or feedback.

To use the Unity Catalog metastore, add the following configuration properties to your catalog configuration file:

hive.metastore=unity
delta.security=read_only
delta.metastore.unity.host=<unity catalog hostname>
delta.metastore.unity.access-token=<token>

The following table shows the configuration properties used to connect SEP to Unity Catalog as a metastore.

Unity configuration properties#

Property name                        | Description
delta.metastore.unity.host          | Name of the host, without the http(s) prefix, for example: dbc-28c47e62-60b6.cloud.databricks.com
delta.metastore.unity.access-token  | The token used to authenticate a connection to the Unity Catalog metastore. For more information about generating access tokens, see the Databricks documentation.
delta.metastore.unity.catalog-name  | (Optional) Name of the catalog in Databricks. Defaults to main.
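
Combining these properties, a complete catalog properties file for a Unity Catalog-backed catalog might look like the following sketch. The hostname reuses the example from the table above; the token value is a placeholder you must replace with your own:

    hive.metastore=unity
    delta.security=read_only
    delta.metastore.unity.host=dbc-28c47e62-60b6.cloud.databricks.com
    delta.metastore.unity.access-token=<token>
    delta.metastore.unity.catalog-name=main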

Enable OAuth 2.0 token pass-through#

The Unity Catalog supports OAuth 2.0 token pass-through.

To enable OAuth 2.0 token pass-through:

  1. Add the following configuration properties to the config.properties file on the coordinator:

    http-server.authentication.type=DELEGATED-OAUTH2
    web-ui.authentication.type=DELEGATED-OAUTH2
    http-server.authentication.oauth2.scopes=<AzureDatabricks-ApplicationID>/.default,openid
    http-server.authentication.oauth2.additional-audiences=<AzureDatabricks-ApplicationID>
    

Replace <AzureDatabricks-ApplicationID> with the Application ID for your Azure Databricks Microsoft Application, which you can find in the Azure portal under Enterprise applications.

  2. Add only the following configuration properties to the delta.properties catalog configuration file:

    delta.metastore.unity.authentication-type=OAUTH2_PASSTHROUGH
    delta.security=unity
    hive.metastore-cache-ttl=0s
    

Limitations:

  • Credential passthrough is only supported with Azure Databricks, and only when Microsoft Entra ID is the identity provider (IdP).

  • When credential passthrough is enabled, you cannot use Hive passthrough.

Location alias mapping#

If you use Unity Catalog as a metastore to access external tables, the Starburst Delta Lake connector supports using a bucket-style alias for your Amazon S3 bucket access point.

To enable location alias mapping:

  1. Create a bucket alias mapping file in JSON format:

{
  "bucket_name_1": "bucket_alias_1",
  "bucket_name_2": "bucket_alias_2"
}
  2. Add the following properties to your catalog configuration:

location-alias.provider-type=file
location-alias.mapping.file.path=/path_to_bucket_alias_mapping_file
  3. Optionally, use location-alias.mapping.file.expiration-time to specify the interval at which SEP rereads the bucket alias mapping file. The default is 1m.

SEP uses the new external location path specified in the bucket alias mapping file to access the data. Only the bucket name is replaced. The URI is otherwise unchanged.
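
As an illustration, with the mapping file above, a table whose external location is registered in Unity Catalog under bucket_name_1 is accessed through the aliased bucket instead. The object path shown here is a hypothetical example; only the bucket name changes:

    s3://bucket_name_1/warehouse/sales/part-00000.parquet
    becomes
    s3://bucket_alias_1/warehouse/sales/part-00000.parquet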

SQL support#

The connector supports all of the SQL statements listed in the Delta Lake connector documentation.

The following improvements are included:

SQL security#

You must set the delta.security property in your catalog properties file to sql-standard to use SQL security operation statements. See SQL standard based authorization for more information.
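
For example, with delta.security=sql-standard set in a catalog named example, the standard SQL authorization statements become available. The catalog, schema, table, and role names below are illustrative placeholders:

    GRANT SELECT ON example.sales.orders TO ROLE analyst;
    REVOKE SELECT ON example.sales.orders FROM ROLE analyst;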

Replacing tables#

The connector supports replacing a table as an atomic operation. Atomic table replacement creates a new snapshot with the new table definition (see CREATE TABLE and CREATE TABLE AS), but keeps table history.

The new table after replacement is completely new and separate from the old table. Only the name of the table remains identical.

For example, a partitioned table my_table can be replaced with a completely new definition:

CREATE TABLE my_table (
    a BIGINT,
    b DATE,
    c BIGINT)
WITH (partitioned_by = ARRAY['a']);

CREATE OR REPLACE TABLE my_table
WITH (partitioned_by = ARRAY['b'])
AS SELECT * FROM another_table;

Table replacement in the Starburst Delta Lake connector has the following limitations:

  • Table replacement does not work on append-only Delta Lake tables.

  • Table replacement does not work for tables with the change_data_feed_enabled property set to true.

  • Table replacement does not work if the new table after replacement has the change_data_feed_enabled property set to true.

  • Table replacement does not work if the location specified in the location table property differs from the location of the existing table.

  • Table types must stay the same. For example, table replacement cannot be used to replace a managed table with an external table.

Performance#

The connector includes a number of performance improvements, detailed in the following sections:

Dynamic row filtering#

Dynamic filtering, including dynamic row filtering, is enabled by default. Row filtering improves the effectiveness of dynamic filtering by using dynamic filters to remove unnecessary rows during a table scan. It is especially effective for selective filters on columns that are not used for partitioning or bucketing, or whose values do not naturally appear in any clustered order.

As a result, the amount of data read from storage and transferred across the network is further reduced, yielding higher query performance at a lower cost.

You can use the following properties to configure dynamic row filtering:

Dynamic row filtering properties#

Property name                                | Description
dynamic-row-filtering.enabled                | Toggle dynamic row filtering. Defaults to true. The equivalent catalog session property is dynamic_row_filtering_enabled.
dynamic-row-filtering.selectivity-threshold  | Control the threshold for the fraction of rows selected from the overall table above which dynamic row filters are not used. Defaults to 0.7. The equivalent catalog session property is dynamic_row_filtering_selectivity_threshold.
dynamic-row-filtering.wait-timeout           | Duration to wait for completion of dynamic row filtering. Defaults to 0, which causes query processing to proceed without waiting for the dynamic row filter; the filter is collected asynchronously and used as soon as it becomes available. The equivalent catalog session property is dynamic_row_filtering_wait_timeout.
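
These settings can also be adjusted per query with the catalog session properties named above. For example, to disable dynamic row filtering for the current session in a catalog named example (the catalog name is an illustrative placeholder):

    SET SESSION example.dynamic_row_filtering_enabled = false;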

Starburst Cached Views#

The connector supports table scan redirection to improve performance and reduce load on the data source.

Security#

The connector includes a number of security-related features, detailed in the following sections.

Authorization#

The connector supports standard Hive security for authorization under the delta.security configuration property. For more information, see the Delta Lake connector authorization configuration options.

Built-in access control#

If you have enabled built-in access control for SEP, you must add the following configuration to all Delta Lake catalogs:

delta.security=starburst