Starburst Delta Lake connector#
The Delta Lake connector allows querying data stored in Delta Lake format, including Databricks Delta Lake. It can natively read the Delta transaction log and thus detect when external systems change data.
Note
This connector is currently available as a beta release for use with Azure Storage. Google Cloud Storage is not yet supported. Deployments using AWS or HDFS are fully supported. Work with the Starburst support team if you are planning to use this connector in production.
Note
The connector requires a valid Starburst Enterprise license.
Configuration#
The connector requires a Hive metastore for table metadata and supports the same
metastore configuration properties as the Hive connector. At a minimum, hive.metastore.uri
must be configured.
The connector recognizes Delta tables created in the metastore by the Databricks runtime. If non-Delta tables are also present in the metastore, they are not visible to the connector.
To configure the Delta Lake connector, create a catalog file, for example etc/catalog/delta.properties, that references the delta-lake connector. Update hive.metastore.uri with the URI of your Hive metastore Thrift service:
connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
If you are using AWS Glue as the Hive metastore, you can simply set the metastore to glue:
connector.name=delta-lake
hive.metastore=glue
The Delta Lake connector reuses certain functionality from the Hive connector, including the metastore Thrift and Glue configuration, detailed in the Hive connector documentation.
To configure access to S3, S3-compatible storage, Azure storage, and other supported storage systems, consult the Amazon S3 section of the Hive connector documentation or the Azure storage documentation, respectively.
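For example, a catalog for Delta tables stored on S3 with static credentials could look like the following sketch. The hive.s3.aws-access-key and hive.s3.aws-secret-key properties come from the Hive connector's Amazon S3 support, and the key values shown are placeholders; verify the property names against the Hive connector documentation for your release:
connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
# Placeholder credentials for illustration only
hive.s3.aws-access-key=AKIAIOSFODNN7EXAMPLE
hive.s3.aws-secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY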
Configuration properties#
All of the following configuration properties use reasonable, tested default values. Typical usage does not require you to configure them.
| Property name | Description | Default value |
|---|---|---|
| | Frequency of checks for metadata updates, equivalent to transactions, to update the metadata cache. | 5 min |
| | Amount of memory allocated for caching information about files. Must be specified as a data size value. | Calculated as 10% of the maximum memory allocated to the coordinator JVM. |
| | Maximum number of partitions per writer. | 100 |
| | Hide information about tables that are not managed by Delta Lake. Hiding only applies to tables with metadata managed in a Glue catalog; it does not apply when a Hive metastore service is used. | false |
| | Determines whether the connector checks that the location contains a valid transaction log entry when creating a new table. | true |
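As an illustration only, such properties are set in the catalog properties file alongside the connector name. The delta.max-partitions-per-writer and delta.hide-non-delta-lake-tables names below are assumptions used for illustration; use the exact property names from the table above for your connector version:
connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
# Assumed property names, shown only to illustrate the catalog file format
delta.max-partitions-per-writer=200
delta.hide-non-delta-lake-tables=true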
The following table describes performance tuning catalog properties for the connector.
Warning
Performance tuning configuration properties are considered expert-level features. Altering these properties from their default values is likely to cause instability and performance degradation. We strongly suggest that you use them only to address non-trivial performance issues, and that you keep a backup of the original values if you change them.
| Property name | Description | Default value |
|---|---|---|
| | Sets the number of transactions that acts as the threshold. Once it is reached, the connector initiates compaction of the underlying files and the delta files. A higher compaction threshold means reading less data from the underlying data source, but higher memory and network consumption. | 100 |
| | The target number of buffered splits for each table scan in a query, before the scheduler tries to pause. | 1000 |
| | Sets the maximum number of splits used per second to access underlying storage. Reduce this number if your limit is routinely exceeded, based on your filesystem limits. By default this is set to the absolute maximum value, which results in PrestoSQL maximizing the parallelization of data access. Attempting to set it higher results in PrestoSQL not being able to start. | Integer.MAX_VALUE |
| | For each query, the coordinator first assigns file sections to read at the initial split size, until the maximum number of initial splits is reached. | 200 |
| | Sets the initial data size for a single read section assigned to a worker, used until the maximum number of initial splits has been processed. | |
| | Sets the largest size for a single read section assigned to a worker after max-initial-splits have been processed. You must specify units. You can also use the corresponding catalog session property. | |
The following table describes catalog session properties supported by the Delta Lake connector to configure processing of Parquet files.
| Property name | Description |
|---|---|
| | The maximum block size used when reading Parquet files. |
| | The maximum block size created by the Parquet writer. |
| | The maximum page size created by the Parquet writer. |
| | Enables the experimental, native Parquet writer. |
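Session properties are set per query session with the standard SET SESSION statement, qualified by the catalog name. The parquet_max_read_block_size name and 16MB value below are assumptions used for illustration; use SHOW SESSION to list the exact property names available in your deployment:
-- List available session properties, including those of the delta catalog
SHOW SESSION;

-- Assumed property name and value, for illustration only
SET SESSION delta.parquet_max_read_block_size = '16MB';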
Special columns#
In addition to the defined columns, the Delta Lake connector automatically exposes metadata in a number of hidden columns in each table. You can use these columns in your SQL statements like any other column, e.g., they can be selected directly or used in conditional statements.
$path
Full file system path name of the file for this row.
$file_modified_time
Date and time of the last modification of the file for this row.
$file_size
Size of the file for this row.
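For example, the hidden columns can be selected alongside regular columns. The table name delta.default.my_table is illustrative, and the hidden column names must be double-quoted because of the leading $ character:
-- Show which file each row comes from, along with its size and last modification time
SELECT
  "$path",
  "$file_size",
  "$file_modified_time"
FROM delta.default.my_table;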
Creating tables#
When Delta tables exist in storage, but not in the metastore, SEP can be used to register them:
CREATE TABLE delta.default.my_table (
dummy bigint
)
WITH (
location = '...'
)
Note that the columns listed in the DDL, such as dummy in the above example, are ignored. The table schema is read from the transaction log instead. If the schema is changed by an external system, SEP automatically uses the new schema.
The Delta Lake connector also supports creating tables using the CREATE TABLE AS syntax. A set of partition columns can optionally be provided using the partitioned_by table property. For example:
CREATE TABLE delta.default.my_partitioned_table
WITH (
location = '...',
partitioned_by = ARRAY['regionkey']
)
AS SELECT name, comment, regionkey FROM tpch.tiny.nation;
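The resulting table can then be queried like any other table. Filtering on the partition column typically reduces the amount of data that has to be read:
-- Query the table created above, filtering on the partition column
SELECT name, comment
FROM delta.default.my_partitioned_table
WHERE regionkey = 1;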
Performance#
The connector includes a number of performance improvements, detailed in the following sections.
Table statistics#
The Delta Lake specification defines a number of per-file statistics that can be included in the transaction log. When they are present, the connector uses them to expose table- and column-level statistics, as documented in Table Statistics. The number of distinct values for a column is the only statistic not provided by the transaction log, and therefore cannot be used for cost-based optimizations. This results in less efficient query optimization.
The Databricks runtime automatically collects and records these statistics, while tables created by the open source Delta Lake implementation do not include them.
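You can inspect the statistics the connector exposes for a table with the standard SHOW STATS statement; the table name below refers to the example created earlier:
-- Display table and column statistics exposed by the connector
SHOW STATS FOR delta.default.my_partitioned_table;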
Security#
The connector includes a number of security-related features, detailed in the following sections.
Authorization#
The connector supports standard Hive security for authorization. For configuration properties, see the ‘Authorization’ section in Hive Connector Security Configuration.
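As a minimal sketch, assuming the catalog honors the Hive connector's hive.security property, SQL-standard authorization could be enabled in the catalog properties file as follows; confirm the exact property name in the Hive connector security documentation:
# Assumption: the Hive connector's authorization property applies to this catalog
hive.security=sql-standard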
Limitations#
Inserting data is only possible during table creation using CREATE TABLE AS. Removing and updating data is not supported. The connector does not support DDL statements, with the exception of CREATE TABLE, as described above, and CREATE SCHEMA.