Great Lakes connectivity #

Great Lakes connectivity abstracts away the details of different table formats and file types when you issue write statements to object storage systems. This connectivity is built into Starburst Galaxy and is available to all users.

This feature lets you use the following table formats in the same catalog:

- Iceberg
- Delta Lake
- Hive

Catalogs for object storage systems support Great Lakes connectivity.

Any read access or write access to existing tables works transparently, independent of the table format used. As before, Starburst Galaxy recognizes your table formats by reading the metastore associated with your object storage.

For write operations, such as CREATE TABLE or CREATE TABLE AS statements, you need to specify the table type and, optionally, the file format. Add the type= property with iceberg, delta, or hive as its value. For the Hive and Iceberg types, you can also include a format= property with values discussed in the following sections.
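
For example, a minimal sketch of the general pattern (the table name is hypothetical):

CREATE TABLE sales (id bigint)  -- 'sales' is a hypothetical table name
WITH (type = 'delta');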

Iceberg type #

Apache Iceberg is an open table format for huge analytic datasets. Starburst Galaxy allows querying data stored in files written in Iceberg format, as defined in the Iceberg Table Spec.

To create a table in Iceberg format, add type='iceberg' as a table property in your CREATE statement. For example:

CREATE TABLE customer (
    name varchar,
    address varchar)
WITH (type='iceberg');

Specify the format of table data files with an additional format property, with a value of either parquet or orc. The default is orc:

...
WITH (type='iceberg',
      format='parquet');

Append further comma-separated table properties as required:

| Table property | Description |
| --- | --- |
| partitioning | Optionally specifies table partitioning. If a table is partitioned by columns c1 and c2, this property setting is partitioning=ARRAY['c1', 'c2']. |
| location | Optionally specifies the file system location URI for the table. |
| format_version | Optionally specifies the format version of the Iceberg specification, either 1 or 2. For example, format_version=1. |
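
For example, a sketch that combines these properties (the table name and bucket path are illustrative):

CREATE TABLE customer_iceberg (  -- hypothetical table name
    name varchar,
    address varchar,
    region varchar)
WITH (type='iceberg',
      format='parquet',
      partitioning=ARRAY['region'],
      format_version=2,
      location='s3://my-bucket/at/this/path');  -- illustrative bucket path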

For the Iceberg type, only INSERT and DELETE by partition are supported.
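
For example, using the partitioned table sketched above, rows are inserted as usual, while a DELETE must target entire partitions:

-- insert a row into the hypothetical partitioned table
INSERT INTO customer_iceberg
VALUES ('Alice', '123 Main St', 'EUROPE');

-- delete an entire partition by filtering on the partition column
DELETE FROM customer_iceberg
WHERE region = 'EUROPE';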

Delta Lake type #

Delta Lake connectivity allows querying data stored in Delta Lake format, including Databricks Delta Lake.

To create a table in Delta Lake format, add type='delta' as a table property in your CREATE statement. There is no format property for Delta Lake, but you can append other table properties as required:

| Table property | Description |
| --- | --- |
| partitioned_by | Optionally specifies table partitioning. If a table is partitioned by column regionkey, this property is set with partitioned_by=ARRAY['regionkey']. |
| location | Optionally specifies the file system location URI for the table. |
| checkpoint_interval | Optionally specifies the checkpoint interval, as a number of transactions between checkpoints. |

For example:

CREATE TABLE galaxy_delta.default.my_partitioned_table
WITH (
  type = 'delta',
  location = 's3://my-bucket/at/this/path',
  partitioned_by = ARRAY['regionkey'],
  checkpoint_interval = 5
)
AS
  SELECT name, comment, regionkey
  FROM tpch.tiny.nation;

Hive type #

Hive connectivity allows querying data stored in an Apache Hive data warehouse.

To create a table in Hive format, add type='hive' as a table property in your CREATE statement. For example:

CREATE TABLE customer (
  name varchar,
  address varchar)
WITH (type='hive');

Specify the format of table data files with an additional format property, using one of the values shown in the following table:

| File format | Description |
| --- | --- |
| ORC | Default value. Apache ORC file format. |
| PARQUET | Apache Parquet file format. |
| JSON | JSON file format, using org.apache.hive.hcatalog.data.JsonSerDe. |
| CSV | Comma-separated values file format, using org.apache.hadoop.hive.serde2.OpenCSVSerde. |
| TEXTFILE | Plain text file format. |
| AVRO | Apache Avro file format. |
| RCTEXT | RCFile format, using ColumnarSerDe. |
| RCBINARY | RCFile format, using LazyBinaryColumnarSerDe. |
| SEQUENCEFILE | Hadoop SequenceFile format. |
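
For example, a sketch creating a JSON-format Hive table (the table name is illustrative):

CREATE TABLE web_events (  -- hypothetical table name
  event_time timestamp,
  payload varchar)
WITH (type='hive',
      format='JSON');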

Specify additional table properties as needed, consulting the list of table properties in the Trino Hive documentation. Notice that many of the available table properties must be used in conjunction with a particular Hive file format.

Some of the example SQL statements on the Hive documentation page must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:

CREATE TABLE hive.web.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  type = 'hive',
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country'],
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
);
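
Once the table exists, an INSERT routes rows to the matching partition and bucket automatically. A hypothetical row:

-- values follow the column order declared above; ds and country select the partition
INSERT INTO hive.web.page_views
VALUES (TIMESTAMP '2024-05-01 10:00:00', 42, 'https://example.com', DATE '2024-05-01', 'US');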