Great Lakes connectivity #
Great Lakes connectivity abstracts away the details of using different table formats and file types when writing to object storage systems. This connectivity is built into Starburst Galaxy and is available to all users.
Object storage catalogs, such as those for Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, support Great Lakes connectivity.
This feature supports the following table formats:
- Iceberg
- Delta Lake
- Hive
- Hudi
Any access to existing tables works transparently for all table formats. Starburst Galaxy recognizes the format of your tables by reading the metastore associated with your object storage.
SQL support #
SQL support for a catalog using Great Lakes connectivity depends in part on the table format in use, which is set as a default for each catalog. The following additional details apply:
- Globally available statements are fully supported.
- Read operations are fully supported.
- Data management support varies across table formats.
- Schema and table management support varies across table formats.
- Catalogs can use different table formats for different tables. This results in different SQL support and different metadata and storage options for different tables.
- View management is fully supported.
- Materialized view management automatically uses the Iceberg table format. This behavior is independent of the default table format of the catalog.
More specifics are available in the following sections about the different table formats, and their SQL support.
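For example, materialized view creation stores the view data as an Iceberg table, even in a catalog that defaults to another format. The following is a minimal sketch, assuming hypothetical my_catalog.my_schema names and an existing orders table:

-- Hypothetical names. The materialized view data is stored in the Iceberg
-- table format, regardless of the catalog's default table format.
CREATE MATERIALIZED VIEW my_catalog.my_schema.daily_order_counts AS
SELECT orderdate, count(*) AS order_count
FROM my_catalog.my_schema.orders
GROUP BY orderdate;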
Specify default table format #
For data consumers who want to read from any object storage file in any supported table format, there are no configuration steps or commands. Starburst Galaxy reads each file’s type and format from the object store’s associated metadata file.
For data engineers who want to create new tables in an object storage catalog, Starburst Galaxy provides a simple way to specify the default format for all newly created tables. When creating a new catalog, select from one of the options in the Default table format dialog. Your selection applies only to newly created tables, and does not convert any existing tables.
You can also use a SQL statement to specify an individual table's format when creating that table. With a statement such as CREATE TABLE or CREATE TABLE AS, specify the type parameter with iceberg, delta, hive, or hudi as its argument. A table created this way overrides the catalog's default table format.
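For example, the following sketch uses CREATE TABLE AS with hypothetical catalog, schema, and table names to create a single Delta Lake table in a catalog that may default to another format:

-- Hypothetical names. The type property applies only to this table,
-- overriding the catalog's default table format.
CREATE TABLE my_catalog.my_schema.customer_delta
WITH (type = 'delta')
AS
SELECT name, address
FROM my_catalog.my_schema.customer;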
Iceberg table format #
Apache Iceberg is an open table format for huge analytic datasets. Starburst Galaxy allows querying data stored in files written in the Iceberg format, as defined in the Iceberg Table Spec.
Iceberg tables are automatically detected and read based on information in the storage environment’s associated metadata.
To create an Iceberg table, add type='iceberg' as a table property in your CREATE statement. For example:
CREATE TABLE customer (
name varchar,
address varchar)
WITH (type='iceberg');
Specify the file format of table data files with an additional format property, with a value of either parquet or orc, defaulting to orc:
...
WITH (type='iceberg',
format='parquet');
Append further comma-separated table properties as required:
Table property | Description |
---|---|
partitioning | Optionally specifies table partitioning. If a table is partitioned by columns c1 and c2, this property setting is partitioning=ARRAY['c1', 'c2']. |
location | Optionally specifies the file system location URI for the table. |
format_version | Optionally specifies the format version of the Iceberg specification, either 1 or 2. For example, format_version=1. |
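The following sketch combines several of these properties in one statement, using hypothetical table, column, and bucket names:

CREATE TABLE my_catalog.my_schema.events (
  event_time timestamp(6),
  c1 varchar,
  c2 varchar)
WITH (type = 'iceberg',
  format = 'parquet',
  partitioning = ARRAY['c1', 'c2'],
  location = 's3://my-bucket/events',
  format_version = 2);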
SQL support #
When using the Iceberg table format with Great Lakes connectivity, the general SQL support details apply with the following additional considerations:
- For data management, only INSERT and DELETE by partition are supported.
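For example, a DELETE statement whose WHERE clause references only partitioning columns removes entire partitions. The following sketch assumes the hypothetical events table from the previous section, partitioned by c1:

-- The WHERE clause matches whole partitions on the partitioning
-- column c1, so complete partitions are deleted.
DELETE FROM my_catalog.my_schema.events
WHERE c1 = 'expired';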
Delta Lake table format #
Delta Lake connectivity allows querying data stored in Delta Lake tables, including Databricks Delta Lake.
Delta Lake tables are automatically detected and read based on information in the storage environment’s associated metadata.
To create a Delta Lake table, add type='delta' as a table property in your CREATE statement. There is no format parameter for Delta Lake, but you can append other table properties as required:
Table property | Description |
---|---|
partitioned_by | Optionally specifies table partitioning. If a table is partitioned by column regionkey, this property is set with partitioned_by=ARRAY['regionkey']. |
location | Optionally specifies the file system location URI for the table. |
checkpoint_interval | Optionally specifies the checkpoint interval in number of table writes. |
For example:
CREATE TABLE galaxy_delta.default.my_partitioned_table
WITH (
type = 'delta',
location = 's3://my-bucket/at/this/path',
partitioned_by = ARRAY['regionkey'],
checkpoint_interval = 5
)
AS
SELECT name, comment, regionkey
FROM tpch.tiny.nation;
SQL support #
When using the Delta Lake table format with Great Lakes connectivity, the general SQL support details apply.
Hive table format #
Hive connectivity allows querying data stored in an Apache Hive data warehouse.
Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.
To create a Hive table, add type='hive' as a table property in your CREATE statement. For example:
CREATE TABLE customer (
name varchar,
address varchar)
WITH (type='hive');
Specify the format of table data files with an additional format property with one of the values shown in the following table:
File format | Description |
---|---|
ORC | Default value, Apache ORC file format. |
PARQUET | Apache Parquet file format. |
JSON | JSON file format using org.apache.hive.hcatalog.data.JsonSerDe. |
CSV | Comma-separated values file format using org.apache.hadoop.hive.serde2.OpenCSVSerde. |
TEXTFILE | Plain text file format. |
AVRO | Apache Avro file format. |
RCTEXT | RCFile format using ColumnarSerDe. |
RCBINARY | RCFile format using LazyBinaryColumnarSerDe. |
SEQUENCEFILE | Hadoop SequenceFile format. |
Specify additional table properties as needed, consulting the list of table properties in the Trino Hive documentation. Notice that many of the available table properties must be used in conjunction with a particular Hive file format.
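For example, the CSV file format supports only varchar columns. The following minimal sketch, with hypothetical names, creates a CSV-format Hive table:

-- Hypothetical names. With format = 'CSV', all columns must be varchar.
CREATE TABLE my_catalog.my_schema.csv_export (
  id varchar,
  name varchar)
WITH (type = 'hive',
  format = 'CSV');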
Some of the example SQL statements on the Hive documentation page must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:
CREATE TABLE hive.web.page_views (
view_time timestamp,
user_id bigint,
page_url varchar,
ds date,
country varchar
)
WITH (
type = 'hive',
format = 'ORC',
partitioned_by = ARRAY['ds', 'country'],
bucketed_by = ARRAY['user_id'],
bucket_count = 50
)
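Note that the partition columns ds and country appear last in the column list. A hypothetical INSERT supplies values in that same order:

-- Illustrative values only.
INSERT INTO hive.web.page_views
VALUES (TIMESTAMP '2024-05-01 12:34:56', 42, 'https://example.com/home',
  DATE '2024-05-01', 'US');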
SQL support #
When using the Hive table format with Great Lakes connectivity, the general SQL support details apply with the following additional considerations:
- Hive views are not supported.
Hudi table format #
Hudi tables have read-only support. Existing Hudi tables detected in a Galaxy-connected object storage location are read automatically. Galaxy cannot create new Hudi tables or write to them.
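Reading a detected Hudi table works like querying any other table, as in this sketch with hypothetical names:

-- Hypothetical names. Hudi tables are read-only in Starburst Galaxy.
SELECT *
FROM my_catalog.my_schema.hudi_orders
LIMIT 10;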
SQL support #
When using the Hudi table format with Great Lakes connectivity, the general SQL support details apply with the following additional considerations:
- Write operations, including data management and schema and table management, are not supported.