Starburst Galaxy
  • Great Lakes connectivity #

    Great Lakes connectivity abstracts the details of using different table formats and file types when writing to object storage systems. This connectivity is built into Starburst Galaxy and is available to all users.

    The following object storage catalogs support Great Lakes connectivity:

    This feature supports the following table formats:

    • Iceberg
    • Delta Lake
    • Hive
    • Hudi (read only)

    Any access to existing tables works transparently for all table formats. Starburst Galaxy recognizes the format of your tables by reading the metastore associated with your object storage.
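    Because the format is detected from the metastore, a single query can read tables of different formats through the same catalog. As a sketch, with hypothetical catalog, schema, and table names:

    ```sql
    -- Both tables live in the same object storage catalog; Galaxy detects
    -- each table's format (for example, Iceberg or Hive) from the metastore.
    SELECT o.order_id, c.name
    FROM my_catalog.sales.orders AS o      -- an Iceberg table
    JOIN my_catalog.sales.customers AS c   -- a Hive table
      ON o.customer_id = c.customer_id;
    ```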

    SQL support #

    SQL support for a catalog using Great Lakes connectivity depends in part on the table format in use and on the default table format set for the catalog. The following additional details apply:

    The following sections provide more detail about each table format and its SQL support.

    Specify default table format #

    For data consumers who want to read from any object storage file in any supported table format, there are no configuration steps or commands. Starburst Galaxy reads each file’s type and format from the object store’s associated metadata file.

    For data engineers who want to create new tables in an object storage catalog, Starburst Galaxy provides a simple way to specify the default format for all newly created tables. When creating a new catalog, select from one of the options in the Default table format dialog. Your selection applies only to newly created tables, and does not convert any existing tables.

    Select default table format

    You can also use a SQL statement to specify an individual table's format when creating that table. In a statement such as CREATE TABLE or CREATE TABLE AS, specify the type property with iceberg, delta, hive, or hudi as its argument. A table created this way overrides the catalog's default table format.
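    As a sketch, the following statement creates a Delta Lake table in a catalog whose default format may be something else. The catalog and schema names are hypothetical:

    ```sql
    -- The type property overrides the catalog's default table format
    -- for this one table only.
    CREATE TABLE my_catalog.my_schema.orders_delta
    WITH (type = 'delta')
    AS SELECT orderkey, custkey, totalprice
       FROM tpch.tiny.orders;
    ```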

    Iceberg table format #

    Apache Iceberg is an open table format for huge analytic datasets. Starburst Galaxy allows querying data stored in files written in the Iceberg format, as defined in the Iceberg Table Spec.

    Iceberg tables are automatically detected and read based on information in the storage environment’s associated metadata.

    To create an Iceberg table, add type='iceberg' as a table property in your CREATE statement. For example:

    CREATE TABLE customer (
        name varchar,
        address varchar)
    WITH (type='iceberg');
    

    Specify the file format of table data files with an additional format property, whose value is either parquet or orc (the default):

    ...
    WITH (type='iceberg',
          format='parquet');
    

    Append further comma-separated table properties as required:

    Table property Description
    partitioning Optionally specifies table partitioning. If a table is partitioned by columns c1 and c2, this property setting is partitioning=ARRAY['c1', 'c2'].
    location Optionally specifies the file system location URI for the table.
    format_version Optionally specifies the format version of the Iceberg specification, either 1 or 2. For example, format_version=1.
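    Combining these properties, a CREATE statement might look like the following sketch; the column names and location URI are hypothetical:

    ```sql
    CREATE TABLE events (
        event_time timestamp,
        region varchar,
        payload varchar)
    WITH (type = 'iceberg',
          format = 'parquet',
          partitioning = ARRAY['region'],
          location = 's3://my-bucket/events/',
          format_version = 2);
    ```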

    SQL support #

    When using the Iceberg table format with Great Lakes connectivity, the general SQL support details apply with the following additional considerations:

    • Only INSERT and DELETE by partition are supported.
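    A partition-aligned DELETE filters only on partition columns, so entire partitions are removed rather than individual rows. Assuming a table partitioned by a region column (hypothetical names):

    ```sql
    -- The WHERE clause references only the partitioning column,
    -- so whole partitions are dropped.
    DELETE FROM my_catalog.my_schema.events
    WHERE region = 'EMEA';
    ```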

    Delta Lake table format #

    Delta Lake connectivity allows querying data stored in Delta Lake tables, including Databricks Delta Lake.

    Delta Lake tables are automatically detected and read based on information in the storage environment’s associated metadata.

    To create a Delta Lake table, add type='delta' as a table property in your CREATE statement. There is no format parameter for Delta Lake, but you can append other table properties as required:

    Table property Description
    partitioned_by Optionally specifies table partitioning. If a table is partitioned by column regionkey, this property is set with partitioned_by=ARRAY['regionkey'].
    location Optionally specifies the file system location URI for the table.
    checkpoint_interval Optionally specifies the number of table writes between checkpoints of the transaction log.

    For example:

    CREATE TABLE galaxy_delta.default.my_partitioned_table
    WITH (
      type = 'delta',
      location = 's3://my-bucket/at/this/path',
      partitioned_by = ARRAY['regionkey'],
      checkpoint_interval = 5
    )
    AS
      SELECT name, comment, regionkey
      FROM tpch.tiny.nation;
    

    SQL support #

    When using the Delta Lake table format with Great Lakes connectivity, the general SQL support details apply.

    Hive table format #

    Hive connectivity allows querying data stored in an Apache Hive data warehouse.

    Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.

    To create a Hive table, add type='hive' as a table property in your CREATE statement. For example:

    CREATE TABLE customer (
      name varchar,
      address varchar)
    WITH (type='hive');
    

    Specify the format of table data files with an additional format property, using one of the values shown in the following table:

    File format Description
    ORC Default value; Apache ORC file format.
    PARQUET Apache Parquet file format.
    JSON JSON file format using org.apache.hive.hcatalog.data.JsonSerDe.
    CSV Comma-separated values file format using org.apache.hadoop.hive.serde2.OpenCSVSerde.
    TEXTFILE Plain text file format.
    AVRO Apache Avro file format.
    RCTEXT RCFile format using ColumnarSerDe.
    RCBINARY RCFile format using LazyBinaryColumnarSerDe.
    SEQUENCEFILE Hadoop SequenceFile format.

    Specify additional table properties as needed, consulting the list of table properties in the Trino Hive documentation. Note that many of the available table properties apply only to a particular Hive file format.
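    For example, a format-specific property such as csv_separator in the Trino Hive connector applies only when format = 'CSV'. As a sketch, with hypothetical table and column names (note that CSV-format tables in Trino support only varchar columns):

    ```sql
    -- csv_separator only takes effect with format = 'CSV'.
    CREATE TABLE logs (
        line_time varchar,
        message varchar)
    WITH (type = 'hive',
          format = 'CSV',
          csv_separator = '|');
    ```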

    Some of the example SQL statements on the Hive documentation page must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:

    CREATE TABLE hive.web.page_views (
      view_time timestamp,
      user_id bigint,
      page_url varchar,
      ds date,
      country varchar
    )
    WITH (
      type = 'hive',
      format = 'ORC',
      partitioned_by = ARRAY['ds', 'country'],
      bucketed_by = ARRAY['user_id'],
      bucket_count = 50
    );
    

    SQL support #

    When using the Hive table format with Great Lakes connectivity, the general SQL support details apply with the following additional considerations:

    • Hive views are not supported.

    Hudi table format #

    Hudi tables have read-only support. Existing Hudi-format tables detected in a Galaxy-connected object storage location are read automatically.

    Galaxy cannot create new Hudi tables or write to them.
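    Reading an existing Hudi table therefore requires no table properties or configuration; a plain SELECT is enough. The catalog, schema, and table names below are hypothetical:

    ```sql
    -- Hudi tables are detected from the metastore and are read only.
    SELECT *
    FROM my_catalog.my_schema.hudi_trips
    LIMIT 10;
    ```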

    SQL support #

    When using the Hudi table format with Great Lakes connectivity, the general SQL support details apply with the following additional considerations:

    • Only read operations are supported; write statements are not available.