Starburst Galaxy

    Hudi table format #

    Great Lakes connectivity abstracts the details of using different table formats and file types when using object storage catalogs.

    This page describes the features specific to the Hudi table format when used with Great Lakes connectivity.

    Hudi tables are read-only #

    Great Lakes connectivity supports the Hudi table format in read-only mode. Existing Hudi tables that are detected in a Galaxy-connected object storage location are read automatically.

    Galaxy cannot create new Hudi tables or write to them.
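
    As a minimal sketch of what read-only support means in practice, the statements below use the placeholder names catalog_name, schema_name, and table_name from this page; they are not real objects. The commented-out INSERT is shown only to illustrate the kind of statement that is expected to be rejected.

    ```sql
    -- Reading an existing Hudi table works like reading any other table.
    SELECT *
    FROM catalog_name.schema_name.table_name
    LIMIT 10;

    -- Write statements are expected to fail, because Hudi support is read-only:
    -- INSERT INTO catalog_name.schema_name.table_name SELECT * FROM other_table;
    ```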

    Metadata tables #

    Great Lakes connectivity exposes several metadata tables for the Hudi table format. These metadata tables contain information about the internal structure of the Hudi table. Query each metadata table by appending the metadata table name to the table_name:

    SELECT * FROM catalog_name.schema_name."table_name$timeline";
    

    $timeline #

    The $timeline table provides a detailed view of metadata instants in the Hudi table. Instants are specific points in time.

    The following table describes the columns returned by a query on the $timeline table:

    $timeline columns
    Name       Type     Description
    timestamp  VARCHAR  Instant time: the timestamp at which the action was performed.
    action     VARCHAR  The type of action performed on the table.
    state      VARCHAR  The current state of the instant.
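
    The columns above can be combined into a filtered query, for example to list the most recent finished commits. This is a sketch only: the literal values 'commit' and 'COMPLETED' are assumptions based on common Hudi timeline conventions and may differ for your table, and sorting the VARCHAR timestamp column assumes that instant times are fixed-width strings that sort chronologically.

    ```sql
    -- List the five most recent completed commits (assumed action and state values).
    -- The column name is double-quoted because timestamp is also a type name.
    SELECT "timestamp", action, state
    FROM catalog_name.schema_name."table_name$timeline"
    WHERE action = 'commit'
      AND state = 'COMPLETED'
    ORDER BY "timestamp" DESC
    LIMIT 5;
    ```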

    Session properties #

    A session property lets a user temporarily override a configuration property for the duration of the current connection session to the cluster. Use the SET SESSION statement to assign the property a value, such as true or false:

    SET SESSION catalog_name.session_property = expression;
    

    Use the SHOW SESSION statement to view all current session properties. For additional information, read about the SET SESSION and RESET SESSION SQL statements.

    Catalog session properties are connector-defined session properties that can be set on a per-catalog basis. These properties must be set separately for each catalog by including the catalog name before the property name, for example, catalog_name.property_name.

    Session properties are linked to the current session, so a user can have multiple connections to a cluster that each have different values for the same session properties. Once a session ends, either by disconnecting or creating a new session, any changes made to session properties during the previous session are lost.
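
    The lifecycle described above can be sketched as a short sequence of statements. The catalog name catalog_name is a placeholder, and the LIKE pattern is only an illustration of narrowing the SHOW SESSION output; adjust both for your environment.

    ```sql
    -- Override a catalog session property for this session only.
    SET SESSION catalog_name.parquet_optimized_reader_enabled = false;

    -- Inspect the current value; the LIKE clause filters by property name.
    SHOW SESSION LIKE 'catalog_name.parquet%';

    -- Return the property to its default without ending the session.
    RESET SESSION catalog_name.parquet_optimized_reader_enabled;
    ```

    Because the change is scoped to the session, a second connection to the same cluster still sees the default value unless it runs its own SET SESSION statement.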

    The following sections describe the session properties supported by the Hudi table format:

    parquet_optimized_reader_enabled #

    SET SESSION catalog_name.parquet_optimized_reader_enabled = true;
    

    Specifies whether batched column readers are used when reading Parquet files for improved performance. Set this property to false to disable the optimized Parquet reader. The default value is true.

    parquet_optimized_nested_reader_enabled #

    SET SESSION catalog_name.parquet_optimized_nested_reader_enabled = true;
    

    Specifies whether batched column readers are used when reading ARRAY, MAP, and ROW types from Parquet files for improved performance. Set this property to false to disable the optimized Parquet reader for structural data types. The default value is true.
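
    One possible use of this property is to compare query behavior with and without the optimized reader for structural types. The sketch below assumes a hypothetical table named orders with a ROW column named address; neither exists in this documentation.

    ```sql
    -- Disable the batched reader for ARRAY, MAP, and ROW columns only.
    SET SESSION catalog_name.parquet_optimized_nested_reader_enabled = false;

    -- Hypothetical query that reads a field from a ROW column.
    SELECT order_id, address.city
    FROM catalog_name.schema_name.orders
    LIMIT 10;

    -- Restore the default (true) before running further queries.
    RESET SESSION catalog_name.parquet_optimized_nested_reader_enabled;
    ```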

    Hudi SQL support #

    When using the Hudi table format with Great Lakes connectivity, the general SQL support details apply, with the following additional consideration.