    Hive table format #

    Great Lakes connectivity abstracts the details of using different table formats and file types when using object storage catalogs.

    This page describes the features specific to the Hive table format when used with Great Lakes connectivity.

    Specify Hive format #

    Hive connectivity allows querying data stored in an Apache Hive data warehouse. Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.

    For an object storage catalog that specifies Hive as its default table format, no special syntax is required. A simple CREATE TABLE statement creates a Hive format table.

    To create a Hive table when the default format is not Hive, add type='hive' as a table property in your CREATE statement. For example:

    CREATE TABLE customer (
      name varchar,
      address varchar)
    WITH (type='hive');
    

    Hive table format properties #

    As part of a CREATE TABLE statement, specify the format of the table data files with the additional format property, set to one of the values shown in the following table:

    File format   Description
    ORC           Apache ORC file format. This is the default value.
    PARQUET       Apache Parquet file format.
    JSON          JSON file format, using org.apache.hive.hcatalog.data.JsonSerDe.
    CSV           Comma-separated values file format, using org.apache.hadoop.hive.serde2.OpenCSVSerde.
    TEXTFILE      Plain text file format.
    AVRO          Apache Avro file format.
    RCTEXT        RCFile format, using ColumnarSerDe.
    RCBINARY      RCFile format, using LazyBinaryColumnarSerDe.
    SEQUENCEFILE  Flat file format using binary key/value pairs.
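
    For example, the following sketch, with illustrative table and column names, creates a Hive table that stores its data in Parquet files:

    CREATE TABLE orders (
      orderkey bigint,
      orderstatus varchar)
    WITH (type='hive', format='PARQUET');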

    Hive table properties #

    As part of a CREATE TABLE statement, append further comma-separated table properties as needed. Note that many of the available table properties must be used in conjunction with a particular Hive file format.

    Table property Description
    auto_purge Indicates to the configured metastore to perform a purge when a table or partition is deleted, instead of a soft deletion using the trash.
    avro_schema_url The URI of the Avro schema for the table, supporting schema evolution.
    bucket_count The number of buckets to group data into. Only valid if used with bucketed_by. Defaults to 0.
    bucketed_by The bucketing column for the storage table. Only valid if used with bucket_count. Defaults to [].
    bucketing_version Specifies which Hive bucketing version to use. Valid values are 1 or 2.
    csv_escape The CSV escape character. Requires CSV format.
    csv_quote The CSV quote character. Requires CSV format.
    csv_separator The CSV separator character. Requires CSV format. You can use other separators such as | or use Unicode to configure invisible separators such as tabs with U&'\0009'.
    external_location The URI of an external Hive table on S3, Azure Blob Storage, etc. See the basic usage examples for more information.
    format The table file format. Valid values are ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV, and REGEX.
    null_format The serialization format for NULL values. Requires TextFile, RCText, or SequenceFile format.
    orc_bloom_filter_columns Comma-separated list of columns to use for ORC bloom filters. Bloom filters improve the performance of queries using range predicates when reading ORC files. Requires ORC format. Defaults to [].
    orc_bloom_filter_fpp The false positive probability for ORC bloom filters. Requires ORC format. Defaults to 0.05.
    partitioned_by The partitioning column for the storage table. The columns listed in the partitioned_by clause must be the last columns as defined in the DDL. Defaults to [].
    skip_footer_line_count The number of footer lines to ignore when parsing the file for data. Requires TextFile or CSV format tables.
    skip_header_line_count The number of header lines to ignore when parsing the file for data. Requires TextFile or CSV format tables.
    sorted_by The column to sort by to determine bucketing for rows. Only valid if bucketed_by and bucket_count are specified as well. Defaults to [].
    textfile_field_separator Allows the use of custom field separators, such as |, for TextFile formatted tables.
    textfile_field_separator_escape Allows the use of a custom escape character for TextFile formatted tables.
    transactional Set this property to true to create an ORC ACID transactional table. Requires ORC format. This property may be shown as true for insert-only tables created using older versions of Hive.
    partition_projection_enabled Enables partition projection for the selected table. Mapped from the AWS Athena table property projection.enabled.
    partition_projection_ignore Ignore any partition projection properties stored in the metastore for the selected table. This is a Galaxy-only property that lets you work around compatibility issues on a specific table; if enabled, Galaxy ignores all other configuration options related to partition projection.
    partition_projection_location_template Projected partition location template, such as s3://test/name=${name}/. Mapped from the AWS Athena table property storage.location.template. Defaults to ${table_location}/${partition_name}.
    partition_projection_type Defines the type of partition projection to use on a specific column. May be used only on partition columns. Available types include: ENUM, INTEGER, DATE.
    extra_properties Additional properties added to a Hive table. The properties are not used by Galaxy, and are available in the $properties metadata table. The properties are not included in the output of SHOW CREATE TABLE statements.

    Some SQL statements must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:

    CREATE TABLE hive.web.page_views (
      view_time timestamp,
      user_id bigint,
      page_url varchar,
      ds date,
      country varchar
    )
    WITH (
      type = 'hive',
      format = 'ORC',
      partitioned_by = ARRAY['ds', 'country'],
      bucketed_by = ARRAY['user_id'],
      bucket_count = 50
    )
    

    Hive metadata table #

    The $properties metadata table provides general configuration information about a Hive table. Query the metadata table by appending $properties to a table name:

    SELECT * FROM "hive.web.page_views$properties";
    

    Table procedures #

    Use the CALL statement to perform data manipulation or administrative tasks. Procedures must include a qualified catalog name. For example, for a catalog named examplecatalog:

    CALL examplecatalog.system.example_procedure();
    

    The following Hive table procedures are available:

    create_empty_partition #

    system.create_empty_partition('schema_name', 'table_name', partition_columns, partition_values);
    

    Creates an empty partition in the specified table.

    The partition columns can have types other than varchar, such as integer or date. Regardless of the column type, specify each element of the partition_values array as a varchar. For example: partition_values => ARRAY['2016-08-09', 'US']
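
    For example, a complete invocation using the hive.web.page_views table defined earlier might look like this sketch:

    CALL hive.system.create_empty_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US']);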

    drop_stats #

    system.drop_stats('schema_name', 'table_name', partition_values);
    

    Drops statistics for one or more partitions or the entire table. Specify the partitions as an array whose elements are arrays of partition_values. The following example shows partition_values for dropping stats for multiple partitions:

    partition_values => ARRAY[ARRAY['2016-08-09', 'US'], ARRAY['2023-08-03', 'UK']]
    

    The partition_values argument is optional. If the partition_values argument is omitted, stats are dropped for the entire table.
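
    For example, a sketch that drops statistics for a single partition of the hive.web.page_views table defined earlier:

    CALL hive.system.drop_stats(
        schema_name => 'web',
        table_name => 'page_views',
        partition_values => ARRAY[ARRAY['2016-08-09', 'US']]);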

    register_partition #

    system.register_partition('schema_name', 'table_name', partition_columns,
        partition_values, location);
    

    Registers the existing location as a new partition in the metastore for the specified table.

    With the location argument omitted, the partition location is constructed using partition_columns and partition_values.
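
    For example, a sketch that registers an existing object storage location as a partition of the hive.web.page_views table defined earlier; the S3 URI is illustrative:

    CALL hive.system.register_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US'],
        location => 's3://example-bucket/page_views/ds=2016-08-09/country=US');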

    sync_partition_metadata #

    system.sync_partition_metadata('schema_name', 'table_name', 'mode', case_sensitive);
    

    Checks and updates the partition list in the metastore. Three modes are available:

    • ADD: add any partitions that exist on the file system, but not in the metastore.
    • DROP: drop any partitions that exist in the metastore, but not on the file system.
    • FULL: perform both ADD and DROP.

    The case_sensitive argument is optional. The default value is true for compatibility with Hive’s MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (for example, col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
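
    For example, a sketch that both adds and drops partitions so the metastore matches the file system for the hive.web.page_views table defined earlier:

    CALL hive.system.sync_partition_metadata(
        schema_name => 'web',
        table_name => 'page_views',
        mode => 'FULL');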

    unregister_partition #

    system.unregister_partition('schema_name', 'table_name', partition_columns,
        partition_values);
    

    Unregisters the given, existing partition in the metastore for the specified table. The partition data is not deleted.
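
    For example, a sketch that removes the partition registered in the previous example:

    CALL hive.system.unregister_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US']);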

    Migrate to Iceberg #

    Great Lakes connectivity supports migration from the Hive table format to the Iceberg table format, and it can read and write Hive tables that have been successfully migrated to Iceberg. The migration process creates Iceberg metadata without copying any data, and it does not generate table statistics.

    The migration from Hive to Iceberg supports only the Parquet, ORC, and Avro file formats.

    Use an ALTER TABLE SET PROPERTIES statement to convert a table from the Hive table format to the Iceberg table format:

    ALTER TABLE catalog_name.schema_name.table_name SET PROPERTIES type = 'ICEBERG';
    

    Run the statement against the specific catalog, supplying the relevant schema and table names for the required schema_name and table_name parameters.

    Once the migration process is complete, use the SHOW CREATE TABLE statement to confirm that the table type has changed:

    SHOW CREATE TABLE example_catalog.schema_name.table_name;
    

    A successful migration shows type = 'ICEBERG' in the output.

    Partition projection #

    Partition projection is a feature used to speed up query processing. This feature computes partition values and locations statically from the partition projection definition of the table properties and does not rely on the metastore for retrieving them.

    Great Lakes connectivity supports partition projection table properties stored in the metastore, and it reimplements this functionality. Currently, date projection only supports intervals of DAYS, HOURS, MINUTES, and SECONDS.

    If there are any compatibility issues blocking access to a requested table when partition projection is enabled, set the partition_projection_ignore table property to true for a table to bypass any errors.

    In addition, Great Lakes connectivity supports the partition_projection_type table property, which defines the type of partition projection to use on a specific column. It can be set to enum, integer, or date:

    Name     Description
    enum     Use when there is a predefined set of partition values, such as country or region names.
    integer  Specifies a range, providing a more concise way to predefine the partition values. For example, [1, 2, 3, 4, ..., 1000] or [0500, 0550, 0600, ..., 2500].
    date     Specifies a range, providing a more concise way to predefine the partition values. For example, [20200101, 20200102, ..., 20201231] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, ..., 12-31-2020 23:00:00].

    Read more about Hive table properties.

    Using partition projection #

    To use partition projection, specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or in an external Hive metastore.

    Create a projection with the enum projection type:

    CREATE SCHEMA catalog_name.schema_name;
    USE catalog_name.schema_name;
    
    CREATE TABLE enum_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer,
        short_name varchar WITH (
            partition_projection_type = 'enum',
            partition_projection_values = array ['POL', 'CZE']
        )
    )
    WITH (
        partitioned_by = array['short_name'],
        partition_projection_enabled = true
    );
    
    SHOW CREATE TABLE enum_table_name;
    
    INSERT INTO enum_table_name VALUES ('POLAND_1', 'Czesc', 0, 5, 'POL');
    INSERT INTO enum_table_name VALUES ('POLAND_2', 'Czesc', 1, 5, 'POL');
    INSERT INTO enum_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5, 'CZE');
    INSERT INTO enum_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5, 'CZE');
    

    The INSERT INTO statements add data to the POL and CZE partitions.

    SELECT * FROM "enum_table_name$partitions";
    

    The SELECT statement displays only the POL and CZE partitions, which correspond to the partition projection defined when the table was created rather than to partitions stored in the metastore.

    | short_name |
    +------------+
    | POL        |
    +------------+
    | CZE        |
    +------------+
    

    Create a projection with the integer projection range:

    CREATE TABLE integer_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer WITH (
            partition_projection_type = 'integer',
            partition_projection_range = array ['0', '10']
        )
    )
    WITH (
        partitioned_by = array['regionkey'],
        partition_projection_enabled = true
    );
    
    INSERT INTO integer_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
    INSERT INTO integer_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
    INSERT INTO integer_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
    INSERT INTO integer_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
    INSERT INTO integer_table_name VALUES ('GERMANY_1', 'Hallo', 12, 8);
    INSERT INTO integer_table_name VALUES ('GERMANY_2', 'Hallo', 13, 8);
    INSERT INTO integer_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
    

    The INSERT INTO statements add data to the defined partitions.

    SELECT * FROM "integer_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions that correspond to the partition projection defined when the table was created, not partitions stored in the metastore.

    | regionkey |
    +-----------+
    | 0         |
    +-----------+
    | 1         |
    +-----------+
    | 2         |
    +-----------+
    | 3         |
    +-----------+
    | 4         |
    +-----------+
    | 5         |
    +-----------+
    | 6         |
    +-----------+
    | 7         |
    +-----------+
    | 8         |
    +-----------+
    | 9         |
    +-----------+
    | 10        |
    +-----------+
    

    Create a projection with the integer projection range using partition_projection_interval:

    CREATE TABLE interval_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer WITH (
            partition_projection_type = 'integer',
            partition_projection_range = array ['0', '10'],
            partition_projection_interval = 3
        )
    )
    WITH (
        partitioned_by = array['regionkey'],
        partition_projection_enabled = true
    );
    
    INSERT INTO interval_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
    INSERT INTO interval_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
    INSERT INTO interval_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
    INSERT INTO interval_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
    INSERT INTO interval_table_name VALUES ('GERMANY_1', 'Hallo', 12, 6);
    INSERT INTO interval_table_name VALUES ('GERMANY_2', 'Hallo', 13, 9);
    INSERT INTO interval_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
    

    The INSERT INTO statements add data to the defined partitions.

    SELECT * FROM "interval_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions that correspond to the partition projection defined when the table was created, not partitions stored in the metastore.

    In the following example output, only 0, 3, 6, and 9 are displayed because of the chosen partition_projection_interval value of 3.

    | regionkey |
    +-----------+
    | 0         |
    +-----------+
    | 3         |
    +-----------+
    | 6         |
    +-----------+
    | 9         |
    +-----------+
    

    Create a projection with the date projection range:

    CREATE TABLE date_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        dt varchar WITH (
            partition_projection_type = 'date',
            partition_projection_format = 'yyyy-MM-dd',
            partition_projection_range = array['NOW-5DAYS', 'NOW']
        )
    )
    WITH (
        partitioned_by = array['dt'],
        partition_projection_enabled = true
    );
    
    INSERT INTO date_table_name VALUES ('POLAND_1', 'Czesc', 0, '2023-06-01');
    INSERT INTO date_table_name VALUES ('POLAND_2', 'Czesc', 1, '2023-06-01');
    INSERT INTO date_table_name VALUES ('CZECH_1', 'Ahoj', 2, '2023-06-29');
    INSERT INTO date_table_name VALUES ('CZECH_2', 'Ahoj', 3, '2023-06-01');
    INSERT INTO date_table_name VALUES ('GERMANY_1', 'Hallo', 12, '2023-06-01');
    INSERT INTO date_table_name VALUES ('GERMANY_2', 'Hallo', 13, '2023-07-02');
    INSERT INTO date_table_name VALUES ('MEXICO_1', 'Hola', 22, '2023-07-01');
    

    The INSERT INTO statements add data to the defined partitions.

    SELECT * FROM "date_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions that correspond to the partition projection defined when the table was created, not partitions stored in the metastore.

    The following example output shows the values of the dt partition column:

    | dt         |
    +------------+
    | 2023-08-26 |
    +------------+
    | 2023-08-27 |
    +------------+
    | 2023-08-28 |
    +------------+
    | 2023-08-29 |
    +------------+
    | 2023-08-30 |
    +------------+
    | 2023-08-31 |
    +------------+
    

    Session properties #

    See session properties on the Great Lakes connectivity page to understand how these properties are used.

    The following table describes the session properties supported by the Hive table type.

    Session property Description
    avro_native_reader_enabled Use the native Avro file reader. The default value is false.
    avro_native_writer_enabled Use the native Avro file writer. The default value is false.
    bucket_execution_enabled Enable bucket-aware execution: only use a single worker per bucket. The default value is true.
    collect_column_statistics_on_write Enables automatic column level statistics collection on write. The default value is true.
    create_empty_bucket_files Create empty files for buckets that have no data. The default value is false.
    csv_native_reader_enabled Use the native CSV reader. The default value is false.
    csv_native_writer_enabled Use the native CSV writer. The default value is false.
    force_local_scheduling Only schedule splits on workers colocated with the data node. The default value is false.
    hive_storage_format Default storage format for new tables or partitions. The possible values are:
    • ORC
    • PARQUET
    • AVRO
    • RCBINARY
    • RCTEXT
    • SEQUENCEFILE
    • JSON
    • OPENX_JSON
    • TEXTFILE
    • CSV
    • REGEX
    The default value is ORC.
    hive_views_legacy_translation Use legacy translation to allow read access to Hive views in this catalog. Hive views are defined in HiveQL and stored in the Hive Metastore Service, where they are analyzed to allow read access to the data. Not available when using the Glue or Starburst metastore options. The default value is false.
    ignore_absent_partitions Ignore partitions when the file system location does not exist rather than failing the query. The default value is false.
    ignore_corrupted_statistics Ignore the corrupted statistics rather than failing the query. The default value is false.
    insert_existing_partitions_behavior Behavior when inserting into existing partitions; this session property does not control behavior when inserting into an existing unpartitioned table. The default value is APPEND.
    json_native_reader_enabled Use the native JSON reader. The default value is false.
    json_native_writer_enabled Use the native JSON writer. The default value is false.
    minimum_assigned_split_weight The minimum assigned split weight when size based split weighting is enabled. The default value is 0.05.
    openx_json_native_reader_enabled Use the native OpenX JSON reader. The default value is true.
    openx_json_native_writer_enabled Use the native OpenX JSON writer. The default value is true.
    optimize_mismatched_bucket_count Enable optimization to avoid shuffle when bucket count is compatible but not the same. The default value is false.
    optimize_symlink_listing Optimize the listing for SymlinkTextFormat tables with files in a single directory. The default value is true.
    orc_optimized_writer_max_dictionary_memory The maximum dictionary memory. The default value is 16MB.
    orc_optimized_writer_max_stripe_rows The maximum stripe row count. The default value is 10000000.
    orc_optimized_writer_max_stripe_size The maximum stripe size. The default value is 64MB.
    orc_optimized_writer_min_stripe_size The minimum stripe size. The default value is 32MB.
    orc_optimized_writer_validate Force validation for all files. The default value is false.
    orc_optimized_writer_validate_mode Level of detail in ORC validation. Possible values include HASHED, DETAILED, and BOTH. The default value is BOTH.
    orc_optimized_writer_validate_percentage The sample percentage for validation of files. The default value is 0.
    orc_use_column_names Access ORC columns by name. By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. The default value is false.
    parallel_partitioned_bucketed_writes Improve parallelism of partitioned and bucketed table writes. The default value is true.
    parquet_ignore_statistics Ignore statistics from Parquet to allow querying files with corrupted or incorrect statistics. The default value is false.
    parquet_optimized_writer_validation_percentage The percentage of Parquet files to validate after write by rereading the whole file. Set the value to 0 to turn off validation. The default value is 5.
    parquet_use_column_index Use the Parquet column index. The default value is true.
    parquet_use_column_names Access Parquet columns by name by default. Set this property to false to access columns by their ordinal position in the Hive table definition. The default value is true.
    partition_statistics_sample_size The maximum sample size of the partitions column statistics. The default value is 100.
    propagate_table_scan_sorting_properties Use sorted table layout to generate more efficient execution plans. Note, this may lead to incorrect results if files are not sorted as per table definition. The default value is false.
    query_partition_filter_required Set to true to require that queries against partitioned tables include a filter on at least one partition column for the duration of the session.
    query_partition_filter_required_schemas The list of schemas for which a filter on a partition column is enforced. The default value is array[''].
    rcfile_optimized_writer_validate Validate RCFile writer files. The default value is false.
    regex_native_reader_enabled Use the native REGEX reader. The default value is false.
    respect_table_format Write new partitions using the table format rather than the default storage format. The default value is true.
    s3_select_pushdown_enabled Enable S3 Select pushdown. The default value is false.
    sequence_file_native_reader_enabled Use the native sequence file reader. The default value is false.
    sequence_file_native_writer_enabled Use the native sequence file writer. The default value is false.
    size_based_split_weights_enabled Enable estimating split weights based on size in bytes. The default value is true.
    text_file_native_reader_enabled Use the native text file reader. The default value is false.
    text_file_native_writer_enabled Use the native text file writer. The default value is false.
    timestamp_precision Specifies the precision for the timestamp columns in Hive tables. The possible values are:
    • MILLISECONDS
    • MICROSECONDS
    • NANOSECONDS
    The default value is MILLISECONDS.
    validate_bucketing Verify that data is bucketed correctly when reading. The default value is true.
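
    For example, the following sketch changes the storage format for new tables and partitions for the remainder of the session; the catalog name examplecatalog is illustrative:

    SET SESSION examplecatalog.hive_storage_format = 'PARQUET';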

    Hive SQL support #

    When using the Hive table format with Great Lakes connectivity, the general SQL support details apply, with the following additional consideration:

    • Hive views are not supported.