Hive table format #

    Great Lakes connectivity abstracts the details of using different table formats and file types when using object storage catalogs.

    This page describes the features specific to the Hive table format when used with Great Lakes connectivity.

    Specify Hive format #

    Hive connectivity allows querying data stored in an Apache Hive data warehouse. Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.

    For an object storage catalog that specifies Hive as its default table format, no special syntax is required. A simple CREATE TABLE statement creates a Hive format table.
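
    For example, the following minimal statement, using an illustrative customer table, creates a Hive format table in such a catalog:

    CREATE TABLE customer (
      name varchar,
      address varchar);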

    To create a Hive table when the default format is not Hive, add type='hive' as a table property in your CREATE statement. For example:

    CREATE TABLE customer (
      name varchar,
      address varchar)
    WITH (type='hive');
    

    Hive table format properties #

    As part of a CREATE TABLE statement, specify the format of table data files with an additional format property set to one of the values shown in the following table:

    File format  Description
    ORC          Default value. Apache ORC file format.
    PARQUET      Apache Parquet file format.
    JSON         JSON file format using org.apache.hive.hcatalog.data.JsonSerDe.
    CSV          Comma-separated values file format using org.apache.hadoop.hive.serde2.OpenCSVSerde.
    TEXTFILE     Plain text file format.
    AVRO         Apache Avro file format.
    RCTEXT       RCFile file format using ColumnarSerDe.
    RCBINARY     RCFile file format using LazyBinaryColumnarSerDe.
    SEQUENCEFILE Hadoop SequenceFile file format.

    Hive table properties #

    As part of a CREATE TABLE statement, append further comma-separated table properties as needed. Note that many of the available table properties must be used in conjunction with a particular Hive file format.

    Table property Description
    auto_purge Indicates to the configured metastore to perform a purge when a table or partition is deleted, instead of a soft deletion using the trash.
    avro_schema_url The URI of the Avro schema for the table.
    bucket_count The number of buckets to group data into. Only valid if used with bucketed_by. Defaults to 0.
    bucketed_by The bucketing columns for the storage table. Only valid if used with bucket_count. Defaults to [].
    bucketing_version Specifies which Hive bucketing version to use. Valid values are 1 or 2.
    csv_escape The CSV escape character. Requires CSV format.
    csv_quote The CSV quote character. Requires CSV format.
    csv_separator The CSV separator character. Requires CSV format. You can use other separators such as | or use Unicode to configure invisible separators such as tabs with U&'\0009'.
    external_location The URI of an external Hive table on S3, Azure Blob Storage, etc. See the basic usage examples for more information.
    format The table file format. Valid values are ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV, and REGEX.
    null_format The serialization format for NULL values. Requires TextFile, RCText, or SequenceFile format.
    orc_bloom_filter_columns Comma-separated list of columns to use for ORC bloom filters. Improves the performance of queries using range predicates when reading ORC files. Requires ORC format. Defaults to [].
    orc_bloom_filter_fpp The ORC bloom filter false positive probability. Requires ORC format. Defaults to 0.05.
    partitioned_by The partitioning columns for the storage table. The columns listed in the partitioned_by clause must be the last columns as defined in the DDL. Defaults to [].
    partition_projection_enabled Enables partition projection for the selected table. Mapped from the AWS Athena table property projection.enabled.
    partition_projection_ignore Ignore any partition projection properties stored in the metastore for the selected table. This is a Galaxy-only property which allows you to work around compatibility issues on a specific table. If enabled, Galaxy ignores all other configuration options related to partition projection.
    partition_projection_location_template Projected partition location template, such as s3://test/name=${name}/. Mapped from the AWS Athena table property storage.location.template. Defaults to ${table_location}/${partition_name}.
    partition_projection_type Defines the type of partition projection to use on a specific column. May be used only on partition columns. Available types: ENUM, INTEGER, DATE.
    skip_footer_line_count The number of footer lines to ignore when parsing the file for data. Requires TextFile or CSV format tables.
    skip_header_line_count The number of header lines to ignore when parsing the file for data. Requires TextFile or CSV format tables.
    sorted_by The column to sort by to determine bucketing for rows. Only valid if bucketed_by and bucket_count are specified as well. Defaults to [].
    textfile_field_separator Allows the use of custom field separators, such as |, for TextFile formatted tables.
    textfile_field_separator_escape Allows the use of a custom escape character for TextFile formatted tables.
    transactional Set this property to true to create an ORC ACID transactional table. Requires ORC format. This property may be shown as true for insert-only tables created using older versions of Hive.
    extra_properties Additional properties added to a Hive table. The properties are not used by Galaxy, and are available in the $properties metadata table. The properties are not included in the output of SHOW CREATE TABLE statements.

    Some SQL statements must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:

    CREATE TABLE hive.web.page_views (
      view_time timestamp,
      user_id bigint,
      page_url varchar,
      ds date,
      country varchar
    )
    WITH (
      type = 'hive',
      format = 'ORC',
      partitioned_by = ARRAY['ds', 'country'],
      bucketed_by = ARRAY['user_id'],
      bucket_count = 50
    );
    

    Metadata tables #

    Great Lakes connectivity exposes several metadata tables for the Hive table format. These metadata tables contain information about the internal structure of the Hive table. Query each metadata table by appending the metadata table name to the table name, as shown in the following sections.

    $properties #

    The $properties table exposes the parameters of the table in the metastore.

    The raw Hive table properties are available as a hidden table, containing a separate column per table property, with a single row containing the property values.

    Inspect the property names and values with the following query:

    SELECT * FROM catalog_name.schema_name."table_name$properties";
    
     auto.purge | numfiles |       presto_query_id       | presto_version | totalsize | transactional
    ------------+----------+-----------------------------+----------------+-----------+---------------
     false      |       -1 | 20230705_152456_00001_nfugi | 423            |        -1 | false
    

    $partitions #

    The $partitions table provides a list of all partition values of a partitioned table.

    The following example returns all partition values from the table_name table:

    SELECT * FROM catalog_name.schema_name."table_name$partitions";
    
         day    | country
    ------------+---------
     2023-07-01 | POL
     2023-07-02 | POL
     2023-07-03 | POL
     2023-03-01 | USA
     2023-03-02 | USA
    

    Table procedures #

    Use the CALL statement to perform data manipulation or administrative tasks. Procedures must include a qualified catalog name. For example, for a catalog named examplecatalog:

    CALL examplecatalog.system.example_procedure();
    

    The following Hive table procedures are available:

    create_empty_partition #

    system.create_empty_partition('schema_name', 'table_name', partition_columns, partition_values);
    

    Create an empty partition in the specified table.

    The partition columns can have a type other than varchar, such as integer or date.

    However, the values in the partition_columns and partition_values arrays must always be specified as varchar. For example, partition_values => ARRAY['2016-08-09', 'US'].
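
    For example, assuming the hive.web.page_views table created earlier in this page, the following call creates an empty partition for the given ds and country values:

    CALL hive.system.create_empty_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US']);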

    drop_stats #

    system.drop_stats('schema_name', 'table_name', partition_values);
    

    Drops statistics for one or more partitions or the entire table. Specify the partitions as an array whose elements are arrays of partition_values. The following example shows partition_values for dropping stats for multiple partitions:

    partition_values => ARRAY[ARRAY['2016-08-09', 'US'], ARRAY['2023-08-03', 'UK']]
    

    The partition_values argument is optional. If the partition_values argument is omitted, stats are dropped for the entire table.
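
    For example, the following calls, using the illustrative hive.web.page_views table, drop statistics for two partitions and then for the entire table:

    CALL hive.system.drop_stats(
        schema_name => 'web',
        table_name => 'page_views',
        partition_values => ARRAY[ARRAY['2016-08-09', 'US'], ARRAY['2023-08-03', 'UK']]);

    CALL hive.system.drop_stats(
        schema_name => 'web',
        table_name => 'page_views');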

    register_partition #

    system.register_partition('schema_name', 'table_name', partition_columns, partition_values, location);
    

    Registers the existing location as a new partition in the metastore for the specified table.

    With the location argument omitted, the partition location is constructed using partition_columns and partition_values.
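
    As a sketch, assuming partition data already exists at the given object storage location, the following call registers it as a partition of the illustrative hive.web.page_views table:

    CALL hive.system.register_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US'],
        location => 's3://example-bucket/page_views/ds=2016-08-09/country=US');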

    sync_partition_metadata #

    system.sync_partition_metadata('schema_name', 'table_name', 'mode', case_sensitive);
    

    Checks and updates the partition list in the metastore. There are three modes available:

    • ADD: add any partitions that exist on the file system, but not in the metastore.
    • DROP: drop any partitions that exist in the metastore, but not on the file system.
    • FULL: perform both ADD and DROP.

    The case_sensitive argument is optional. The default value is true for compatibility with Hive’s MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (for example, col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
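
    For example, the following call adds and drops partitions in the metastore so that it matches the file system layout of the illustrative hive.web.page_views table:

    CALL hive.system.sync_partition_metadata('web', 'page_views', 'FULL');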

    unregister_partition #

    system.unregister_partition('schema_name', 'table_name', partition_columns, partition_values);
    

    Unregisters the given, existing partition in the metastore for the specified table. The partition data is not deleted.
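
    For example, the following call removes one partition of the illustrative hive.web.page_views table from the metastore without deleting its data:

    CALL hive.system.unregister_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US']);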

    Migrate to Iceberg #

    Great Lakes connectivity supports migration from the Hive to the Iceberg table format, and can read and write to Hive tables that have been successfully migrated to Iceberg. The migration process creates Iceberg metadata without copying any data, and does not generate table statistics.

    Migration from the Hive to the Iceberg table format only supports the Parquet, ORC, and Avro file formats.

    Use the ALTER TABLE ... SET PROPERTIES statement to convert a table from the Hive table format to the Iceberg table format:

    ALTER TABLE catalog_name.schema_name.table_name SET PROPERTIES type = 'ICEBERG';
    

    Call the statement against the specific catalog, supplying the relevant schema and table names in place of schema_name and table_name.

    Once the migration process is complete, use the SHOW CREATE TABLE statement to confirm that the table type has changed:

    SHOW CREATE TABLE catalog_name.schema_name.table_name;
    

    Successful migration displays type='ICEBERG'.

    Partition projection #

    Partition projection is a feature used to speed up query processing. This feature computes partition values and locations statically from the partition projection definition of the table properties and does not rely on the metastore for retrieving them.

    Great Lakes connectivity supports partition projection table properties stored in the metastore, and it reimplements this functionality. Currently, date projection only supports intervals of DAYS, HOURS, MINUTES, and SECONDS.

    If there are any compatibility issues blocking access to a requested table when partition projection is enabled, set the partition_projection_ignore table property to true for a table to bypass any errors.

    In addition, Great Lakes connectivity supports the partition_projection_type table property.

    The partition_projection_type property defines the type of partition projection to use on a specific column. It can be set to enum, integer, or date:

    Name    Description
    enum    Use when there is a predefined set of partition values, such as country or region names.
    integer Specifies a range, providing a more concise way to predefine the partition values. For example, [1, 2, 3, 4, ..., 1000] or [0500, 0550, 0600, ..., 2500].
    date    Specifies a range, providing a more concise way to predefine the partition values. For example, [20200101, 20200102, ..., 20201231] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, ..., 12-31-2020 23:00:00].

    Read more about Hive table properties.

    Using partition projection #

    To use partition projection, specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or in an external Hive metastore.

    Create a projection with the enum projection type:

    CREATE SCHEMA catalog_name.schema_name;
    USE catalog_name.schema_name;
    
    CREATE TABLE enum_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer,
        short_name varchar WITH (
            partition_projection_type = 'enum',
            partition_projection_values = array ['POL', 'CZE']
        )
    )
    WITH (
        partitioned_by = array['short_name'],
        partition_projection_enabled = true
    );
    
    SHOW CREATE TABLE enum_table_name;
    
    INSERT INTO enum_table_name VALUES ('POLAND_1', 'Czesc', 0, 5, 'POL');
    INSERT INTO enum_table_name VALUES ('POLAND_2', 'Czesc', 1, 5, 'POL');
    INSERT INTO enum_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5, 'CZE');
    INSERT INTO enum_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5, 'CZE');
    

    The INSERT INTO statements add data corresponding to the POL and CZE partitions.

    SELECT * FROM "enum_table_name$partitions";
    

    The SELECT statement displays only the POL and CZE partitions, corresponding to the partition projection defined when the table was created, not to the partitions recorded in the metastore.

    | short_name |
    +------------+
    | POL        |
    +------------+
    | CZE        |
    +------------+
    

    Create a projection with the integer projection range:

    CREATE TABLE integer_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer WITH (
            partition_projection_type = 'integer',
            partition_projection_range = array ['0', '10']
        )
    )
    WITH (
        partitioned_by = array['regionkey'],
        partition_projection_enabled = true
    );
    
    INSERT INTO integer_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
    INSERT INTO integer_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
    INSERT INTO integer_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
    INSERT INTO integer_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
    INSERT INTO integer_table_name VALUES ('GERMANY_1', 'Hallo', 12, 8);
    INSERT INTO integer_table_name VALUES ('GERMANY_2', 'Hallo', 13, 8);
    INSERT INTO integer_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
    

    The INSERT INTO statements add data corresponding to the defined partitions.

    SELECT * FROM "integer_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined when the table was created, not the partitions recorded in the metastore.

    | regionkey |
    +-----------+
    | 0         |
    +-----------+
    | 1         |
    +-----------+
    | 2         |
    +-----------+
    | 3         |
    +-----------+
    | 4         |
    +-----------+
    | 5         |
    +-----------+
    | 6         |
    +-----------+
    | 7         |
    +-----------+
    | 8         |
    +-----------+
    | 9         |
    +-----------+
    | 10        |
    +-----------+
    

    Create a projection with the integer projection range using partition_projection_interval:

    CREATE TABLE interval_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer WITH (
            partition_projection_type = 'integer',
            partition_projection_range = array ['0', '10'],
            partition_projection_interval = 3
        )
    )
    WITH (
        partitioned_by = array['regionkey'],
        partition_projection_enabled = true
    );
    
    INSERT INTO interval_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
    INSERT INTO interval_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
    INSERT INTO interval_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
    INSERT INTO interval_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
    INSERT INTO interval_table_name VALUES ('GERMANY_1', 'Hallo', 12, 6);
    INSERT INTO interval_table_name VALUES ('GERMANY_2', 'Hallo', 13, 9);
    INSERT INTO interval_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
    

    The INSERT INTO statements add data corresponding to the defined partitions.

    SELECT * FROM "interval_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined when the table was created, not the partitions recorded in the metastore.

    In the following example output, only 0, 3, 6, and 9 are displayed because of the chosen partition_projection_interval value of 3.

    | regionkey |
    +-----------+
    | 0         |
    +-----------+
    | 3         |
    +-----------+
    | 6         |
    +-----------+
    | 9         |
    +-----------+
    

    Create a projection with the date projection range:

    CREATE TABLE date_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        dt varchar WITH (
            partition_projection_type = 'date',
            partition_projection_format = 'yyyy-MM-dd',
            partition_projection_range = array['NOW-5DAYS', 'NOW']
        )
    )
    WITH (
        partitioned_by = array['dt'],
        partition_projection_enabled = true
    );
    
    INSERT INTO date_table_name VALUES ('POLAND_1', 'Czesc', 0, '2023-06-01');
    INSERT INTO date_table_name VALUES ('POLAND_2', 'Czesc', 1, '2023-06-01');
    INSERT INTO date_table_name VALUES ('CZECH_1', 'Ahoj', 2, '2023-06-29');
    INSERT INTO date_table_name VALUES ('CZECH_2', 'Ahoj', 3, '2023-06-01');
    INSERT INTO date_table_name VALUES ('GERMANY_1', 'Hallo', 12, '2023-06-01');
    INSERT INTO date_table_name VALUES ('GERMANY_2', 'Hallo', 13, '2023-07-02');
    INSERT INTO date_table_name VALUES ('MEXICO_1', 'Hola', 22, '2023-07-01');
    

    The INSERT INTO statements add data corresponding to the defined partitions.

    SELECT * FROM "date_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined when the table was created, not the partitions recorded in the metastore.

    The following example output shows the values of the dt partition column:

    | dt         |
    +------------+
    | 2023-08-26 |
    +------------+
    | 2023-08-27 |
    +------------+
    | 2023-08-28 |
    +------------+
    | 2023-08-29 |
    +------------+
    | 2023-08-30 |
    +------------+
    | 2023-08-31 |
    +------------+
    

    Session properties #

    A session property temporarily modifies an internal Starburst Galaxy setting for the duration of the current connection session to the cluster. Use the SET SESSION statement followed by the property name, then a value such as true or false to modify the property:

    SET SESSION catalog_name.session_property = expression;
    

    Use the SHOW SESSION statement to view all current session properties. For additional information, read about the SET SESSION and RESET SESSION SQL statements.

    Catalog session properties are internal session properties that can be set on a per-catalog basis. These properties must be set separately for each catalog by including the catalog name before the property name, for example, catalog_name.property_name.

    Session properties are linked to the current session, which lets you have multiple connections to a cluster that each have different values for the same session properties. Once a session ends, either by disconnecting or creating a new session, any changes made to session properties during the previous session are lost.
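
    For example, assuming a catalog named examplecatalog, the following statements override the compression codec described later in this page for the current session only, and then restore its default:

    SET SESSION examplecatalog.compression_codec = 'ZSTD';
    RESET SESSION examplecatalog.compression_codec;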

    The following sections describe the session properties supported by the Hive table format:

    avro_native_reader_enabled #

    SET SESSION catalog_name.avro_native_reader_enabled = false;
    

    Use the native Avro file reader. The default value is false.

    avro_native_writer_enabled #

    SET SESSION catalog_name.avro_native_writer_enabled = false;
    

    Use the native Avro file writer. The default value is false.

    bucket_execution_enabled #

    SET SESSION catalog_name.bucket_execution_enabled = true;
    

    Enable bucket-aware execution: only use a single worker per bucket. The default value is true.

    collect_column_statistics_on_write #

    SET SESSION catalog_name.collect_column_statistics_on_write = true;
    

    Enables automatic column level statistics collection on write. The default value is true.

    compression_codec #

    SET SESSION catalog_name.compression_codec = 'DEFAULT';
    

    The compression codec to use when writing new data files. The possible values are:

    • NONE
    • DEFAULT
    • SNAPPY
    • LZ4
    • ZSTD
    • GZIP

    The default value is DEFAULT.

    create_empty_bucket_files #

    SET SESSION catalog_name.create_empty_bucket_files = false;
    

    Create empty files for buckets that have no data. The default value is false.

    csv_native_reader_enabled #

    SET SESSION catalog_name.csv_native_reader_enabled = false;
    

    Use the native CSV reader. The default value is false.

    csv_native_writer_enabled #

    SET SESSION catalog_name.csv_native_writer_enabled = false;
    

    Use the native CSV writer. The default value is false.

    dynamic_filtering_wait_timeout #

    SET SESSION catalog_name.dynamic_filtering_wait_timeout = '10s';
    

    The duration to wait for completion of dynamic filtering during split generation. Dynamic filtering optimizations significantly improve the performance of queries with selective joins by avoiding reading of data that would be filtered by join conditions.

    This extra wait time can potentially result in significant overall savings in query and CPU time, if dynamic filtering is able to reduce the amount of scanned data. In the previous example, the dynamic_filtering_wait_timeout is set to 10 seconds.

    force_local_scheduling #

    SET SESSION catalog_name.force_local_scheduling = false;
    

    Only schedule splits on workers colocated with the data node. The default value is false.

    hive_storage_format #

    SET SESSION catalog_name.hive_storage_format = 'ORC';
    

    Default storage format for new tables or partitions. The possible values are:

    • ORC
    • PARQUET
    • AVRO
    • RCBINARY
    • RCTEXT
    • SEQUENCEFILE
    • JSON
    • OPENX_JSON
    • TEXTFILE
    • CSV
    • REGEX

    The default value is ORC.

    hive_views_legacy_translation #

    SET SESSION catalog_name.hive_views_legacy_translation = false;
    

    Use the legacy algorithm to translate Hive views for this catalog. Hive views are defined in HiveQL and stored in the Hive Metastore Service; they are analyzed to allow read access to the data. Not available when using the Glue or Starburst metastore options. The default value is false.

    ignore_absent_partitions #

    SET SESSION catalog_name.ignore_absent_partitions = false;
    

    Ignore partitions when the file system location does not exist rather than failing the query. The default value is false.

    ignore_corrupted_statistics #

    SET SESSION catalog_name.ignore_corrupted_statistics = false;
    

    Ignore the corrupted statistics rather than failing the query. The default value is false.

    insert_existing_partitions_behavior #

    SET SESSION catalog_name.insert_existing_partitions_behavior = 'APPEND';
    

    Behavior when inserting into existing partitions. This session property does not control behavior when inserting into an existing unpartitioned table. The default value is APPEND.
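
    Trino-based engines also accept the values OVERWRITE and ERROR for this property; assuming those values are available in Galaxy, the following statement makes inserts replace data in existing partitions for the current session:

    SET SESSION catalog_name.insert_existing_partitions_behavior = 'OVERWRITE';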

    json_native_reader_enabled #

    SET SESSION catalog_name.json_native_reader_enabled = false;
    

    Use the native JSON reader. The default value is false.

    json_native_writer_enabled #

    SET SESSION catalog_name.json_native_writer_enabled = false;
    

    Use the native JSON writer. The default value is false.

    minimum_assigned_split_weight #

    SET SESSION catalog_name.minimum_assigned_split_weight = 0.05;
    

    The minimum assigned split weight when size based split weighting is enabled. The default value is 0.05.

    openx_json_native_reader_enabled #

    SET SESSION catalog_name.openx_json_native_reader_enabled = true;
    

    Use the native OpenX JSON reader. The default value is true.

    openx_json_native_writer_enabled #

    SET SESSION catalog_name.openx_json_native_writer_enabled = true;
    

    Use the native OpenX JSON writer. The default value is true.

    optimize_mismatched_bucket_count #

    SET SESSION catalog_name.optimize_mismatched_bucket_count = false;
    
    

    Enable optimization to avoid shuffle when bucket count is compatible but not the same. The default value is false.

    optimize_symlink_listing #

    SET SESSION catalog_name.optimize_symlink_listing = true;
    

    Optimize the listing for SymlinkTextFormat tables with files in a single directory. The default value is true.

    orc_bloom_filters_enabled #

    SET SESSION catalog_name.orc_bloom_filters_enabled = false;
    

    Enable bloom filters for predicate pushdown. The default value is false.

    orc_lazy_read_small_ranges #

    SET SESSION catalog_name.orc_lazy_read_small_ranges = true;
    

    Read small file segments lazily. The default value is true.

    orc_max_buffer_size #

    SET SESSION catalog_name.orc_max_buffer_size = '8MB';
    

    Maximum size of a single read. The default value is 8MB.

    orc_max_merge_distance #

    SET SESSION catalog_name.orc_max_merge_distance = '1MB';
    

    Maximum size of gap between two reads to merge into a single read. The default value is 1MB.

    orc_max_read_block_size #

    SET SESSION catalog_name.orc_max_read_block_size = '16MB';
    

    Soft maximum size of Trino blocks produced by ORC reader. The default value is 16MB.

    orc_native_zstd_decompressor_enabled #

    SET SESSION catalog_name.orc_native_zstd_decompressor_enabled = true;
    

    Enable using the native zstd library for faster decompression of ORC files. The default value is true.

    orc_nested_lazy_enabled #

    SET SESSION catalog_name.orc_nested_lazy_enabled = true;
    

    Lazily read nested data. The default value is true.

    orc_optimized_writer_max_dictionary_memory #

    SET SESSION catalog_name.orc_optimized_writer_max_dictionary_memory = '16MB';
    

    The maximum dictionary memory. The default value is 16MB.

    orc_optimized_writer_max_stripe_rows #

    SET SESSION catalog_name.orc_optimized_writer_max_stripe_rows = 10000000;
    

    The maximum stripe row count. The default value is 10000000.

    orc_optimized_writer_max_stripe_size #

    SET SESSION catalog_name.orc_optimized_writer_max_stripe_size = '64MB';
    

    The maximum stripe size. The default value is 64MB.

    orc_optimized_writer_min_stripe_size #

    SET SESSION catalog_name.orc_optimized_writer_min_stripe_size = '32MB';
    

    The minimum stripe size. The default value is 32MB.

    orc_optimized_writer_validate #

    SET SESSION catalog_name.orc_optimized_writer_validate = false;
    
    

    Force validation for all files. The default value is false.

    orc_optimized_writer_validate_mode #

    SET SESSION catalog_name.orc_optimized_writer_validate_mode = 'BOTH';
    

    Level of detail in ORC validation. Possible values include HASHED, DETAILED, and BOTH. The default value is BOTH.

    orc_optimized_writer_validate_percentage #

    SET SESSION catalog_name.orc_optimized_writer_validate_percentage = 0;
    

    The sample percentage for validation of files. The default value is 0.

    orc_stream_buffer_size #

    SET SESSION catalog_name.orc_stream_buffer_size = '8MB';
    

    Size of buffer for streaming reads. The default value is 8MB.

    orc_string_statistics_limit #

    SET SESSION catalog_name.orc_string_statistics_limit = '64B';
    

    Maximum size of the string statistics. The default value is 64B.

    orc_tiny_stripe_threshold #

    SET SESSION catalog_name.orc_tiny_stripe_threshold = '8MB';
    

    The threshold below which an ORC stripe or file is read in its entirety. The default value is 8MB.

    orc_use_column_names #

    SET SESSION catalog_name.orc_use_column_names = false;
    

    Access ORC columns by name. By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. The default value is false.

    parallel_partitioned_bucketed_writes #

    SET SESSION catalog_name.parallel_partitioned_bucketed_writes = true;
    

    Improve parallelism of partitioned and bucketed table writes. The default value is true.

    parquet_ignore_statistics #

    SET SESSION catalog_name.parquet_ignore_statistics = false;
    

    Ignore statistics from Parquet to allow querying files with corrupted or incorrect statistics. The default value is false.

    parquet_max_read_block_row_count #

    SET SESSION catalog_name.parquet_max_read_block_row_count = 8192;
    

    Sets the maximum number of rows read in a batch. The default value is 8192.

    parquet_max_read_block_size #

    SET SESSION catalog_name.parquet_max_read_block_size = '16MB';
    

    The maximum block size used when reading Parquet files. The default value is 16MB.

    parquet_native_snappy_decompressor_enabled #

    SET SESSION catalog_name.parquet_native_snappy_decompressor_enabled = true;
    

    Enable using the native SNAPPY library for faster decompression of Parquet files. The default value is true.

    parquet_native_zstd_decompressor_enabled #

    SET SESSION catalog_name.parquet_native_zstd_decompressor_enabled = true;
    

    Enable using the native ZSTD library for faster decompression of Parquet files. The default value is true.

    parquet_optimized_nested_reader_enabled #

    SET SESSION catalog_name.parquet_optimized_nested_reader_enabled = true;
    

    Specifies whether batched column readers are used when reading ARRAY, MAP, and ROW types from Parquet files for improved performance. Set this property to false to disable the optimized Parquet reader for structural data types. The default value is true.

    parquet_optimized_reader_enabled #

    SET SESSION catalog_name.parquet_optimized_reader_enabled = true;
    

    Specifies whether batched column readers are used when reading Parquet files for improved performance. Set this property to false to disable the optimized Parquet reader. The default value is true.

    parquet_optimized_writer_enabled #

    SET SESSION catalog_name.parquet_optimized_writer_enabled = true;
    

    Specifies whether the optimized writer is used when writing Parquet files. The default value is true.

    parquet_optimized_writer_validation_percentage #

    SET SESSION catalog_name.parquet_optimized_writer_validation_percentage = 5;
    

    Percentage of Parquet files to validate after write by rereading the whole file. Validation is turned off by setting the value to 0. The default value is 5.

    parquet_use_bloom_filter #

    SET SESSION catalog_name.parquet_use_bloom_filter = true;
    

    Specifies whether Bloom filters are used for predicate pushdown when reading Parquet files. The default value is true.

    parquet_use_column_index #

    SET SESSION catalog_name.parquet_use_column_index = true;
    

    Use the Parquet column index. The default value is true.

    parquet_use_column_names #

    SET SESSION catalog_name.parquet_use_column_names = true;
    

    Access Parquet columns by name by default. Set this property to false to access columns by their ordinal position in the Hive table definition. The default value is true.

    parquet_vectorized_decoding_enabled #

    SET SESSION catalog_name.parquet_vectorized_decoding_enabled = true;
    

    Enable the use of the Java Vector API for faster decoding of Parquet files. The default value is true.

    parquet_writer_batch_size #

    SET SESSION catalog_name.parquet_writer_batch_size = 10000;
    

    Maximum number of rows processed by the Parquet writer in a batch. The default value is 10000.

    parquet_writer_block_size #

    SET SESSION catalog_name.parquet_writer_block_size = '128MB';
    

    The maximum block size created by the Parquet writer. The default value is 128MB.

    parquet_writer_page_size #

    SET SESSION catalog_name.parquet_writer_page_size = '1MB';
    

    The maximum page size created by the Parquet writer. The default value is 1MB.

    partition_statistics_sample_size #

    SET SESSION catalog_name.partition_statistics_sample_size = 100;
    

    The maximum sample size of the partition column statistics. The default value is 100.

    projection_pushdown_enabled #

    SET SESSION catalog_name.projection_pushdown_enabled = true;
    

    Enable projection push down. The default value is true.

    propagate_table_scan_sorting_properties #

    SET SESSION catalog_name.propagate_table_scan_sorting_properties = false;
    

    Use sorted table layout to generate more efficient execution plans. Note that this may lead to incorrect results if files are not sorted according to the table definition. The default value is false.

    query_partition_filter_required #

    SET SESSION catalog_name.query_partition_filter_required = false;
    

    Set to true to require that queries use a partition filter for the current session. The default value is false.

    query_partition_filter_required_schemas #

    SET SESSION catalog_name.query_partition_filter_required_schemas = array['1'];
    

    The list of schemas for which a filter on the partition column is enforced. The default value is array[''].

    rcfile_optimized_writer_validate #

    SET SESSION catalog_name.rcfile_optimized_writer_validate = false;
    

    Validate RCFile writer files. The default value is false.

    regex_native_reader_enabled #

    SET SESSION catalog_name.regex_native_reader_enabled = false;
    

    Use the native REGEX reader. The default value is false.

    respect_table_format #

    SET SESSION catalog_name.respect_table_format = true;
    

    Write new partitions using the table format rather than the default storage format. The default value is true.

    s3_select_pushdown_enabled #

    SET SESSION catalog_name.s3_select_pushdown_enabled = false;
    

    Enable S3 Select pushdown. The default value is false.

    sequence_file_native_reader_enabled #

    SET SESSION catalog_name.sequence_file_native_reader_enabled = false;
    

    Use the native sequence file reader. The default value is false.

    sequence_file_native_writer_enabled #

    SET SESSION catalog_name.sequence_file_native_writer_enabled = false;
    

    Use the native sequence file writer. The default value is false.

    size_based_split_weights_enabled #

    SET SESSION catalog_name.size_based_split_weights_enabled = true;
    

    Enable estimating split weights based on size in bytes. The default value is true.

    sorted_writing_enabled #

    SET SESSION catalog_name.sorted_writing_enabled = true;
    

    Enable sorted writes. The default value is true.

    statistics_enabled #

    SET SESSION catalog_name.statistics_enabled = true;
    

    Enables table statistics for performance improvements. The default value is true. If the value is set to false, the collect_column_statistics_on_write session property is also disabled.

    target_max_file_size #

    SET SESSION catalog_name.target_max_file_size = '1GB';
    

    Target maximum size of written files. The default value is 1GB.

    text_file_native_reader_enabled #

    SET SESSION catalog_name.text_file_native_reader_enabled = false;
    

    Use the native text file reader. The default value is false.

    text_file_native_writer_enabled #

    SET SESSION catalog_name.text_file_native_writer_enabled = false;
    

    Use the native text file writer. The default value is false.

    timestamp_precision #

    SET SESSION catalog_name.timestamp_precision = 'MILLISECONDS';
    

    Specifies the precision for the timestamp columns in Hive tables. The possible values are:

    • MILLISECONDS
    • MICROSECONDS
    • NANOSECONDS

    The default value is MILLISECONDS.

    validate_bucketing #

    SET SESSION catalog_name.validate_bucketing = true;
    

    Verify that data is bucketed correctly when reading. The default value is true.

    Hive SQL support #

    When using the Hive table format with Great Lakes connectivity, the general SQL support details apply, with the following additional consideration:

    • Hive views are not supported.