Hive table format #
Great Lakes connectivity abstracts the details of using different table formats and file types when using object storage catalogs.
This page describes the features specific to the Hive table format when used with Great Lakes connectivity.
Specify Hive format #
Hive connectivity allows querying data stored in an Apache Hive data warehouse. Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.
For an object storage catalog that specifies Hive as its default table
format, no special syntax
is required. A simple CREATE TABLE
statement creates a Hive format table.
To create a Hive table when the default format is not Hive, add type='hive'
as
a table property in your CREATE
statement. For example:
CREATE TABLE customer (
name varchar,
address varchar)
WITH (type='hive');
Hive table format properties #
As part of a CREATE TABLE statement, specify the format of table data files with an additional format property set to one of the values shown in the following table:
File format | Description |
---|---|
ORC | Default value, Apache ORC file format. |
PARQUET | Apache Parquet file format. |
JSON | JSON file format using org.apache.hive.hcatalog.data.JsonSerDe |
CSV | Comma-separated values file format using org.apache.hadoop.hive.serde2.OpenCSVSerde |
TEXTFILE | |
AVRO | Apache Avro file format. |
RCTEXT | RCFile using ColumnarSerDe |
RCBINARY | RCFile using LazyBinaryColumnarSerDe |
SEQUENCEFILE | |
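For example, to create a Hive table that stores its data as Parquet files (the table and column names here are illustrative):
CREATE TABLE store_sales (
  item_id bigint,
  quantity integer,
  unit_price double)
WITH (type='hive', format='PARQUET');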
Hive table properties #
As part of a CREATE TABLE statement, append further comma-separated table properties as needed. Note that many of the available table properties must be used in conjunction with a particular Hive file format.
Table property | Description |
---|---|
auto_purge | Indicates to the configured metastore to perform a purge when a table or partition is deleted, instead of a soft deletion using the trash. |
avro_schema_url | The URI of the Avro schema evolution for the table. |
bucket_count | The number of buckets to group data into. Only valid if used with bucketed_by. Defaults to 0. |
bucketed_by | The bucketing column for the storage table. Only valid if used with bucket_count. Defaults to []. |
bucketing_version | Specifies which Hive bucketing version to use. Valid values are 1 or 2. |
csv_escape | The CSV escape character. Requires CSV format. |
csv_quote | The CSV quote character. Requires CSV format. |
csv_separator | The CSV separator character. Requires CSV format. You can use other separators such as \| or use Unicode to configure invisible separators such as tabs with U&'\0009'. |
external_location | The URI of an external Hive table on S3, Azure Blob Storage, etc. See the basic usage examples for more information. |
format | The table file format. Valid values are ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV, and REGEX. |
null_format | The serialization format for NULL values. Requires TextFile, RCText, or SequenceFile format. |
orc_bloom_filter_columns | Comma-separated list of columns to use for ORC bloom filters. Improves the performance of queries using range predicates when reading ORC files. Requires ORC format. Defaults to []. |
orc_bloom_filter_fpp | The ORC bloom filter false positive probability. Requires ORC format. Defaults to 0.05. |
partitioned_by | The partitioning column for the storage table. The columns listed in the partitioned_by clause must be the last columns as defined in the DDL. Defaults to []. |
partition_projection_type | Defines the type of partition projection to use on a specific column. May be used only on partition columns. Available types include ENUM, INTEGER, and DATE. |
skip_footer_line_count | The number of footer lines to ignore when parsing the file for data. Requires TextFile or CSV format tables. |
skip_header_line_count | The number of header lines to ignore when parsing the file for data. Requires TextFile or CSV format tables. |
sorted_by | The column to sort by to determine bucketing for rows. Only valid if bucketed_by and bucket_count are specified as well. Defaults to []. |
textfile_field_separator | Allows the use of custom field separators, such as \|, for TextFile formatted tables. |
textfile_field_separator_escape | Allows the use of a custom escape character for TextFile formatted tables. |
transactional | Set this property to true to create an ORC ACID transactional table. Requires ORC format. This property may be shown as true for insert-only tables created using older versions of Hive. |
partition_projection_enabled | Enables partition projection for the selected table. Mapped from the AWS Athena table property projection.enabled. |
partition_projection_ignore | Ignore any partition projection properties stored in the metastore for the selected table. This is a Galaxy-only property which allows you to work around compatibility issues on a specific table. If enabled, Galaxy ignores all other configuration options related to partition projection. |
partition_projection_location_template | Projected partition location template, such as s3://test/name=${name}/. Mapped from the AWS Athena table property storage.location.template. Defaults to ${table_location}/${partition_name}. |
extra_properties | Additional properties added to a Hive table. The properties are not used by Galaxy, and are available in the $properties metadata table. The properties are not included in the output of SHOW CREATE TABLE statements. |
Some SQL statements must be supplemented with the type='hive'
property for use
in Starburst Galaxy. For example:
CREATE TABLE hive.web.page_views (
view_time timestamp,
user_id bigint,
page_url varchar,
ds date,
country varchar
)
WITH (
type = 'hive',
format = 'ORC',
partitioned_by = ARRAY['ds', 'country'],
bucketed_by = ARRAY['user_id'],
bucket_count = 50
)
Metadata tables #
Great Lakes connectivity exposes several metadata tables
for the Hive table format. These metadata tables contain information about
the internal structure of the Hive table. Query each metadata table by
appending the metadata table name to the table_name.
$properties #
The $properties
table exposes the parameters of the table in the metastore.
The raw Hive table properties are available as a hidden table, containing a separate column per table property, with a single row containing the property values.
Inspect the property names and values with the following query:
SELECT * FROM catalog_name.schema_name."table_name$properties";
 auto.purge | numfiles |       presto_query_id       | presto_version | totalsize | transactional
------------+----------+-----------------------------+----------------+-----------+---------------
 false      | -1       | 20230705_152456_00001_nfugi | 423            | -1        | false
$partitions #
The $partitions
table provides a list of all partition values of a partitioned
table.
The following example returns all partition values from the table_name
table:
SELECT * FROM catalog_name.schema_name."table_name$partitions";
day | country
------------+---------
2023-07-01 | POL
2023-07-02 | POL
2023-07-03 | POL
2023-03-01 | USA
2023-03-02 | USA
Table procedures #
Use the CALL
statement to perform data manipulation or administrative tasks.
Procedures must include a qualified catalog name. For example, for a catalog
named examplecatalog
:
CALL examplecatalog.system.example_procedure();
The following Hive table procedures are available:
create_empty_partition #
system.create_empty_partition('schema_name', 'table_name', partition_columns, partition_values);
Create an empty partition in the specified table.
The partition columns can have a type other than varchar, such as integer or date, but the values must be specified as varchar in the partition_columns and partition_values arrays. For example, partition_values => ARRAY['2016-08-09', 'US'].
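For example, a call for a hypothetical web.page_views table partitioned by ds and country, in a catalog named examplecatalog:
CALL examplecatalog.system.create_empty_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US']);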
drop_stats #
system.drop_stats('schema_name', 'table_name', partition_values);
Drops statistics for one or more partitions or the entire table. Specify the
partitions as an array whose elements are arrays of partition_values
. The
following example shows partition_values
for dropping stats for multiple
partitions:
partition_values => ARRAY[ARRAY['2016-08-09', 'US'], ARRAY['2023-08-03', 'UK']]
The partition_values
argument is optional. If the partition_values
argument is omitted, stats are dropped for the entire table.
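For example, to drop statistics for the entire hypothetical web.page_views table, omit the partition_values argument:
CALL examplecatalog.system.drop_stats(
    schema_name => 'web',
    table_name => 'page_views');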
register_partition #
system.register_partition('schema_name', 'table_name', partition_columns, partition_values, location);
Registers the existing location as a new partition in the metastore for the specified table.
With the location
argument omitted, the partition location is constructed
using partition_columns
and partition_values
.
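For example, a call that registers an existing object storage location as a partition of the hypothetical web.page_views table; the location URI is illustrative:
CALL examplecatalog.system.register_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US'],
    location => 's3://example-bucket/web/page_views/ds=2016-08-09/country=US');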
sync_partition_metadata #
system.sync_partition_metadata('schema_name', 'table_name', 'mode', case_sensitive);
Check and update partitions list in the metastore. There are three modes available:
- ADD: add any partitions that exist on the file system, but not in the metastore.
- DROP: drop any partitions that exist in the metastore, but not on the file system.
- FULL: perform both ADD and DROP.
The case_sensitive
argument is optional. The default value is true
for
compatibility with Hive’s MSCK REPAIR TABLE
behavior, which expects the
partition column names in file system paths to use lowercase (for example,
col_x=SomeValue
). Partitions on the file system not conforming to this
convention are ignored, unless the argument is set to false
.
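For example, a call that adds and drops partitions to match the file system for the hypothetical web.page_views table:
CALL examplecatalog.system.sync_partition_metadata(
    schema_name => 'web',
    table_name => 'page_views',
    mode => 'FULL');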
unregister_partition #
system.unregister_partition('schema_name', 'table_name', partition_columns, partition_values);
Unregisters a given, existing partition in the metastore for the specified table. The partition data is not deleted.
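For example, a call that removes the metastore entry for one partition of the hypothetical web.page_views table while leaving its data files in place:
CALL examplecatalog.system.unregister_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US']);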
Migrate to Iceberg #
Great Lakes connectivity supports migration from the Hive table format to the Iceberg table format. It can read and write to Hive tables that have been successfully migrated to Iceberg. The migration process creates metadata without copying any data. In addition, table statistics are not generated.
The migration process only supports the Parquet, ORC, and Avro file formats.
Use the following ALTER TABLE SET PROPERTIES statement to convert a table from the Hive table format to the Iceberg table format:
ALTER TABLE catalog_name.schema_name.table_name SET PROPERTIES type = 'ICEBERG';
The statement must be called for a specific catalog, with the relevant schema and table names supplied as the required schema_name and table_name parameters.
Once the migration process is complete, use the SHOW CREATE TABLE
statement to
confirm that the table type has changed:
SHOW CREATE TABLE catalog_name.schema_name.table_name;
Successful migration displays type='ICEBERG'
.
Partition projection #
Partition projection is a feature used to speed up query processing. This feature computes partition values and locations statically from the partition projection definition of the table properties and does not rely on the metastore for retrieving them.
Great Lakes connectivity supports partition projection table properties stored
in the metastore, and it reimplements this functionality. Currently, date
projection only supports intervals of DAYS
, HOURS
, MINUTES
, and SECONDS
.
If there are any compatibility issues blocking access to a requested table when
partition projection is enabled, set the partition_projection_ignore
table
property to true
for a table to bypass any errors.
In addition, Great Lakes connectivity supports the partition_projection_type table property. The partition_projection_type property defines the type of partition projection to use on a specific column. It can be set to enum, integer, or date:
Name | Description |
---|---|
enum | Used when there is a predefined set of values for the partitions, such as country or region names. |
integer | Specifies a range, providing a more concise way to predefine the partition values. For example, [1, 2, 3, 4, ..., 1000] or [0500, 0550, 0600, ..., 2500]. |
date | Specifies a range, providing a more concise way to predefine the partition values. For example, [20200101, 20200102, ..., 20201231] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, ..., 12-31-2020 23:00:00]. |
Read more about Hive table properties.
Using partition projection #
To use partition projection, specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or in an external Hive metastore.
Create a projection with the enum
projection type:
CREATE SCHEMA catalog_name.schema_name;
USE catalog_name.schema_name;
CREATE TABLE enum_table_name (
name varchar,
comment varchar,
nationkey integer,
regionkey integer,
short_name varchar WITH (
partition_projection_type = 'enum',
partition_projection_values = array ['POL', 'CZE']
)
)
WITH (
partitioned_by = array['short_name'],
partition_projection_enabled = true
);
SHOW CREATE TABLE enum_table_name;
INSERT INTO enum_table_name VALUES ('POLAND_1', 'Czesc', 0, 5, 'POL');
INSERT INTO enum_table_name VALUES ('POLAND_2', 'Czesc', 1, 5, 'POL');
INSERT INTO enum_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5, 'CZE');
INSERT INTO enum_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5, 'CZE');
The INSERT INTO statements add the data corresponding to the POL and CZE partitions.
SELECT * FROM "enum_table_name$partitions";
The SELECT statement displays only the POL and CZE partitions, corresponding to the partition projection defined when the table was created rather than the partitions registered in the metastore.
| short_name |
+------------+
| POL |
+------------+
| CZE |
+------------+
Create a projection with the integer
projection range:
CREATE TABLE integer_table_name (
name varchar,
comment varchar,
nationkey integer,
regionkey integer WITH (
partition_projection_type = 'integer',
partition_projection_range = array ['0', '10']
)
)
WITH (
partitioned_by = array['regionkey'],
partition_projection_enabled = true
);
INSERT INTO integer_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
INSERT INTO integer_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
INSERT INTO integer_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
INSERT INTO integer_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
INSERT INTO integer_table_name VALUES ('GERMANY_1', 'Hallo', 12, 8);
INSERT INTO integer_table_name VALUES ('GERMANY_2', 'Hallo', 13, 8);
INSERT INTO integer_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
The INSERT INTO statements add the data corresponding to the defined partitions.
SELECT * FROM "integer_table_name$partitions";
The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined when the table was created, rather than the partitions registered in the metastore.
| regionkey |
+-----------+
| 0 |
+-----------+
| 1 |
+-----------+
| 2 |
+-----------+
| 3 |
+-----------+
| 4 |
+-----------+
| 5 |
+-----------+
| 6 |
+-----------+
| 7 |
+-----------+
| 8 |
+-----------+
| 9 |
+-----------+
| 10 |
+-----------+
Create a projection with the integer
projection range using
partition_projection_interval
:
CREATE TABLE interval_table_name (
name varchar,
comment varchar,
nationkey integer,
regionkey integer WITH (
partition_projection_type = 'integer',
partition_projection_range = array ['0', '10'],
partition_projection_interval = 3
)
)
WITH (
partitioned_by = array['regionkey'],
partition_projection_enabled = true
);
INSERT INTO interval_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
INSERT INTO interval_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
INSERT INTO interval_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
INSERT INTO interval_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
INSERT INTO interval_table_name VALUES ('GERMANY_1', 'Hallo', 12, 6);
INSERT INTO interval_table_name VALUES ('GERMANY_2', 'Hallo', 13, 9);
INSERT INTO interval_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
The INSERT INTO statements add the data corresponding to the defined partitions.
SELECT * FROM "interval_table_name$partitions";
The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined when the table was created, rather than the partitions registered in the metastore.
In the following example output, only 0
, 3
, 6
, and 9
are displayed
because of the chosen partition_projection_interval
value of 3
.
| regionkey |
+-----------+
| 0 |
+-----------+
| 3 |
+-----------+
| 6 |
+-----------+
| 9 |
+-----------+
Create a projection with the date
projection range:
CREATE TABLE date_table_name (
name varchar,
comment varchar,
nationkey integer,
dt varchar WITH (
partition_projection_type = 'date',
partition_projection_format = 'yyyy-MM-dd',
partition_projection_range = array['NOW-5DAYS', 'NOW']
)
)
WITH (
partitioned_by = array['dt'],
partition_projection_enabled = true
);
INSERT INTO date_table_name VALUES ('POLAND_1', 'Czesc', 0, '2023-06-01');
INSERT INTO date_table_name VALUES ('POLAND_2', 'Czesc', 1, '2023-06-01');
INSERT INTO date_table_name VALUES ('CZECH_1', 'Ahoj', 2, '2023-06-29');
INSERT INTO date_table_name VALUES ('CZECH_2', 'Ahoj', 3, '2023-06-01');
INSERT INTO date_table_name VALUES ('GERMANY_1', 'Hallo', 12, '2023-06-01');
INSERT INTO date_table_name VALUES ('GERMANY_2', 'Hallo', 13, '2023-07-02');
INSERT INTO date_table_name VALUES ('MEXICO_1', 'Hola', 22, '2023-07-01');
The INSERT INTO statements add the data corresponding to the defined partitions.
SELECT * FROM "date_table_name$partitions";
The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined when the table was created, rather than the partitions registered in the metastore.
The following example output shows the dt partition values:
| dt |
+------------+
| 2023-08-26 |
+------------+
| 2023-08-27 |
+------------+
| 2023-08-28 |
+------------+
| 2023-08-29 |
+------------+
| 2023-08-30 |
+------------+
| 2023-08-31 |
+------------+
Session properties #
A session property temporarily modifies an
internal Starburst Galaxy setting for the duration of the current connection
session to the cluster. Use the SET SESSION
statement followed by the property
name, then a value such as true
or false
to modify the property:
SET SESSION catalog_name.session_property = expression;
Use the SHOW SESSION statement to view all current session properties. For additional information, read about the SET SESSION and RESET SESSION SQL statements.
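For example, a minimal session-scoped change, assuming a catalog named examplecatalog and using the hive_storage_format property described later in this section:
SET SESSION examplecatalog.hive_storage_format = 'PARQUET';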
Catalog session properties are internal session properties that can be set
on a per-catalog basis. These properties must be set separately for each catalog
by including the catalog name before the property name, for example,
catalog_name.property_name
.
Session properties are linked to the current session, which lets you have multiple connections to a cluster that each have different values for the same session properties. Once a session ends, either by disconnecting or creating a new session, any changes made to session properties during the previous session are lost.
The following sections describe the session properties supported by the Hive table format:
avro_native_reader_enabled #
SET SESSION catalog_name.avro_native_reader_enabled = false;
Use the native Avro file reader. The default value is false
.
avro_native_writer_enabled #
SET SESSION catalog_name.avro_native_writer_enabled = false;
Use the native Avro file writer. The default value is false
.
bucket_execution_enabled #
SET SESSION catalog_name.bucket_execution_enabled = true;
Enable bucket-aware execution: only use a single worker per bucket. The default
value is true
.
collect_column_statistics_on_write #
SET SESSION catalog_name.collect_column_statistics_on_write = true;
Enables automatic column level statistics collection on write. The default value
is true
.
compression_codec #
SET SESSION catalog_name.compression_codec = 'DEFAULT';
The compression codec to use when writing new data files. The possible values are:
NONE
DEFAULT
SNAPPY
LZ4
ZSTD
GZIP
The default value is DEFAULT
.
create_empty_bucket_files #
SET SESSION catalog_name.create_empty_bucket_files = false;
Create empty files for buckets that have no data. The default value is false
.
csv_native_reader_enabled #
SET SESSION catalog_name.csv_native_reader_enabled = false;
Use the native CSV reader. The default value is false
.
csv_native_writer_enabled #
SET SESSION catalog_name.csv_native_writer_enabled = false;
Use the native CSV writer. The default value is false
.
dynamic_filtering_wait_timeout #
SET SESSION catalog_name.dynamic_filtering_wait_timeout = '10s';
The duration to wait for completion of dynamic filtering during split generation. Dynamic filtering optimizations significantly improve the performance of queries with selective joins by avoiding reading of data that would be filtered by join conditions.
This extra wait time can potentially result in significant overall savings in
query and CPU time, if dynamic filtering is able to reduce the amount of scanned
data. In the previous example, the dynamic_filtering_wait_timeout
is set to
10
seconds.
force_local_scheduling #
SET SESSION catalog_name.force_local_scheduling = false;
Only schedule splits on workers colocated with the data node. The default value
is false
.
hive_storage_format #
SET SESSION catalog_name.hive_storage_format = 'ORC';
Default storage format for new tables or partitions. The possible values are:
ORC
PARQUET
AVRO
RCBINARY
RCTEXT
SEQUENCEFILE
JSON
OPENX_JSON
TEXTFILE
CSV
REGEX
The default value is ORC
.
hive_views_legacy_translation #
SET SESSION catalog_name.hive_views_legacy_translation = false;
Use hive_views_legacy_translation on a per-catalog basis to translate Hive views and allow read access to their data. Hive views are defined in HiveQL and stored in the Hive Metastore Service. They are analyzed to allow read access to the data. Not available when using the Glue or Starburst metastore options. The default value is false.
ignore_absent_partitions #
SET SESSION catalog_name.ignore_absent_partitions = false;
Ignore partitions when the file system location does not exist rather than
failing the query. The default value is false
.
ignore_corrupted_statistics #
SET SESSION catalog_name.ignore_corrupted_statistics = false;
Ignore the corrupted statistics rather than failing the query. The default value
is false
.
insert_existing_partitions_behavior #
SET SESSION catalog_name.insert_existing_partitions_behavior = 'APPEND';
Behavior when inserting into existing partitions; this session property does not control behavior when inserting into an existing unpartitioned table. The default value is APPEND.
json_native_reader_enabled #
SET SESSION catalog_name.json_native_reader_enabled = false;
Use the native JSON reader. The default value is false
.
json_native_writer_enabled #
SET SESSION catalog_name.json_native_writer_enabled = false;
Use the native JSON writer. The default value is false
.
minimum_assigned_split_weight #
SET SESSION catalog_name.minimum_assigned_split_weight = 0.05;
The minimum assigned split weight when size based split weighting is enabled.
The default value is 0.05
.
openx_json_native_reader_enabled #
SET SESSION catalog_name.openx_json_native_reader_enabled = true;
Use the native OpenX JSON reader. The default value is true
.
openx_json_native_writer_enabled #
SET SESSION catalog_name.openx_json_native_writer_enabled = true;
Use the native OpenX JSON writer. The default value is true
.
optimize_mismatched_bucket_count #
SET SESSION catalog_name.optimize_mismatched_bucket_count = false;
Enable optimization to avoid shuffle when bucket count is compatible but not the
same. The default value is false
.
optimize_symlink_listing #
SET SESSION catalog_name.optimize_symlink_listing = true;
Optimize the listing for SymlinkTextFormat
tables with files in a single
directory. The default value is true
.
orc_bloom_filters_enabled #
SET SESSION catalog_name.orc_bloom_filters_enabled = false;
Enable bloom filters for predicate pushdown. The default value is false
.
orc_lazy_read_small_ranges #
SET SESSION catalog_name.orc_lazy_read_small_ranges = true;
Read small file segments lazily. The default value is true
.
orc_max_buffer_size #
SET SESSION catalog_name.orc_max_buffer_size = '8MB';
Maximum size of a single read. The default value is 8MB
.
orc_max_merge_distance #
SET SESSION catalog_name.orc_max_merge_distance = '1MB';
Maximum size of gap between two reads to merge into a single read. The default
value is 1MB
.
orc_max_read_block_size #
SET SESSION catalog_name.orc_max_read_block_size = '16MB';
Soft maximum size of Trino blocks produced by ORC reader. The default value is
16MB
.
orc_native_zstd_decompressor_enabled #
SET SESSION catalog_name.orc_native_zstd_decompressor_enabled = true;
Enable using the native zstd
library for faster decompression of ORC files.
The
default value is true
.
orc_nested_lazy_enabled #
SET SESSION catalog_name.orc_nested_lazy_enabled = true;
Lazily read nested data. The default value is true
.
orc_optimized_writer_max_dictionary_memory #
SET SESSION catalog_name.orc_optimized_writer_max_dictionary_memory = '16MB';
The maximum dictionary memory. The default value is 16MB
.
orc_optimized_writer_max_stripe_rows #
SET SESSION catalog_name.orc_optimized_writer_max_stripe_rows = 10000000;
The maximum stripe row count. The default value is 10000000
.
orc_optimized_writer_max_stripe_size #
SET SESSION catalog_name.orc_optimized_writer_max_stripe_size = '64MB';
The maximum stripe size. The default value is 64MB
.
orc_optimized_writer_min_stripe_size #
SET SESSION catalog_name.orc_optimized_writer_min_stripe_size = '32MB';
The minimum stripe size. The default value is 32MB
.
orc_optimized_writer_validate #
SET SESSION catalog_name.orc_optimized_writer_validate = false;
Force validation for all files. The default value is false
.
orc_optimized_writer_validate_mode #
SET SESSION catalog_name.orc_optimized_writer_validate_mode = 'BOTH';
Level of detail in ORC validation. Possible values include HASHED
, DETAILED
,
and BOTH
. The default value is BOTH
.
orc_optimized_writer_validate_percentage #
SET SESSION catalog_name.orc_optimized_writer_validate_percentage = 0;
The sample percentage for validation of files. The default value is 0
.
orc_stream_buffer_size #
SET SESSION catalog_name.orc_stream_buffer_size = '8MB';
Size of buffer for streaming reads. The default value is 8MB
.
orc_string_statistics_limit #
SET SESSION catalog_name.orc_string_statistics_limit = '64B';
Maximum size of the string statistics. The default value is 64B
.
orc_tiny_stripe_threshold #
SET SESSION catalog_name.orc_tiny_stripe_threshold = '8MB';
The threshold below which an ORC stripe or file is read in its entirety. The default value is 8MB.
orc_use_column_names #
SET SESSION catalog_name.orc_use_column_names = false;
Access ORC columns by name. By default, columns in ORC files are accessed by
their ordinal position in the Hive table definition. The default value is
false
.
parallel_partitioned_bucketed_writes #
SET SESSION catalog_name.parallel_partitioned_bucketed_writes = true;
Improve parallelism of partitioned and bucketed table writes. The default value
is true
.
parquet_ignore_statistics #
SET SESSION catalog_name.parquet_ignore_statistics = false;
Ignore statistics from Parquet to allow querying files with corrupted or
incorrect statistics. The default value is false
.
parquet_max_read_block_row_count #
SET SESSION catalog_name.parquet_max_read_block_row_count = 8192;
Sets the maximum number of rows read in a batch. The default value is 8192
.
parquet_max_read_block_size #
SET SESSION catalog_name.parquet_max_read_block_size = '16MB';
The maximum block size used when reading Parquet files. The default value is
16MB
.
parquet_native_snappy_decompressor_enabled #
SET SESSION catalog_name.parquet_native_snappy_decompressor_enabled = true;
Enable using the native SNAPPY library for faster decompression of Parquet files. The default value is true.
parquet_native_zstd_decompressor_enabled #
SET SESSION catalog_name.parquet_native_zstd_decompressor_enabled = true;
Enable using the native ZSTD
library for faster decompression of Parquet
files. The default value is true
.
parquet_optimized_nested_reader_enabled #
SET SESSION catalog_name.parquet_optimized_nested_reader_enabled = true;
Specifies whether batched column readers are used when reading ARRAY
, MAP
,
and ROW
types from Parquet files for improved performance. Set this property
to false
to disable the optimized Parquet reader for structural data types.
The default value is true
.
parquet_optimized_reader_enabled #
SET SESSION catalog_name.parquet_optimized_reader_enabled = true;
Specifies whether batched column readers are used when reading Parquet files for improved performance. Set this property to false to disable the optimized Parquet reader. The default value is true.
parquet_optimized_writer_enabled #
SET SESSION catalog_name.parquet_optimized_writer_enabled = true;
Specifies whether the optimized writer is used when writing Parquet files. The
default value is true
.
parquet_optimized_writer_validation_percentage #
SET SESSION catalog_name.parquet_optimized_writer_validation_percentage = 5;
Percentage of Parquet files to validate after write by rereading the whole
file. Validation is turned off by setting the value to 0
. The default value is
5
.
parquet_use_bloom_filter #
SET SESSION catalog_name.parquet_use_bloom_filter = true;
Specifies whether Bloom filters are used for predicate pushdown when reading Parquet files. The default value is true.
parquet_use_column_index #
SET SESSION catalog_name.parquet_use_column_index = true;
Use the Parquet column index. The default value is true
.
parquet_use_column_names #
SET SESSION catalog_name.parquet_use_column_names = true;
Access Parquet columns by name by default. Set this property to false
to
access columns by their ordinal position in the Hive table definition. The
default value is true
.
parquet_vectorized_decoding_enabled #
SET SESSION catalog_name.parquet_vectorized_decoding_enabled = true;
Enable the use of the Java Vector API for faster decoding of Parquet files. The
default value is true
.
parquet_writer_batch_size #
SET SESSION catalog_name.parquet_writer_batch_size = 10000;
Maximum number of rows processed by the Parquet writer in a batch. The default
value is 10000
.
parquet_writer_block_size #
SET SESSION catalog_name.parquet_writer_block_size = '128MB';
The maximum block size created by the Parquet writer. The default value is
128MB
.
parquet_writer_page_size #
SET SESSION catalog_name.parquet_writer_page_size = '1MB';
The maximum page size created by the Parquet writer. The default value is 1MB
.
partition_statistics_sample_size #
SET SESSION catalog_name.partition_statistics_sample_size = 100;
The maximum sample size of the partitions column statistics. The default value
is 100
.
projection_pushdown_enabled #
SET SESSION catalog_name.projection_pushdown_enabled = true;
Enable projection push down. The default value is true
.
propagate_table_scan_sorting_properties #
SET SESSION catalog_name.propagate_table_scan_sorting_properties = false;
Use sorted table layout to generate more efficient execution plans. Note, this
may lead to incorrect results if files are not sorted as per table definition.
The default value is false
.
query_partition_filter_required #
SET SESSION catalog_name.query_partition_filter_required = false;
Set to true
to temporarily force a query to use a partition filter.
query_partition_filter_required_schemas #
SET SESSION catalog_name.query_partition_filter_required_schemas = array['1'];
The list of schemas for which a filter on the partition column is enforced. The default value is array[''].
rcfile_optimized_writer_validate #
SET SESSION catalog_name.rcfile_optimized_writer_validate = false;
Validate RCFile writer files. The default value is false.
regex_native_reader_enabled #
SET SESSION catalog_name.regex_native_reader_enabled = false;
Use the native REGEX reader. The default value is false
.
respect_table_format #
SET SESSION catalog_name.respect_table_format = true;
Write new partitions using the table format rather than the default storage format. The default value is true.
s3_select_pushdown_enabled #
SET SESSION catalog_name.s3_select_pushdown_enabled = false;
Enable S3 Select pushdown. The default value is false
.
sequence_file_native_reader_enabled #
SET SESSION catalog_name.sequence_file_native_reader_enabled = false;
Use the native sequence file reader. The default value is false
.
sequence_file_native_writer_enabled #
SET SESSION catalog_name.sequence_file_native_writer_enabled = false;
Use the native sequence file writer. The default value is false
.
size_based_split_weights_enabled #
SET SESSION catalog_name.size_based_split_weights_enabled = true;
Enable estimating split weights based on size in bytes. The default value is
true
.
sorted_writing_enabled #
SET SESSION catalog_name.sorted_writing_enabled = true;
Enable sorted writes. The default value is true
.
statistics_enabled #
SET SESSION catalog_name.statistics_enabled = true;
Enables table statistics for performance improvements. The default value is true. If the value is set to false, the following session property is disabled:
target_max_file_size #
SET SESSION catalog_name.target_max_file_size = '1GB';
Target maximum size of written files. The default value is 1GB
.
text_file_native_reader_enabled #
SET SESSION catalog_name.text_file_native_reader_enabled = false;
Use the native text file reader. The default value is false
.
text_file_native_writer_enabled #
SET SESSION catalog_name.text_file_native_writer_enabled = false;
Use the native text file writer. The default value is false
.
timestamp_precision #
SET SESSION catalog_name.timestamp_precision = 'MILLISECONDS';
Specifies the precision for the timestamp columns in Hive tables. The possible values are:
MILLISECONDS
MICROSECONDS
NANOSECONDS
The default value is MILLISECONDS
.
validate_bucketing #
SET SESSION catalog_name.validate_bucketing = true;
Verify that data is bucketed correctly when reading. The default value is
true
.
Hive SQL support #
When using the Hive table format with Great Lakes connectivity, the general SQL support details apply, with the following additional consideration:
- Hive views are not supported.