Great Lakes connectivity abstracts the details of using different table formats and file types when using object storage catalogs.
This page describes the features specific to the Hive table format when used with Great Lakes connectivity.
Hive connectivity allows querying data stored in an Apache Hive data warehouse. Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.
For an object storage catalog that specifies Hive as its default table format, no special syntax is required. A simple CREATE TABLE statement creates a Hive format table.
To create a Hive table when the default format is not Hive, add type='hive' as a table property in your CREATE statement. For example:
CREATE TABLE customer (
    name varchar,
    address varchar
)
WITH (type = 'hive');
As part of a CREATE TABLE statement, specify the format of table data files with an additional format property, using one of the values shown in the following table:
File format | Description |
---|---|
ORC | Default value. Apache ORC file format. |
PARQUET | Apache Parquet file format. |
JSON | JSON file format using org.apache.hive.hcatalog.data.JsonSerDe. |
CSV | Comma-separated values file format using org.apache.hadoop.hive.serde2.OpenCSVSerde. |
TEXTFILE | Plain text file format. |
AVRO | Apache Avro file format. |
RCTEXT | RCFile using ColumnarSerDe. |
RCBINARY | RCFile using LazyBinaryColumnarSerDe. |
SEQUENCEFILE | Flat file format using binary key/value pairs. |
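For example, a minimal sketch of a CREATE TABLE statement that combines the type and format properties; the table and column names are illustrative:
-- Create a Hive table that stores data as Parquet files
CREATE TABLE events (
    event_id bigint,
    payload varchar
)
WITH (
    type = 'hive',
    format = 'PARQUET'
);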
As part of a CREATE TABLE statement, append further comma-separated table properties as needed. Note that many of the available table properties must be used in conjunction with a particular Hive file format.
Table property | Description |
---|---|
auto_purge | Indicates to the configured metastore to perform a purge when a table or partition is deleted, instead of a soft deletion using the trash. |
avro_schema_url | The URI of the Avro schema evolution for the table. |
bucket_count | The number of buckets to group data into. Only valid if used with bucketed_by. Defaults to 0. |
bucketed_by | The bucketing column for the storage table. Only valid if used with bucket_count. Defaults to []. |
bucketing_version | Specifies which Hive bucketing version to use. Valid values are 1 or 2. |
csv_escape | The CSV escape character. Requires CSV format. |
csv_quote | The CSV quote character. Requires CSV format. |
csv_separator | The CSV separator character. Requires CSV format. You can use other separators such as \| or use Unicode to configure invisible separators such as tabs with U&'\0009'. |
external_location | The URI of an external Hive table on S3, Azure Blob Storage, etc. See the basic usage examples for more information. |
format | The table file format. Valid values are ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV, and REGEX. |
null_format | The serialization format for NULL values. Requires TextFile, RCText, or SequenceFile format. |
orc_bloom_filter_columns | Comma-separated list of columns to use for ORC bloom filters. Improves the performance of queries using range predicates when reading ORC files. Requires ORC format. Defaults to []. |
orc_bloom_filter_fpp | The ORC bloom filters false positive probability. Requires ORC format. Defaults to 0.05. |
partitioned_by | The partitioning column for the storage table. The columns listed in the partitioned_by clause must be the last columns as defined in the DDL. Defaults to []. |
partition_projection_type | Defines the type of partition projection to use on a specific column. May be used only on partition columns. Available types: ENUM, INTEGER, DATE. |
skip_footer_line_count | The number of footer lines to ignore when parsing the file for data. Requires TextFile or CSV format tables. |
skip_header_line_count | The number of header lines to ignore when parsing the file for data. Requires TextFile or CSV format tables. |
sorted_by | The column to sort by to determine bucketing for rows. Only valid if bucketed_by and bucket_count are specified as well. Defaults to []. |
textfile_field_separator | Allows the use of custom field separators, such as \|, for TextFile formatted tables. |
textfile_field_separator_escape | Allows the use of a custom escape character for TextFile formatted tables. |
transactional | Set this property to true to create an ORC ACID transactional table. Requires ORC format. This property may be shown as true for insert-only tables created using older versions of Hive. |
partition_projection_enabled | Enables partition projection for the selected table. Mapped from the AWS Athena table property projection.enabled. |
partition_projection_ignore | Ignore any partition projection properties stored in the metastore for the selected table. This is a Galaxy-only property which allows you to work around compatibility issues on a specific table; if enabled, Galaxy ignores all other configuration options related to partition projection. |
partition_projection_location_template | Projected partition location template, such as s3://test/name=${name}/. Mapped from the AWS Athena table property storage.location.template. Defaults to ${table_location}/${partition_name}. |
extra_properties | Additional properties added to a Hive table. The properties are not used by Galaxy, and are available in the $properties metadata table. The properties are not included in the output of SHOW CREATE TABLE statements. |
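As an illustration, the extra_properties property takes a map from property names to values. A minimal sketch, with a hypothetical table name and property key:
-- Attach arbitrary metadata to a Hive table; visible in the $properties metadata table
CREATE TABLE example_extra (
    c1 integer
)
WITH (
    type = 'hive',
    extra_properties = MAP(ARRAY['custom.key'], ARRAY['custom_value'])
);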
Some SQL statements must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:
CREATE TABLE hive.web.page_views (
    view_time timestamp,
    user_id bigint,
    page_url varchar,
    ds date,
    country varchar
)
WITH (
    type = 'hive',
    format = 'ORC',
    partitioned_by = ARRAY['ds', 'country'],
    bucketed_by = ARRAY['user_id'],
    bucket_count = 50
);
The $properties metadata table provides general configuration information about a Hive table. Query the metadata table by appending $properties to a table name:
SELECT * FROM hive.web."page_views$properties";
Use the CALL statement to perform data manipulation or administrative tasks. Procedures must include a qualified catalog name. For example, for a catalog named examplecatalog:
CALL examplecatalog.system.example_procedure();
The following Hive table procedures are available:
system.create_empty_partition('schema_name', 'table_name', partition_columns, partition_values);
Create an empty partition in the specified table. The partition columns can have a type other than varchar, such as integer or date, but the values must be specified as varchar in the partition_columns and partition_values arrays. For example: partition_values => ARRAY['2016-08-09', 'US'].
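For example, a sketch that creates an empty partition in the page_views table from the earlier example; the catalog name examplecatalog is an assumption:
-- Add an empty ds=2016-08-09/country=US partition to web.page_views
CALL examplecatalog.system.create_empty_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US']);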
system.drop_stats('schema_name', 'table_name', partition_values);
Drops statistics for one or more partitions or the entire table. Specify the partitions as an array whose elements are arrays of partition_values. The following example shows partition_values for dropping stats for multiple partitions:
partition_values => ARRAY[ARRAY['2016-08-09', 'US'], ARRAY['2023-08-03', 'UK']]
The partition_values argument is optional. If the partition_values argument is omitted, stats are dropped for the entire table.
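A hedged sketch of the full call, reusing the hypothetical page_views table and examplecatalog catalog:
-- Drop statistics for two specific partitions
CALL examplecatalog.system.drop_stats(
    schema_name => 'web',
    table_name => 'page_views',
    partition_values => ARRAY[ARRAY['2016-08-09', 'US'], ARRAY['2023-08-03', 'UK']]);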
system.register_partition('schema_name', 'table_name', partition_columns, partition_values, location);
Registers the existing location as a new partition in the metastore for the specified table. With the location argument omitted, the partition location is constructed using partition_columns and partition_values.
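For example, a sketch that registers an existing location as a partition; the bucket path, catalog, and table names are assumptions:
-- Register a directory of existing data files as a partition
CALL examplecatalog.system.register_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US'],
    location => 's3://example-bucket/page_views/ds=2016-08-09/country=US');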
system.sync_partition_metadata('schema_name', 'table_name', 'mode', case_sensitive);
Check and update the partitions list in the metastore. There are three modes available:
- ADD: add any partitions that exist on the file system, but not in the metastore.
- DROP: drop any partitions that exist in the metastore, but not on the file system.
- FULL: perform both ADD and DROP.
The case_sensitive argument is optional. The default value is true for compatibility with Hive's MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (for example, col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
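For example, a sketch that both adds and drops partitions to match the file system; the catalog and table names are assumptions:
-- Reconcile metastore partitions with the file system in both directions
CALL examplecatalog.system.sync_partition_metadata(
    schema_name => 'web',
    table_name => 'page_views',
    mode => 'FULL');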
system.unregister_partition('schema_name', 'table_name', partition_columns, partition_values);
Unregisters the given, existing partition in the metastore for the specified table. The partition data is not deleted.
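For example, a sketch that removes a partition from the metastore while leaving its data files in place; names are assumptions:
-- Unregister the partition; the underlying files remain in storage
CALL examplecatalog.system.unregister_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US']);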
Great Lakes connectivity supports migration from the Hive to the Iceberg table format. It can read from and write to Hive tables that have been successfully migrated to Iceberg. The migration process creates metadata without copying any data. In addition, table statistics are not generated.
The migration process from Hive to Iceberg table format only supports Parquet, ORC, and AVRO file formats.
Use the ALTER TABLE SET PROPERTIES statement to convert a table from the Hive table format to the Iceberg table format:
ALTER TABLE catalog_name.schema_name.table_name SET PROPERTIES type = 'ICEBERG';
Call the statement on a specific catalog, supplying the relevant schema and table names for the required schema_name and table_name parameters.
Once the migration process is complete, use the SHOW CREATE TABLE statement to confirm that the table type has changed:
SHOW CREATE TABLE example_catalog.schema_name.table_name;
Successful migration displays type='ICEBERG'.
Partition projection is a feature used to speed up query processing. This feature computes partition values and locations statically from the partition projection definition of the table properties and does not rely on the metastore for retrieving them.
Great Lakes connectivity supports partition projection table properties stored in the metastore, and it reimplements this functionality. Currently, date projection only supports intervals of DAYS, HOURS, MINUTES, and SECONDS.
If there are any compatibility issues blocking access to a requested table when partition projection is enabled, set the partition_projection_ignore table property to true for that table to bypass any errors.
In addition, Great Lakes connectivity supports the partition_projection_type table property, which defines the type of partition projection to use on a specific column. It can be set to enum, integer, or date:
Name | Description |
---|---|
enum | Used when there is a predefined set of values for the partitions, such as country or region names. |
integer | Specifies a range, providing a more concise way to predefine the partition values. For example, [1, 2, 3, 4, ..., 1000] or [0500, 0550, 0600, ..., 2500]. |
date | Specifies a range, providing a more concise way to predefine the partition values. For example, [20200101, 20200102, ..., 20201231] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, ..., 12-31-2020 23:00:00]. |
Read more about Hive table properties.
To use partition projection, specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or in an external Hive metastore.
Create a projection with the enum projection type:
CREATE SCHEMA catalog_name.schema_name;
USE catalog_name.schema_name;
CREATE TABLE enum_table_name (
    name varchar,
    comment varchar,
    nationkey integer,
    regionkey integer,
    short_name varchar WITH (
        partition_projection_type = 'enum',
        partition_projection_values = ARRAY['POL', 'CZE']
    )
)
WITH (
    partitioned_by = ARRAY['short_name'],
    partition_projection_enabled = true
);
SHOW CREATE TABLE enum_table_name;
INSERT INTO enum_table_name VALUES ('POLAND_1', 'Czesc', 0, 5, 'POL');
INSERT INTO enum_table_name VALUES ('POLAND_2', 'Czesc', 1, 5, 'POL');
INSERT INTO enum_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5, 'CZE');
INSERT INTO enum_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5, 'CZE');
The INSERT INTO statements add the data corresponding to the POL and CZE partitions.
SELECT * FROM "enum_table_name$partitions";
The SELECT statement displays only the partitions POL and CZE, corresponding to the partition projection defined while creating the table, not the metastore.
| short_name |
+------------+
| POL |
+------------+
| CZE |
+------------+
Create a projection with the integer projection range:
CREATE TABLE integer_table_name (
    name varchar,
    comment varchar,
    nationkey integer,
    regionkey integer WITH (
        partition_projection_type = 'integer',
        partition_projection_range = ARRAY['0', '10']
    )
)
WITH (
    partitioned_by = ARRAY['regionkey'],
    partition_projection_enabled = true
);
INSERT INTO integer_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
INSERT INTO integer_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
INSERT INTO integer_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
INSERT INTO integer_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
INSERT INTO integer_table_name VALUES ('GERMANY_1', 'Hallo', 12, 8);
INSERT INTO integer_table_name VALUES ('GERMANY_2', 'Hallo', 13, 8);
INSERT INTO integer_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
The INSERT INTO statements add the data corresponding to the defined partitions.
SELECT * FROM "integer_table_name$partitions";
The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined while creating the table, not the metastore.
| regionkey |
+-----------+
| 0 |
+-----------+
| 1 |
+-----------+
| 2 |
+-----------+
| 3 |
+-----------+
| 4 |
+-----------+
| 5 |
+-----------+
| 6 |
+-----------+
| 7 |
+-----------+
| 8 |
+-----------+
| 9 |
+-----------+
| 10 |
+-----------+
Create a projection with the integer projection range using partition_projection_interval:
CREATE TABLE interval_table_name (
    name varchar,
    comment varchar,
    nationkey integer,
    regionkey integer WITH (
        partition_projection_type = 'integer',
        partition_projection_range = ARRAY['0', '10'],
        partition_projection_interval = 3
    )
)
WITH (
    partitioned_by = ARRAY['regionkey'],
    partition_projection_enabled = true
);
INSERT INTO interval_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
INSERT INTO interval_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
INSERT INTO interval_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
INSERT INTO interval_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
INSERT INTO interval_table_name VALUES ('GERMANY_1', 'Hallo', 12, 6);
INSERT INTO interval_table_name VALUES ('GERMANY_2', 'Hallo', 13, 9);
INSERT INTO interval_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
The INSERT INTO statements add the data corresponding to the defined partitions.
SELECT * FROM "interval_table_name$partitions";
The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined while creating the table, not the metastore.
In the following example output, only 0, 3, 6, and 9 are displayed because of the chosen partition_projection_interval value of 3.
| regionkey |
+-----------+
| 0 |
+-----------+
| 3 |
+-----------+
| 6 |
+-----------+
| 9 |
+-----------+
Create a projection with the date projection range:
CREATE TABLE date_table_name (
    name varchar,
    comment varchar,
    nationkey integer,
    dt varchar WITH (
        partition_projection_type = 'date',
        partition_projection_format = 'yyyy-MM-dd',
        partition_projection_range = ARRAY['NOW-5DAYS', 'NOW']
    )
)
WITH (
    partitioned_by = ARRAY['dt'],
    partition_projection_enabled = true
);
INSERT INTO date_table_name VALUES ('POLAND_1', 'Czesc', 0, '2023-06-01');
INSERT INTO date_table_name VALUES ('POLAND_2', 'Czesc', 1, '2023-06-01');
INSERT INTO date_table_name VALUES ('CZECH_1', 'Ahoj', 2, '2023-06-29');
INSERT INTO date_table_name VALUES ('CZECH_2', 'Ahoj', 3, '2023-06-01');
INSERT INTO date_table_name VALUES ('GERMANY_1', 'Hallo', 12, '2023-06-01');
INSERT INTO date_table_name VALUES ('GERMANY_2', 'Hallo', 13, '2023-07-02');
INSERT INTO date_table_name VALUES ('MEXICO_1', 'Hola', 22, '2023-07-01');
The INSERT INTO statements add the data corresponding to the defined partitions.
SELECT * FROM "date_table_name$partitions";
The SELECT * FROM statement returns only the partitions corresponding to the partition projection defined while creating the table, not the metastore. The following example output shows the returned dt partitions:
| dt |
+------------+
| 2023-08-26 |
+------------+
| 2023-08-27 |
+------------+
| 2023-08-28 |
+------------+
| 2023-08-29 |
+------------+
| 2023-08-30 |
+------------+
| 2023-08-31 |
+------------+
See session properties on the Great Lakes connectivity page to understand how these properties are used.
The following table describes the session properties supported by the Hive table type.
Session property | Description |
---|---|
avro_native_reader_enabled | Use the native Avro file reader. The default value is false. |
avro_native_writer_enabled | Use the native Avro file writer. The default value is false. |
bucket_execution_enabled | Enable bucket-aware execution: only use a single worker per bucket. The default value is true. |
collect_column_statistics_on_write | Enables automatic column level statistics collection on write. The default value is true. |
create_empty_bucket_files | Create empty files for buckets that have no data. The default value is false. |
csv_native_reader_enabled | Use the native CSV reader. The default value is false. |
csv_native_writer_enabled | Use the native CSV writer. The default value is false. |
force_local_scheduling | Only schedule splits on workers colocated with the data node. The default value is false. |
hive_storage_format | Default storage format for new tables or partitions. The default value is ORC. |
hive_views_legacy_translation | Use legacy translation of Hive views on a per-catalog basis. Hive views are defined in HiveQL and stored in the Hive Metastore Service; they are analyzed to allow read access to the data. Not available when using the Glue or Starburst metastore options. The default value is false. |
ignore_absent_partitions | Ignore partitions when the file system location does not exist rather than failing the query. The default value is false. |
ignore_corrupted_statistics | Ignore corrupted statistics rather than failing the query. The default value is false. |
insert_existing_partitions_behavior | Behavior on insert into existing partitions; this session property does not control behavior on insert into an existing unpartitioned table. The default value is APPEND. |
json_native_reader_enabled | Use the native JSON reader. The default value is false. |
json_native_writer_enabled | Use the native JSON writer. The default value is false. |
minimum_assigned_split_weight | The minimum assigned split weight when size based split weighting is enabled. The default value is 0.05. |
openx_json_native_reader_enabled | Use the native OpenX JSON reader. The default value is true. |
openx_json_native_writer_enabled | Use the native OpenX JSON writer. The default value is true. |
optimize_mismatched_bucket_count | Enable optimization to avoid shuffle when bucket count is compatible but not the same. The default value is false. |
optimize_symlink_listing | Optimize the listing for SymlinkTextFormat tables with files in a single directory. The default value is true. |
orc_optimized_writer_max_dictionary_memory | The maximum dictionary memory. The default value is 16MB. |
orc_optimized_writer_max_stripe_rows | The maximum stripe row count. The default value is 10000000. |
orc_optimized_writer_max_stripe_size | The maximum stripe size. The default value is 64MB. |
orc_optimized_writer_min_stripe_size | The minimum stripe size. The default value is 32MB. |
orc_optimized_writer_validate | Force validation for all files. The default value is false. |
orc_optimized_writer_validate_mode | Level of detail in ORC validation. Possible values include HASHED, DETAILED, and BOTH. The default value is BOTH. |
orc_optimized_writer_validate_percentage | The sample percentage for validation of files. The default value is 0. |
orc_use_column_names | Access ORC columns by name. By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. The default value is false. |
parallel_partitioned_bucketed_writes | Improve parallelism of partitioned and bucketed table writes. The default value is true. |
parquet_ignore_statistics | Ignore statistics from Parquet to allow querying files with corrupted or incorrect statistics. The default value is false. |
parquet_optimized_writer_validation_percentage | Percentage of Parquet files to validate after write by rereading the whole file. Validation is turned off by setting the value to 0. The default value is 5. |
parquet_use_column_index | Use the Parquet column index. The default value is true. |
parquet_use_column_names | Access Parquet columns by name by default. Set this property to false to access columns by their ordinal position in the Hive table definition. The default value is true. |
partition_statistics_sample_size | The maximum sample size of the partitions column statistics. The default value is 100. |
propagate_table_scan_sorting_properties | Use sorted table layout to generate more efficient execution plans. Note, this may lead to incorrect results if files are not sorted as per table definition. The default value is false. |
query_partition_filter_required | Set to true to temporarily force a query to use a partition filter. The default value is false. |
query_partition_filter_required_schemas | The list of schemas for which a filter on the partition column is enforced. The default value is array['']. |
rcfile_optimized_writer_validate | Validate RCFile writer files. The default value is false. |
regex_native_reader_enabled | Use the native REGEX reader. The default value is false. |
respect_table_format | Write new partitions using table format rather than the default storage format. The default value is true. |
s3_select_pushdown_enabled | Enable S3 Select pushdown. The default value is false. |
sequence_file_native_reader_enabled | Use the native sequence file reader. The default value is false. |
sequence_file_native_writer_enabled | Use the native sequence file writer. The default value is false. |
size_based_split_weights_enabled | Enable estimating split weights based on size in bytes. The default value is true. |
text_file_native_reader_enabled | Use the native text file reader. The default value is false. |
text_file_native_writer_enabled | Use the native text file writer. The default value is false. |
timestamp_precision | Specifies the precision for timestamp columns in Hive tables. The possible values are MILLISECONDS, MICROSECONDS, and NANOSECONDS. The default value is MILLISECONDS. |
validate_bucketing | Verify that data is bucketed correctly when reading. The default value is true. |
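For example, a session property is set for the duration of a session with the SET SESSION statement, qualified by the catalog name. A minimal sketch, assuming a catalog named examplecatalog:
-- Require a partition filter on queries against partitioned Hive tables
SET SESSION examplecatalog.query_partition_filter_required = true;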
When using the Hive table format with Great Lakes connectivity, the general SQL support details apply, with the following additional consideration: