    Hive table format #

    Great Lakes connectivity abstracts the details of using different table formats and file types when using object storage catalogs.

    This page describes the features specific to the Hive table format when used with Great Lakes connectivity.

    Specify Hive format #

    Hive connectivity allows querying data stored in an Apache Hive data warehouse. Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.

    For an object storage catalog that specifies Hive as its default table format, no special syntax is required. A simple CREATE TABLE statement creates a Hive format table.

    To create a Hive table when the default format is not Hive, add type='hive' as a table property in your CREATE statement. For example:

    CREATE TABLE customer (
      name varchar,
      address varchar)
    WITH (type='hive');
    

    Hive table format properties #

    As part of a CREATE TABLE statement, specify the format of the table data files with the additional format property, set to one of the values shown in the following table:

    File format   Description
    ORC           Apache ORC file format. This is the default value.
    PARQUET       Apache Parquet file format.
    JSON          JSON file format, using org.apache.hive.hcatalog.data.JsonSerDe.
    CSV           Comma-separated values file format, using org.apache.hadoop.hive.serde2.OpenCSVSerde.
    TEXTFILE      Plain text file format.
    AVRO          Apache Avro file format.
    RCTEXT        RCFile format, using ColumnarSerDe.
    RCBINARY      RCFile format, using LazyBinaryColumnarSerDe.
    SEQUENCEFILE  Flat file format using binary key/value pairs.
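
    For example, the following sketch, with illustrative table and column names, creates a Hive table that stores its data in Parquet files:

    CREATE TABLE orders (
      orderkey bigint,
      orderstatus varchar)
    WITH (type='hive', format='PARQUET');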

    Hive table properties #

    As part of a CREATE TABLE statement, append further comma-separated table properties as needed. Note that many of the available table properties must be used in conjunction with a particular Hive file format.

    Table property Description
    auto_purge Indicates to the configured metastore to perform a purge when a table or partition is deleted, instead of a soft deletion using the trash.
    avro_schema_url The URI of the Avro schema for the table, supporting schema evolution.
    bucket_count The number of buckets to group data into. Only valid if used with bucketed_by. Defaults to 0.
    bucketed_by The bucketing column for the storage table. Only valid if used with bucket_count. Defaults to [].
    bucketing_version Specifies which Hive bucketing version to use. Valid values are 1 or 2.
    csv_escape The CSV escape character. Requires CSV format.
    csv_quote The CSV quote character. Requires CSV format.
    csv_separator The CSV separator character. Requires CSV format. You can use other separators such as | or use Unicode to configure invisible separators such as tabs with U&'\0009'.
    external_location The URI of an external Hive table on S3, Azure Blob Storage, etc. See the basic usage examples for more information.
    format The table file format. Valid values are ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV, and REGEX.
    null_format The serialization format for NULL values. Requires TextFile, RCText, or SequenceFile format.
    orc_bloom_filter_columns Comma-separated list of columns to use for ORC bloom filters. Bloom filters improve the performance of queries using range predicates when reading ORC files. Requires ORC format. Defaults to [].
    orc_bloom_filter_fpp The false positive probability for ORC bloom filters. Requires ORC format. Defaults to 0.05.
    partitioned_by The partitioning column for the storage table. The columns listed in the partitioned_by clause must be the last columns as defined in the DDL. Defaults to [].
    skip_footer_line_count The number of footer lines to ignore when parsing the file for data. Requires TextFile or CSV format tables.
    skip_header_line_count The number of header lines to ignore when parsing the file for data. Requires TextFile or CSV format tables.
    sorted_by The column to sort by to determine bucketing for rows. Only valid if bucketed_by and bucket_count are specified as well. Defaults to [].
    textfile_field_separator Allows the use of custom field separators, such as |, for TextFile formatted tables.
    textfile_field_separator_escape Allows the use of a custom escape character for TextFile formatted tables.
    transactional Set this property to true to create an ORC ACID transactional table. Requires ORC format. This property may be shown as true for insert-only tables created using older versions of Hive.
    partition_projection_enabled Enables partition projection for the selected table. Mapped from the AWS Athena table property projection.enabled.
    partition_projection_ignore Ignore any partition projection properties stored in the metastore for the selected table. This is a Galaxy-only property that lets you work around compatibility issues on a specific table; if enabled, Galaxy ignores all other configuration options related to partition projection.
    partition_projection_location_template Projected partition location template, such as s3://test/name=${name}/. Mapped from the AWS Athena table property storage.location.template. Defaults to ${table_location}/${partition_name}.
    partition_projection_type Defines the type of partition projection to use on a specific column. May be used only on partition columns. Available types include: ENUM, INTEGER, DATE.
    extra_properties Additional properties added to a Hive table. The properties are not used by Galaxy, and are available in the $properties metadata table. The properties are not included in the output of SHOW CREATE TABLE statements.

    Some SQL statements must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:

    CREATE TABLE hive.web.page_views (
      view_time timestamp,
      user_id bigint,
      page_url varchar,
      ds date,
      country varchar
    )
    WITH (
      type = 'hive',
      format = 'ORC',
      partitioned_by = ARRAY['ds', 'country'],
      bucketed_by = ARRAY['user_id'],
      bucket_count = 50
    )
    

    Hive metadata table #

    The $properties metadata table provides general configuration information about a Hive table. Query the metadata table by appending $properties to a table name:

    SELECT * FROM "hive.web.page_views$properties";
    

    Table procedures #

    Use the CALL statement to perform data manipulation or administrative tasks. Procedures must include a qualified catalog name. For example, for a catalog named examplecatalog:

    CALL examplecatalog.system.example_procedure();
    

    The following Hive table procedures are available:

    create_empty_partition #

    system.create_empty_partition('schema_name', 'table_name', partition_columns, partition_values);
    

    Creates an empty partition in the specified table.

    The partition columns can have types other than varchar, such as integer or date. Regardless of the column type, specify each element of the partition_values array as a varchar. For example: partition_values => ARRAY['2016-08-09', 'US']
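
    For example, a complete invocation using the hive.web.page_views table defined earlier might look like this sketch:

    CALL hive.system.create_empty_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US']);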

    drop_stats #

    system.drop_stats('schema_name', 'table_name', partition_values);
    

    Drops statistics for one or more partitions or the entire table. Specify the partitions as an array whose elements are arrays of partition_values. The following example shows partition_values for dropping stats for multiple partitions:

    partition_values => ARRAY[ARRAY['2016-08-09', 'US'], ARRAY['2023-08-03', 'UK']]
    

    The partition_values argument is optional. If the partition_values argument is omitted, stats are dropped for the entire table.
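
    For example, a sketch that drops statistics for a single partition of the hive.web.page_views table defined earlier:

    CALL hive.system.drop_stats(
        schema_name => 'web',
        table_name => 'page_views',
        partition_values => ARRAY[ARRAY['2016-08-09', 'US']]);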

    register_partition #

    system.register_partition('schema_name', 'table_name', partition_columns,
        partition_values, location);
    

    Registers the existing location as a new partition in the metastore for the specified table.

    With the location argument omitted, the partition location is constructed using partition_columns and partition_values.
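
    For example, a sketch that registers an existing object storage location as a partition of the hive.web.page_views table defined earlier; the S3 URI is illustrative:

    CALL hive.system.register_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US'],
        location => 's3://example-bucket/page_views/ds=2016-08-09/country=US');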

    sync_partition_metadata #

    system.sync_partition_metadata('schema_name', 'table_name', 'mode', case_sensitive);
    

    Checks and updates the partition list in the metastore. Three modes are available:

    • ADD: add any partitions that exist on the file system, but not in the metastore.
    • DROP: drop any partitions that exist in the metastore, but not on the file system.
    • FULL: perform both ADD and DROP.

    The case_sensitive argument is optional. The default value is true for compatibility with Hive’s MSCK REPAIR TABLE behavior, which expects the partition column names in file system paths to use lowercase (for example, col_x=SomeValue). Partitions on the file system not conforming to this convention are ignored, unless the argument is set to false.
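
    For example, a sketch that both adds and drops partitions so the metastore matches the file system for the hive.web.page_views table defined earlier:

    CALL hive.system.sync_partition_metadata(
        schema_name => 'web',
        table_name => 'page_views',
        mode => 'FULL');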

    unregister_partition #

    system.unregister_partition('schema_name', 'table_name', partition_columns,
        partition_values);
    

    Unregisters the given, existing partition in the metastore for the specified table. The partition data is not deleted.
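
    For example, a sketch that removes the partition registered in the previous example:

    CALL hive.system.unregister_partition(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US']);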

    Migrate to Iceberg #

    Great Lakes connectivity supports migration from the Hive table format to the Iceberg table format, and it can read and write Hive tables that have been successfully migrated to Iceberg. The migration process creates Iceberg metadata without copying any data, and it does not generate table statistics.

    The migration from Hive to Iceberg supports only the Parquet, ORC, and Avro file formats.

    Use an ALTER TABLE SET PROPERTIES statement to convert a table from the Hive table format to the Iceberg table format:

    ALTER TABLE catalog_name.schema_name.table_name SET PROPERTIES type = 'ICEBERG';
    

    Run the statement against the specific catalog, supplying the relevant schema and table names for the required schema_name and table_name parameters.

    Once the migration process is complete, use the SHOW CREATE TABLE statement to confirm that the table type has changed:

    SHOW CREATE TABLE example_catalog.schema_name.table_name;
    

    A successful migration shows type = 'ICEBERG' in the output.

    Partition projection #

    Partition projection is a feature used to speed up query processing. This feature computes partition values and locations statically from the partition projection definition of the table properties and does not rely on the metastore for retrieving them.

    Great Lakes connectivity supports partition projection table properties stored in the metastore, and it reimplements this functionality. Currently, date projection only supports intervals of DAYS, HOURS, MINUTES, and SECONDS.

    If there are any compatibility issues blocking access to a requested table when partition projection is enabled, set the partition_projection_ignore table property to true for a table to bypass any errors.

    In addition, Great Lakes connectivity supports the partition_projection_type table property, which defines the type of partition projection to use on a specific column. It can be set to enum, integer, or date:

    Name     Description
    enum     Use when there is a predefined set of partition values, such as country or region names.
    integer  Specifies a range, providing a more concise way to predefine the partition values. For example, [1, 2, 3, 4, ..., 1000] or [0500, 0550, 0600, ..., 2500].
    date     Specifies a range, providing a more concise way to predefine the partition values. For example, [20200101, 20200102, ..., 20201231] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, ..., 12-31-2020 23:00:00].

    Read more about Hive table properties.

    Using partition projection #

    To use partition projection, specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or in an external Hive metastore.

    Create a projection with the enum projection type:

    CREATE SCHEMA catalog_name.schema_name;
    USE catalog_name.schema_name;
    
    CREATE TABLE enum_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer,
        short_name varchar WITH (
            partition_projection_type = 'enum',
            partition_projection_values = array ['POL', 'CZE']
        )
    )
    WITH (
        partitioned_by = array['short_name'],
        partition_projection_enabled = true
    );
    
    SHOW CREATE TABLE enum_table_name;
    
    INSERT INTO enum_table_name VALUES ('POLAND_1', 'Czesc', 0, 5, 'POL');
    INSERT INTO enum_table_name VALUES ('POLAND_2', 'Czesc', 1, 5, 'POL');
    INSERT INTO enum_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5, 'CZE');
    INSERT INTO enum_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5, 'CZE');
    

    The INSERT INTO statements add data to the POL and CZE partitions.

    SELECT * FROM "enum_table_name$partitions";
    

    The SELECT statement displays only the POL and CZE partitions, which correspond to the partition projection defined when the table was created rather than to partitions stored in the metastore.

    | short_name |
    +------------+
    | POL        |
    +------------+
    | CZE        |
    +------------+
    

    Create a projection with the integer projection range:

    CREATE TABLE integer_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer WITH (
            partition_projection_type = 'integer',
            partition_projection_range = array ['0', '10']
        )
    )
    WITH (
        partitioned_by = array['regionkey'],
        partition_projection_enabled = true
    );
    
    INSERT INTO integer_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
    INSERT INTO integer_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
    INSERT INTO integer_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
    INSERT INTO integer_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
    INSERT INTO integer_table_name VALUES ('GERMANY_1', 'Hallo', 12, 8);
    INSERT INTO integer_table_name VALUES ('GERMANY_2', 'Hallo', 13, 8);
    INSERT INTO integer_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
    

    The INSERT INTO statements add data to the defined partitions.

    SELECT * FROM "integer_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions that correspond to the partition projection defined when the table was created, not partitions stored in the metastore.

    | regionkey |
    +-----------+
    | 0         |
    +-----------+
    | 1         |
    +-----------+
    | 2         |
    +-----------+
    | 3         |
    +-----------+
    | 4         |
    +-----------+
    | 5         |
    +-----------+
    | 6         |
    +-----------+
    | 7         |
    +-----------+
    | 8         |
    +-----------+
    | 9         |
    +-----------+
    | 10        |
    +-----------+
    

    Create a projection with the integer projection range using partition_projection_interval:

    CREATE TABLE interval_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        regionkey integer WITH (
            partition_projection_type = 'integer',
            partition_projection_range = array ['0', '10'],
            partition_projection_interval = 3
        )
    )
    WITH (
        partitioned_by = array['regionkey'],
        partition_projection_enabled = true
    );
    
    INSERT INTO interval_table_name VALUES ('POLAND_1', 'Czesc', 0, 5);
    INSERT INTO interval_table_name VALUES ('POLAND_2', 'Czesc', 1, 5);
    INSERT INTO interval_table_name VALUES ('CZECH_1', 'Ahoj', 2, 5);
    INSERT INTO interval_table_name VALUES ('CZECH_2', 'Ahoj', 3, 5);
    INSERT INTO interval_table_name VALUES ('GERMANY_1', 'Hallo', 12, 6);
    INSERT INTO interval_table_name VALUES ('GERMANY_2', 'Hallo', 13, 9);
    INSERT INTO interval_table_name VALUES ('MEXICO_1', 'Hola', 22, 18);
    

    The INSERT INTO statements add data to the defined partitions.

    SELECT * FROM "interval_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions that correspond to the partition projection defined when the table was created, not partitions stored in the metastore.

    In the following example output, only 0, 3, 6, and 9 are displayed because of the chosen partition_projection_interval value of 3.

    | regionkey |
    +-----------+
    | 0         |
    +-----------+
    | 3         |
    +-----------+
    | 6         |
    +-----------+
    | 9         |
    +-----------+
    

    Create a projection with the date projection range:

    CREATE TABLE date_table_name (
        name varchar,
        comment varchar,
        nationkey integer,
        dt varchar WITH (
            partition_projection_type = 'date',
            partition_projection_format = 'yyyy-MM-dd',
            partition_projection_range = array['NOW-5DAYS', 'NOW']
        )
    )
    WITH (
        partitioned_by = array['dt'],
        partition_projection_enabled = true
    );
    
    INSERT INTO date_table_name VALUES ('POLAND_1', 'Czesc', 0, '2023-06-01');
    INSERT INTO date_table_name VALUES ('POLAND_2', 'Czesc', 1, '2023-06-01');
    INSERT INTO date_table_name VALUES ('CZECH_1', 'Ahoj', 2, '2023-06-29');
    INSERT INTO date_table_name VALUES ('CZECH_2', 'Ahoj', 3, '2023-06-01');
    INSERT INTO date_table_name VALUES ('GERMANY_1', 'Hallo', 12, '2023-06-01');
    INSERT INTO date_table_name VALUES ('GERMANY_2', 'Hallo', 13, '2023-07-02');
    INSERT INTO date_table_name VALUES ('MEXICO_1', 'Hola', 22, '2023-07-01');
    

    The INSERT INTO statements add data to the defined partitions.

    SELECT * FROM "date_table_name$partitions";
    

    The SELECT * FROM statement returns only the partitions that correspond to the partition projection defined when the table was created, not partitions stored in the metastore.

    The following example output shows the values of the dt partition column:

    | dt         |
    +------------+
    | 2023-08-26 |
    +------------+
    | 2023-08-27 |
    +------------+
    | 2023-08-28 |
    +------------+
    | 2023-08-29 |
    +------------+
    | 2023-08-30 |
    +------------+
    | 2023-08-31 |
    +------------+
    

    Session properties #

    See session properties on the Great Lakes connectivity page to understand how these properties are used.

    The following table describes the session properties supported by the Hive table type.

    Session property Description
    avro_native_reader_enabled Use the native Avro file reader. The default value is false.
    avro_native_writer_enabled Use the native Avro file writer. The default value is false.
    bucket_execution_enabled Enable bucket-aware execution: only use a single worker per bucket. The default value is true.
    collect_column_statistics_on_write Enables automatic column level statistics collection on write. The default value is true.
    create_empty_bucket_files Create empty files for buckets that have no data. The default value is false.
    csv_native_reader_enabled Use the native CSV reader. The default value is false.
    csv_native_writer_enabled Use the native CSV writer. The default value is false.
    force_local_scheduling Only schedule splits on workers colocated with the data node. The default value is false.
    hive_storage_format Default storage format for new tables or partitions. The possible values are:
    • ORC
    • PARQUET
    • AVRO
    • RCBINARY
    • RCTEXT
    • SEQUENCEFILE
    • JSON
    • OPENX_JSON
    • TEXTFILE
    • CSV
    • REGEX
    The default value is ORC.
    hive_views_legacy_translation Use legacy translation to allow read access to Hive views in this catalog. Hive views are defined in HiveQL and stored in the Hive Metastore Service, where they are analyzed to allow read access to the data. Not available when using the Glue or Starburst metastore options. The default value is false.
    ignore_absent_partitions Ignore partitions when the file system location does not exist rather than failing the query. The default value is false.
    ignore_corrupted_statistics Ignore the corrupted statistics rather than failing the query. The default value is false.
    insert_existing_partitions_behavior Behavior when inserting into existing partitions; this session property does not control behavior when inserting into an existing unpartitioned table. The default value is APPEND.
    json_native_reader_enabled Use the native JSON reader. The default value is false.
    json_native_writer_enabled Use the native JSON writer. The default value is false.
    minimum_assigned_split_weight The minimum assigned split weight when size based split weighting is enabled. The default value is 0.05.
    openx_json_native_reader_enabled Use the native OpenX JSON reader. The default value is true.
    openx_json_native_writer_enabled Use the native OpenX JSON writer. The default value is true.
    optimize_mismatched_bucket_count Enable optimization to avoid shuffle when bucket count is compatible but not the same. The default value is false.
    optimize_symlink_listing Optimize the listing for SymlinkTextFormat tables with files in a single directory. The default value is true.
    orc_optimized_writer_max_dictionary_memory The maximum dictionary memory. The default value is 16MB.
    orc_optimized_writer_max_stripe_rows The maximum stripe row count. The default value is 10000000.
    orc_optimized_writer_max_stripe_size The maximum stripe size. The default value is 64MB.
    orc_optimized_writer_min_stripe_size The minimum stripe size. The default value is 32MB.
    orc_optimized_writer_validate Force validation for all files. The default value is false.
    orc_optimized_writer_validate_mode Level of detail in ORC validation. Possible values include HASHED, DETAILED, and BOTH. The default value is BOTH.
    orc_optimized_writer_validate_percentage The sample percentage for validation of files. The default value is 0.
    orc_use_column_names Access ORC columns by name. By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. The default value is false.
    parallel_partitioned_bucketed_writes Improve parallelism of partitioned and bucketed table writes. The default value is true.
    parquet_ignore_statistics Ignore statistics from Parquet to allow querying files with corrupted or incorrect statistics. The default value is false.
    parquet_optimized_writer_validation_percentage The percentage of Parquet files to validate after write by rereading the whole file. Set the value to 0 to turn off validation. The default value is 5.
    parquet_use_column_index Use the Parquet column index. The default value is true.
    parquet_use_column_names Access Parquet columns by name by default. Set this property to false to access columns by their ordinal position in the Hive table definition. The default value is true.
    partition_statistics_sample_size The maximum sample size of the partitions column statistics. The default value is 100.
    propagate_table_scan_sorting_properties Use sorted table layout to generate more efficient execution plans. Note, this may lead to incorrect results if files are not sorted as per table definition. The default value is false.
    query_partition_filter_required Set to true to require that queries against partitioned tables include a filter on at least one partition column for the duration of the session.
    query_partition_filter_required_schemas The list of schemas for which a filter on a partition column is enforced. The default value is array[''].
    rcfile_optimized_writer_validate Validate RCFile writer files. The default value is false.
    regex_native_reader_enabled Use the native REGEX reader. The default value is false.
    respect_table_format Write new partitions using the table format rather than the default storage format. The default value is true.
    s3_select_pushdown_enabled Enable S3 Select pushdown. The default value is false.
    sequence_file_native_reader_enabled Use the native sequence file reader. The default value is false.
    sequence_file_native_writer_enabled Use the native sequence file writer. The default value is false.
    size_based_split_weights_enabled Enable estimating split weights based on size in bytes. The default value is true.
    text_file_native_reader_enabled Use the native text file reader. The default value is false.
    text_file_native_writer_enabled Use the native text file writer. The default value is false.
    timestamp_precision Specifies the precision for the timestamp columns in Hive tables. The possible values are:
    • MILLISECONDS
    • MICROSECONDS
    • NANOSECONDS
    The default value is MILLISECONDS.
    validate_bucketing Verify that data is bucketed correctly when reading. The default value is true.
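
    For example, the following sketch changes the storage format for new tables and partitions for the remainder of the session; the catalog name examplecatalog is illustrative:

    SET SESSION examplecatalog.hive_storage_format = 'PARQUET';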

    Hive SQL support #

    When using the Hive table format with Great Lakes connectivity, the general SQL support details apply, with the following additional consideration:

    • Hive views are not supported.