Starburst Hive connector#

The Starburst Hive connector is an extended version of the Hive connector, with identical configuration and usage.

Requirements#

Extensions#

The Starburst Hive connector supports improvements detailed in the security and performance sections, and includes the following additional extensions:

Amazon Glue support#

Statistics collection is supported for Hive Metastore and Amazon Glue.

Configuring and using SEP with AWS Glue is described in the AWS Glue documentation section.

Cloudera support#

The connector supports the Cloudera Data Platform (CDP) and the predecessor Cloudera Distributed Hadoop (CDH) platform.

IBM Cloud Object Storage support#

The connector supports querying data on IBM Cloud Object Storage.

MinIO support#

The connector supports querying data on MinIO storage.

MapR Hive support#

The connector includes support for a MapR-based Hive metastore as well as the MapR filesystem.

OpenX JSON format support#

The connector supports reading and writing data in tables stored as JSON files, using the OpenX JSON serialization and deserialization (serde) from the Java class org.openx.data.jsonserde.JsonSerDe.

Existing tables using that serde and all the associated serde properties are handled automatically.

The actual serde implementation is a fork of the original OpenX serde, updated to be compatible with the Hive 3 APIs used in SEP. The binary package of the forked serde implementation is available from Starburst Support. For optimal compatibility, install the package on any systems that read from and write to your Hive-managed storage with Hive 3.
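
For illustration, a table declared in Hive with this serde might look like the following sketch; the table, column, and location names are hypothetical, and SEP handles such a table automatically once it exists in the metastore:

CREATE EXTERNAL TABLE events_json (
  event_id BIGINT,
  event_type STRING,
  payload STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3a://mybucket/events_json/';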

Configuration#

The connector configuration is similar to the configuration for the base Hive connector, with these additional properties:

Starburst Hive connector properties#

  • hive.accelerated-parquet-reader.enabled - Use the accelerated Parquet reader. Defaults to true.

  • hive.delta-lake-catalog-name - Specifies the catalog name that SELECT queries are redirected to when a Delta Lake table is detected.

  • hive.delta-catalog-name - Deprecated. Use hive.delta-lake-catalog-name instead.

  • hive.azure.abfs.oauth2.passthrough - Set to true to reuse the Azure Active Directory (AD) token for access to the Azure Blob Storage. For more information, see Azure AD credential pass-through.

SQL support#

The connector supports all of the SQL statements listed in the Hive connector documentation.

The following section describes additional SQL operations that are supported by SEP enhancements to the Trino connector.

Materialized views#

Note

If you are a data consumer, read our getting started page for a concise introduction to using materialized views.

The connector supports materialized view management, with the following requirements:

In the underlying system, each materialized view consists of a view definition and a storage table. The storage table name is stored as a materialized view property. The materialized data is stored in that storage table.

Note

If you are a data engineer or platform administrator, read our cache service introduction for an overview of setting up the cache service and using materialized views.

Materialized views are populated with data and refreshed manually with the REFRESH MATERIALIZED VIEW command, or by the Automated materialized view management. Storage configuration for the storage tables must be supplied with the standard Hive connector table properties in the WITH clause, as shown in the example following this table:

Hive connector storage table properties#

  • format - File format to use for the storage table. Valid values include ORC, PARQUET, CSV, JSON, and others. The catalog property hive.storage-format defaults to ORC and can be used to set a different default. Default: ORC.

  • partitioned_by - Partitioning column for the storage table. Default: [].

  • bucketed_by - Bucketing column for the storage table. Must be used with bucket_count. Default: [].

  • bucket_count - Number of buckets to group data into. Must be used with bucketed_by.

  • sorted_by - Column to sort by to determine bucketing for rows. Default: [].
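
For example, a sketch of a materialized view that supplies storage table properties in the WITH clause, followed by a manual refresh; the catalog, schema, table, and column names are assumptions:

CREATE MATERIALIZED VIEW myhive.myschema.my_partitioned_view
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['event_date']
) AS
  SELECT event_id, event_type, event_date
  FROM mycatalog.public.events;

REFRESH MATERIALIZED VIEW myhive.myschema.my_partitioned_view;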

The automatic refresh is configured with properties set in the WITH clause of the CREATE MATERIALIZED VIEW statement.

The CREATE MATERIALIZED VIEW statement specifies the query to define the data for the materialized view, the refresh schedule, and other parameters used by the cache service. The query can access any available catalog and schema.

Once the storage tables are populated, the materialized view is created, and you can access it like a table using the name of the materialized view.

Use the SHOW CREATE MATERIALIZED VIEW statement to view the complete CREATE MATERIALIZED VIEW statement for a materialized view, including the properties in the WITH clause.

Dropping a materialized view with DROP MATERIALIZED VIEW removes the definition and the storage table.
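
For example, assuming the my_materialized_view example created later in this section, the following statements show its full definition and then remove it along with its storage table:

SHOW CREATE MATERIALIZED VIEW myhive.myschema.my_materialized_view;
DROP MATERIALIZED VIEW myhive.myschema.my_materialized_view;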

Configuration#

Specify the following configuration properties in the catalog properties file for each desired catalog to enable materialized view creation and usage in that catalog.

materialized-views.enabled=true
materialized-views.namespace=<your_namespace>
materialized-views.storage-schema=<your_storage_schema>
cache-service.uri=http://<my-cache-service-hostname>:8180
Catalog configuration properties for materialized views#

  • materialized-views.enabled - Set to true to enable materialized views. Required.

  • materialized-views.storage-catalog - Specifies the catalog that contains the schema used to store the storage tables for the materialized views. Defaults to the catalog used for the materialized view itself. Optional.

  • materialized-views.storage-schema - Specifies the schema used to store the storage tables for the materialized views. Ensure that the proper access control exists on that schema to prevent users from directly accessing the storage tables. Required.

  • materialized-views.namespace - Used by the cache service to create a fully-qualified name for the materialized views, and to identify which catalog is used to run the scheduled refresh queries. Required.

  • materialized-views.allow-run-as-invoker - Directs SEP to run as the user submitting the query when present in a catalog and set to true. If not present or set to false, it attempts to run as the view's owner. The equivalent catalog session property is materialized_views_allow_run_as_invoker. Optional.

  • cache-service.uri - The URI of the SEP cache service. Required.
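
You can also enable run-as-invoker behavior for a single session with the catalog session property mentioned above; a sketch, assuming a catalog named myhive:

SET SESSION myhive.materialized_views_allow_run_as_invoker = true;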

The storage schema must be defined in the catalog properties file.

materialized-views.storage-schema=mymvstorage

If it does not exist yet, you must create it with a defined location:

CREATE SCHEMA myhive.mymvstorage WITH (location = 's3a://mymvstorage/');

In addition, the schema for the materialized view itself must exist.
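
For example, a sketch of creating that schema; the schema name matches the examples in this section, and the location is an assumption:

CREATE SCHEMA myhive.myschema WITH (location = 's3a://mybucket/myschema/');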

With the cache service running, the catalog configured and the schemas defined, you can proceed to create a materialized view. In this example, a materialized view named myhive.myschema.my_materialized_view is created:

CREATE MATERIALIZED VIEW myhive.myschema.my_materialized_view
WITH (
  grace_period = '15.00m',
  max_import_duration = '1.00m'
) AS
  SELECT *
  FROM mycatalog.public.my_table
  WHERE my_field IN ( 'myvalue1', 'myvalue2' )
  ;

The query, specified after AS, can be any valid query, including queries accessing one or multiple other catalogs.

The properties for the view are stored in the cache service database, and the data in the storage schema, myhive.mymvstorage.

To query data in the materialized view, use a SELECT statement as you would for any other table:

SELECT * FROM myhive.myschema.my_materialized_view;

You can also use the materialized view in more complex queries, just like any other table.
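
For instance, a sketch that joins the materialized view with another table; the joined table and its columns are hypothetical:

SELECT mv.my_field, t.other_field
FROM myhive.myschema.my_materialized_view AS mv
JOIN mycatalog.public.other_table AS t
  ON mv.my_field = t.my_field;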

Access to a materialized view is much faster, since the data is readily available in the storage table and no computation is necessary. The data, however, uses storage in addition to the source data. Use the following query to determine the catalog, schema, and name of the storage table:

SELECT storage_catalog, storage_schema, storage_table
FROM system.metadata.materialized_views
WHERE catalog_name = 'myhive'
  AND schema_name = 'myschema'
  AND name = 'my_materialized_view';

Then use the hidden properties of the storage table to compute an estimate for the used storage:

SELECT SUM(size) AS table_size, COUNT(file) AS num_files
FROM (
  SELECT DISTINCT
    "$path" AS file,
    "$file_size" AS size
  FROM <storage_catalog>.<storage_schema>.<storage_table>
);

Troubleshooting#

Whether your materialized views are refreshed manually or by the cache service using WITH clause parameters, refreshes may fail if columns are added or renamed at the source. If this happens, drop the materialized view, and create it again.

Automated materialized view management#

The connector uses the cache service to manage metadata and maintenance of materialized views, related storage tables, and other aspects.

When the CREATE MATERIALIZED VIEW statement runs initially, the cache service picks up processing asynchronously after its configured refresh-interval and delay, which defaults to two minutes. The defined query is then run and used to populate the storage tables. The processing time depends on the complexity of the query, the cluster performance, and the performance of the storage.

Configuration properties are specified in the WITH clause when creating a materialized view:

Properties for automated materialized view management#

  • refresh_interval - Frequency at which the cache service triggers a refresh of the materialized view, for example refresh_interval = '1d' for one day. The refresh interval must be greater than or equal to five minutes, and is defined as a duration. Cannot be used with cron. The cache service refresh-interval and delay define how frequently the need for a refresh is checked.

  • cron - Unix cron expression specifying a schedule for regular refresh of the materialized view, for example 30 2 * * *. Cannot be used with refresh_interval. The cache service refresh-interval and delay define how frequently the need for a refresh is checked.

  • max_import_duration - Maximum allowed execution time for the refresh of the materialized view to complete. Measured from the scheduled time to the completion of the query execution. If a refresh fails because it exceeds the maximum duration, the cache service attempts a refresh at the next scheduled time. Default: 30.00m.

  • grace_period - After a view's TTL (Time to Live, calculated as refresh_interval + max_import_duration) has expired, the cache service waits the specified grace_period before queries running against it are terminated. If no grace_period is defined, the cache service defaultGracePeriod is used. NOTE: If you use this property, you must specify a value greater than the defaultGracePeriod defined in the cache service. Specifying a value less than the default results in an error. Default: 10.00m.

  • incremental_column - Column used by the service during incremental refresh to apply an incremental_column > max(incremental_column) filter when loading data from the source table. This facilitates loading only newer data from the source table instead of the entire table in each refresh iteration. If no column is specified, the cache service executes a full refresh. The column values must be monotonically increasing with each new record. Typical types are dates or increasing integer values used as identifiers.

  • namespace - Namespace used by the cache service to create a fully qualified name for materialized views in a catalog.

  • run_as_invoker - Validate access to tables and data referenced by the materialized view as the invoker. Default: false.
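
As a sketch of these properties in use, the following hypothetical view refreshes daily and loads only rows with a sale_date greater than the maximum already stored; the source table, column names, and view name are assumptions:

CREATE MATERIALIZED VIEW myhive.myschema.daily_sales
WITH (
  refresh_interval = '1d',
  incremental_column = 'sale_date'
) AS
  SELECT sale_date, store_id, amount
  FROM mycatalog.public.sales;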

In the following example, a materialized view named customer_total_return in myhive.myschema is created to automatically refresh daily at 2:30 AM:

CREATE MATERIALIZED VIEW myhive.myschema.customer_total_return
WITH (
  grace_period = '5.00m',
  max_import_duration = '30.00m',
  cron = '30 2 * * *'
) AS
    SELECT
      sr_customer_sk ctr_customer_sk,
      sr_store_sk ctr_store_sk,
      sum(sr_return_amt) ctr_total_return
    FROM
    tpcds.sf1.store_returns,
    tpcds.sf1.date_dim
    WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year = 2000) )
    GROUP BY sr_customer_sk, sr_store_sk
;

After a materialized view is refreshed, any new query requests that arrive after the refresh is complete run against the new contents. Query requests created before the refresh completed continue to run against the previously existing contents until the effective grace period for that storage table is over.

Troubleshooting#

When you first create a materialized view, it is helpful to be able to determine the state of a refresh, particularly when a refresh has failed or the view data appears to be stale. This can happen when the refresh takes longer than the combined max_import_duration and effective grace period. Materialized views have their own metadata tables, located in the default schema of the cache service database, that contain the current state and other information. Metadata tables are named after the materialized view, with $imports appended to the end. For example, given a materialized view myhive.myschema.customer_total_return, run the following query to view the metadata for your materialized view:

SELECT * FROM myhive.myschema."customer_total_return$imports"

Note

You must enclose <your_table_name>$imports in quotes so that the query parser handles the dollar sign correctly.

The resulting metadata table contains the following fields:

  • status - Scheduled, Running, Finished, Failed, Timeout

  • max_import_duration - The value originally set in the CREATE AS statement

  • start_time - As computed from the cron or refresh_interval

  • finish_time

  • row_count

  • error

The metadata for a refresh is maintained until the effective grace period passes.
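
For example, to check for failed refreshes of the same example view, you might filter on the status field; this is a sketch assuming the fields listed above are queryable by these names:

SELECT status, start_time, finish_time, error
FROM myhive.myschema."customer_total_return$imports"
WHERE status = 'Failed';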

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Dynamic row filtering#

Dynamic filtering, and specifically dynamic row filtering, is enabled by default. Row filtering improves the effectiveness of dynamic filtering for a connector by using dynamic filters to remove unnecessary rows during a table scan. It is especially effective for selective filters on columns that are not used for partitioning or bucketing, or whose values do not naturally appear in any clustered order.

As a result, the amount of data read from storage and transferred across the network is further reduced, yielding higher query performance and reduced cost.

You can use the following properties to configure dynamic row filtering:

Dynamic row filtering properties#

  • dynamic-row-filtering.enabled - Toggle dynamic row filtering. Defaults to true. The catalog session property name is dynamic_row_filtering_enabled.

  • dynamic-row-filtering.selectivity-threshold - Control the threshold for the fraction of rows selected from the overall table above which dynamic row filters are not used. Defaults to 0.7. The catalog session property name is dynamic_row_filtering_selectivity_threshold.

  • dynamic-row-filtering.wait-timeout - Duration to wait for completion of dynamic row filtering. Defaults to 0. The default causes query processing to proceed without waiting for the dynamic row filter; the filter is collected asynchronously and used as soon as it becomes available. The catalog session property name is dynamic_row_filtering_wait_timeout.
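
For example, to disable dynamic row filtering for the current session only, set the catalog session property; the catalog name myhive is an assumption:

SET SESSION myhive.dynamic_row_filtering_enabled = false;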

Storage caching#

The connector supports the default storage caching. In addition, if HDFS Kerberos authentication is enabled in your catalog properties file with the following setting, caching takes the relevant permissions into account and operates accordingly:

hive.hdfs.authentication.type=KERBEROS

Additional configuration for Kerberos is required.

If HDFS Kerberos authentication is enabled, you can also enable user impersonation using:

hive.hdfs.impersonation.enabled=true

The service user assigned to SEP needs to be able to access data files in the underlying storage. Access permissions are checked against the impersonated user, but with caching in place, some read operations happen in the context of the system user.

Any access control defined with the integration of Apache Ranger or the Privacera platform is also enforced by the storage caching.

Starburst Cached Views#

The connector supports Starburst Cached Views and can therefore be configured for table scan redirection and materialized views to improve performance.

Accelerated Parquet reader#

The connector includes a Parquet reader with numerous performance improvements for reading data files using the Parquet format. The reader is enabled by default.

You can disable it, and therefore use the default reader from Trino, in your catalog properties file:

hive.accelerated-parquet-reader.enabled=false

Alternatively you can disable it for a specific user session with the catalog session property accelerated_parquet_reader_enabled.
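
A sketch of disabling it for the current session, assuming a catalog named myhive:

SET SESSION myhive.accelerated_parquet_reader_enabled = false;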

Security#

The connector includes a number of security-related features, detailed in the following sections.

Azure AD credential pass-through#

To use Azure Active Directory (AD) with credential pass-through, you must include the following configuration in Config properties:

http-server.authentication.oauth2.scopes=https://storage.azure.com/user_impersonation,openid
http-server.authentication.oauth2.additional-audiences=https://storage.azure.com

If you use Azure AD as the identity provider when you integrate with Azure Storage, you can reuse the AD token to access Azure Storage blobs.

To reuse the AD token, set the following access properties in the catalog properties file:

hive.azure.abfs.oauth2.passthrough=true

Enable the OAuth 2.0 token pass-through authentication type with the following configuration in Config properties:

http-server.authentication.type=delegated-oauth

You must grant the Azure application user_impersonation API permissions for Azure Storage. For more information about delegated-oauth, see OAuth 2.0 token pass-through.

You can only set one type or group of access properties in the catalog properties file. Setting more than one prevents SEP from running properly. The valid access property combinations for the catalog properties file include the following:

  • hive.azure.abfs.oauth2.passthrough.

  • hive.azure.abfs-storage-account and hive.azure.abfs-access-key.

  • hive.azure.abfs.oauth.endpoint, hive.azure.abfs.oauth.client-id, and hive.azure.abfs.oauth.secret.

Authorization options#

SEP provides several authorization options for use with the Hive connector:

HDFS permissions#

Before running any CREATE TABLE or CREATE TABLE ... AS statements for Hive tables in SEP, you need to check that the operating system user running the SEP server has access to the Hive warehouse directory on HDFS.

The Hive warehouse directory is specified by the configuration variable hive.metastore.warehouse.dir in hive-site.xml; the default value is /user/hive/warehouse.

If the operating system user running the SEP server does not have access to that directory, either add the following to jvm.config on all of the nodes: -DHADOOP_USER_NAME=USER, where USER is an operating system user that has proper permissions for the Hive warehouse directory, or start the SEP server as a user with similar permissions. The hive user generally works as USER, since Hive is often started with the hive user.

If you run into HDFS permissions problems on CREATE TABLE ... AS, remove /tmp/presto-* on HDFS, fix the user as described above, then restart all of the SEP servers.

Limitations#

The following limitations apply in addition to the limitations of the Hive connector.

  • Reading ORC ACID tables created with Hive Streaming ingest is not supported.

  • Redirections are supported for Hive tables but not Hive views.

  • For security reasons, the sys system catalog is not accessible in SEP.

  • Hive’s timestamp with local zone data type is not supported in SEP. It is possible to read from a table having a column of this type, but the column itself will not be accessible. Writing to such a table is not supported.

  • SEP does not correctly read timestamp values from Parquet, RCFile with binary serde, and Avro file formats created by Hive 3.1 or later, due to Hive issues HIVE-21002 and HIVE-22167. When reading from these file formats, SEP returns results that differ from those returned by Hive.