Starburst Hive connector#

The Starburst Hive connector is an extended version of the Hive connector. Its configuration and usage are identical to those of the Hive connector.

Note

The additional features of the connector require a valid Starburst Enterprise license, unless otherwise noted.

The improvements it includes are described in the following sections.

Configuration#

The connector configuration is identical to the configuration for the base Hive connector.

The following additional information applies:

AWS Glue support#

Statistics collection is supported for Hive Metastore and AWS Glue.

Configuring and using SEP with AWS Glue is described in the AWS Glue support documentation section.

Cloudera compatibility matrix#

The Starburst Hive connector can query the Cloudera Data Platform (CDP), available as version 7.x. It also supports the predecessor Cloudera Distributed Hadoop (CDH) platform, available in versions 5.x and 6.x. Support and compatibility vary based on the version you use and are detailed in the following table:

CDP/CDH and SEP compatibility matrix#

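| Cloudera version   | 350-e | 345-e | 338-e | 332-e |
|--------------------|-------|-------|-------|-------|
| CDP 7              | Yes   | Yes   | Yes   | Yes   |
| CDH 6.x            | Yes   | Yes   | Yes   | Yes   |
| CDH 5.14+          | Yes   | Yes   | Yes   | Yes   |
| CDH 5.13           | Yes   | Yes   | Yes   | Yes   |
| CDH 5.12 and lower | No    | No    | No    | No    |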

The following details apply for CDH 6.x users:

  • reading tables and data files created by CDH 6.x is supported

  • transactional table usage is not supported

  • CDH 6.x Hive cannot read ORC files created by SEP, due to the behavior of the included Hive version

  • using the included Apache Sentry is not supported

The following details apply for CDH 5.x users:

  • reading tables and data files created by CDH 5.x is supported

  • transactional table usage is not supported

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Storage caching#

The connector supports the default storage caching. In addition, if HDFS Kerberos authentication is enabled in your catalog properties file with the following setting, caching takes the relevant permissions into account and operates accordingly:

hive.hdfs.authentication.type=KERBEROS

Additional configuration for Kerberos is required.

If HDFS Kerberos authentication is enabled, you can also enable user impersonation using:

hive.hdfs.impersonation.enabled=true
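As a rough sketch, the two settings above might sit together in a catalog properties file like the one below. Only the two hive.hdfs.* lines come from this page. The file name, the connector.name value, and the metastore URI are illustrative assumptions, and the additional Kerberos properties are deliberately left out.

```properties
# etc/catalog/hive.properties (hypothetical file name)

# Assumed connector name and placeholder metastore URI; adjust for your release.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083

# From this page: HDFS Kerberos authentication, which storage caching then respects.
hive.hdfs.authentication.type=KERBEROS

# From this page: optional user impersonation on top of Kerberos.
hive.hdfs.impersonation.enabled=true

# Additional Kerberos settings (such as principal and keytab) are required but omitted here.
```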

The service user assigned to SEP must be able to access the data files in the underlying storage. Access permissions are checked against the impersonated user, but with caching in place, some read operations happen in the context of the system user.

Any access control defined with the integration of Apache Ranger or the Privacera platform is also enforced by the storage caching.

Table scan redirection#

The connector supports table scan redirection to improve performance and reduce load on the data source.

Security#

The connector includes a number of security-related features, detailed in the following sections.

Authorization options#

SEP provides several authorization options for use with the Hive connector.

HDFS permissions#

Before running any CREATE TABLE or CREATE TABLE ... AS statements for Hive tables in SEP, you need to check that the operating system user running the SEP server has access to the Hive warehouse directory on HDFS.

The Hive warehouse directory is specified by the configuration variable hive.metastore.warehouse.dir in hive-site.xml, and the default value is /user/hive/warehouse.

If the user running the SEP server does not have access to this directory, either add the following to jvm.config on all of the nodes: -DHADOOP_USER_NAME=USER, where USER is an operating system user that has proper permissions for the Hive warehouse directory, or start the SEP server as a user with similar permissions. The hive user generally works as USER, since Hive is often started with the hive user.

If you run into HDFS permissions problems on CREATE TABLE ... AS, remove /tmp/presto-* on HDFS, fix the user as described above, then restart all of the SEP servers.
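For illustration, assuming the hive operating system user has the needed permissions on the warehouse directory (an assumption to verify on your own cluster), the jvm.config workaround above comes down to one extra line on every node:

```text
-DHADOOP_USER_NAME=hive
```

The existing JVM options in the file stay as they are. Because JVM options are only read at startup, the new option takes effect after the SEP servers are restarted.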

Limitations#

The following limitations apply in addition to the limitations of the Hive connector:

  • Writing to and creating transactional tables is not supported.

  • Reading ORC ACID tables created with Hive Streaming ingest is not supported.

  • For security reasons, the sys system catalog is not accessible in SEP.

  • Hive’s timestamp with local time zone data type is not supported in SEP. It is possible to read from a table that has a column of this type, but the column itself is not accessible. Writing to such a table is not supported.

  • SEP does not correctly read timestamp values from files in the Parquet, RCFile with binary serde, and Avro formats that were created by Hive 3.1 or later, due to Hive issues HIVE-21002 and HIVE-22167. When reading such files, SEP returns different results than Hive.