AWS Glue support

AWS Glue is a supported metadata catalog for Starburst Enterprise platform (SEP). It is intended as an alternative to the Hive Metastore, used with the SEP Hive plugin to work with your S3 data.

AWS Glue with SEP AMI

When you deploy a Starburst Enterprise platform (SEP) AMI from the AWS Marketplace, you need to configure the Hive connector to use Glue. The minimal setup consists of the following steps on all SEP nodes:

1. Create a /etc/starburst/catalogs/glue.properties file with at least the following contents:

connector.name=hive-hadoop2
hive.metastore=glue

Add any other Hive connector-specific properties that your use case requires. See Starburst Hive connector for more details.

2. Restart SEP with:

sudo service starburst restart

AWS Glue with CloudFormation template

When using the CloudFormation template in AWS, you can use Glue by choosing AWS Glue Data Catalog in the MetastoreType field of the stack creation form (Starburst Enterprise Configuration section).

SEP with AWS Glue usage

When configured as above, the Glue catalog is available via the hive catalog from within the CLI or any other SEP connection. As usual, remember to specify the location of the data on S3, either for the entire schema or at the table level. For example, to create a schema foo in Glue with the S3 base directory (the root folder for per-table subdirectories) pointing to the root of the my-bucket S3 bucket, run:

CREATE SCHEMA hive.foo WITH (location = 's3://my-bucket/');
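
The location can also be set at the table level. The following is a minimal sketch, assuming the external_location and format table properties of the Hive connector. The table name events, its columns, and the events/ prefix are hypothetical and only serve as an illustration:

```sql
-- Hypothetical external table whose data lives under a specific S3 prefix.
-- Table name, columns, and prefix are assumptions, not taken from this page.
CREATE TABLE hive.foo.events (
    event_id BIGINT,
    payload VARCHAR
)
WITH (
    external_location = 's3://my-bucket/events/',
    format = 'ORC'
);
```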

You can also create and edit schemas and tables directly from AWS Glue. In AWS Glue terminology, a schema is called a “database”.

Prerequisites

Both the AMI and CloudFormation approaches mentioned above require the SEP instances to have permissions to access both the S3 and Glue AWS services.

When using SEP via our CloudFormation template, you do not need to provide anything by default. The template creates all necessary resources automatically.

If you need to provide your own IAM instance profile for the SEP instances (IamInstanceProfile field in the stack creation form), consult the IAM role Permissions for cluster nodes section. The same applies when launching the AMI manually: make sure you choose an IAM role that satisfies the requirements.

Table and column statistics support

SEP supports standard Glue table and column statistics via the AWS Glue API. You can create and manage the statistics with the ANALYZE statement.
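
As an illustration, collecting and inspecting statistics might look like the following sketch. The table hive.foo.events is a hypothetical name, not part of this documentation, and SHOW STATS is used here only to inspect the result:

```sql
-- Collect table and column statistics for a (hypothetical) table.
ANALYZE hive.foo.events;

-- Inspect the statistics that are now available for the table.
SHOW STATS FOR hive.foo.events;
```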

Legacy statistics

SEP releases before 354-e used custom statistics. These SEP-based statistics are no longer supported and have been replaced by the standard Glue statistics.

The hive.metastore.glue.column-statistics.enabled configuration property is deprecated.

You can enable read-only access to the SEP-based table statistics for transition purposes. Set the property hive.metastore.glue.read-properties-based-column-statistics to true during the migration, until the standard statistics are available. The legacy statistics are only used if standard statistics are not present for a table or a partition.

SEP-based statistics are still stored in JSON format as Glue table and partition parameters. After the migration these can be deleted.

Known limitations of AWS Glue support

The following SEP features are not yet supported with the Glue catalog:

  • When a column is renamed, its statistics are not preserved. Therefore the table needs to be re-analyzed.

  • Renaming tables from within AWS Glue is not supported.

  • Partition values containing quotes and apostrophes are not supported (for example, PARTITION (owner="Doe's")).

  • Using Hive authorization is not supported.
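
For the column rename limitation, a rename followed by a fresh analysis could look like this sketch. The table and column names are hypothetical and only illustrate the order of the two statements:

```sql
-- Rename a column; its existing statistics are not carried over.
ALTER TABLE hive.foo.events RENAME COLUMN payload TO body;

-- Re-analyze the table so that statistics exist for the renamed column.
ANALYZE hive.foo.events;
```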