AWS Glue #

AWS Glue data catalogs are a supported metadata catalog for Starburst Enterprise platform (SEP). You can use Glue as an alternative to the Hive Metastore to query your S3 data with connectors that support it, such as the Hive connector.

Ensure that the requirements for the connector you use are fulfilled.

Requirements #

Before you configure the Glue metastore, verify the following prerequisites:

  • Your SEP instance must have permissions to access both S3 and Glue AWS services.
  • For CloudFormation template (CFT) deployments, review the IAM role permissions requirements if you are providing your own IAM instance profile.
  • When launching the AMI manually, choose an IAM role that satisfies these requirements.

Configuration #

  1. Configure the catalog to use Glue as the metastore in the catalog properties file:

         connector.name=hive
         hive.metastore=glue

  2. Add any other Glue properties you need, such as the AWS region or the credentials to use.

  3. Restart the cluster to apply the changes.
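
A catalog properties file with additional Glue settings might look like the following sketch. The property names hive.metastore.glue.region and hive.metastore.glue.default-warehouse-dir are assumed from the Hive connector's Glue options, and the region and bucket values are placeholders. Verify both against the documentation for your SEP release before use:

    connector.name=hive
    hive.metastore=glue
    # Assumed example values, adjust for your environment
    hive.metastore.glue.region=us-east-1
    hive.metastore.glue.default-warehouse-dir=s3://my-bucket/warehouse/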

AWS Glue with SEP AMI #

You can use the SEP AMI from the AWS Marketplace with the Hive connector configured to use Glue.

After completing the configuration described in the preceding section, restart SEP on the AMI instance:

    sudo service starburst restart

AWS Glue with CloudFormation template #

When using the CloudFormation template in AWS, you can use Glue by choosing AWS Glue Data Catalog in the MetastoreType field, found in the Starburst Enterprise Configuration section of the Stack Creation form.

SEP with AWS Glue usage #

When configured, the Glue data catalog is available through the corresponding catalog from within the CLI or any other SEP connection. You must specify the location of the data on S3, either for the entire schema or at the table level. For example, to create a schema myschema in the Glue data catalog, with the S3 base directory (the root folder for per-table subdirectories) pointing to the root of the my-bucket S3 bucket, run the following SQL command:

    CREATE SCHEMA mycatalog.myschema
    WITH (location = 's3://my-bucket/')
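
The location can also be set per table instead. As a sketch, assuming the Hive connector's external_location and format table properties and a hypothetical orders table whose data files already exist under the given prefix:

    CREATE TABLE mycatalog.myschema.orders (
      order_id bigint,
      customer varchar,
      order_date date
    )
    WITH (
      external_location = 's3://my-bucket/orders/',
      format = 'PARQUET'
    )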

You can also create and edit the schema and tables directly from Glue. In Glue terminology, a schema is referred to as a “database”.
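
To check that databases and tables defined in Glue are visible from SEP, you can list them through the catalog. The catalog and schema names are the illustrative ones from the preceding example:

    SHOW SCHEMAS FROM mycatalog;
    SHOW TABLES FROM mycatalog.myschema;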

Table and column statistics support #

SEP supports standard AWS Glue table and column statistics via the AWS Glue API. You can create and manage the statistics with the ANALYZE statement.
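
For example, the following statements collect statistics for a table and then display them. The table name events is hypothetical, and the partitions property assumes the table is partitioned by a single event_date column:

    ANALYZE mycatalog.myschema.events;

    -- Restrict collection to one partition of a partitioned table
    ANALYZE mycatalog.myschema.events
    WITH (partitions = ARRAY[ARRAY['2024-01-01']]);

    SHOW STATS FOR mycatalog.myschema.events;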

Legacy statistics #

SEP releases prior to 354-e use custom statistics. These SEP-based statistics are no longer supported in 354-e and later releases, and are replaced by the standard Glue statistics.

You can enable read-only access to the SEP-based table statistics for transition purposes. Set the property hive.metastore.glue.read-properties-based-column-statistics to true during the migration, until the standard statistics are available. The legacy statistics are only used if standard statistics are not present for a table or a partition.
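
In the catalog properties file, this transitional setting is a single line added to the Glue configuration:

    connector.name=hive
    hive.metastore=glue
    # Read legacy SEP-based statistics while standard statistics are still missing
    hive.metastore.glue.read-properties-based-column-statistics=true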

SEP-based statistics are still stored in JSON format as Glue table and partition parameters. After the migration these can be deleted.

Known limitations of AWS Glue support #

The following limitations apply when using SEP with the Glue data catalog:

  • Statistics are not preserved when a column is renamed. Tables with renamed columns must be re-analyzed.
  • Renaming tables from within AWS Glue is not supported.
  • Partition values containing quotes and apostrophes are not supported (for example, PARTITION (owner="Doe's")).
  • Using Hive authorization is not supported.
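
As an illustration of the first limitation, a column rename would be followed by a fresh ANALYZE run to rebuild the statistics. The table and column names are hypothetical:

    ALTER TABLE mycatalog.myschema.orders RENAME COLUMN customer TO customer_name;
    ANALYZE mycatalog.myschema.orders;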