Amazon S3 #

You can query data stored on the Amazon Simple Storage Service (Amazon S3) with Starburst Enterprise platform (SEP). The most common usage is with the Hive connector and the dedicated configuration options for S3-access.

In addition, you need to store the meta data about the object storage. Typically this is done with AWS Glue, but you can also use your own Hive Metastore Service (HMS).

Amazon S3 can also be used as storage backend with other systems, and then queried with the dedicated connector:

Ensure the requirements for the connector are fulfilled.

Requirements #

To enable access to S3 with Starburst Enterprise platform (SEP), the following conditions must be met:

  • Your SEP cluster must have the necessary permissions needed to access your S3 data.
  • Requirements of the connector used for the catalog you use to access S3 need to be fulfilled. Typically this includes a meta store, either AWS Glue, or a HMS.

Ensuring SEP access to S3 #

If your S3 data is publicly available, you do not need to do anything. However, S3 data is not typically publicly available, and you must grant SEP access to it.

With the CFT support, you must select an appropriate instance profile when creating a cluster.

Validate that the selected instance profile is sufficient for SEP to read S3 data by opening an SSH connection to the coordinator and issuing the following commands:

$ aws s3 ls s3://your-bucket/path/to/dataset/
$ aws s3 cp s3://your-bucket/path/to/dataset/data-file - > /dev/null

Configuring a catalog and meta store #

A catalog must be created for your S3 bucket.

Users of the CFT can create a new HMS via a CloudFormation template. It must have an instance profile granting access to S3. Alternatively, you can use an existing HMS:

  1. Set the MetastoreType parameter to External Hive Metastore Service and ExternalMetastoreHost to IP address of your HMS.

Users of the [Kubernetes support(../../../latest/k8s.html) can perform similar configuration.

If the HMS uses authentication, review the Hive connector security documentation to configure the connector for your environment.

Reading data from S3 #

If you chose an existing HMS instance when configuring the Hive catalog, chances are that your S3 data is already mapped to SQL tables in the HMS. In that case, you should be able to query it immediately.

If you created a new HMS instance, it is likely that your S3 data is not yet mapped. In the HMS you must provide the schema of the data, the file format, and the data location. For example, if you have ORC or Parquet files in an S3 bucket, my_bucket, create a queryable table in the desired schema of your catalog.

  USE hive.default;
  CREATE TABLE orders (
       orderkey bigint,
       custkey bigint,
       orderstatus varchar(1),
       totalprice double,
       orderdate date,
       orderpriority varchar(15),
       clerk varchar(15),
       shippriority integer,
       comment varchar(79)
    ) WITH (
      external_location = 's3://my_bucket/path/to/folder',
      format = 'ORC' -- or 'PARQUET'
    );

Query the newly mapped table as usual:

  SELECT * FROM orders;

Statistics #

If your queries are complex and include joining large data sets, you may run into performance issues. This is because SEP does not know the statistical properties of the data necessary for the cost-based optimizer’s decisions.

To gather table statistics, execute the following command:

  ANALYZE orders;

Writing data to S3 #

After configuring SEP to read data from S3, you can write to it.

If your HMS contains schema(s) mapped to S3 locations, you can use them to export data to S3. If you don’t want to use existing schemas (or there are no appropriate schemas in the HMS), create a new one.

  CREATE SCHEMA hive.s3_export WITH (location = 's3://my_bucket/some/path');

Once you have a schema pointing to a location where you want to export the data, export data using a CREATE TABLE AS statement that specifies your desired file format:

  CREATE TABLE hive.s3_export.my_table
  WITH (format = 'ORC')
  AS <source query>;

Data is written to one or more files within the s3://my_bucket/some/path/my_table namespace. The number of files depends on the size of the data being exported and possible parallelization of the source of the data.