Querying AWS S3#
Ensure access to S3#
Before you start querying the data on S3, you need to make sure the Starburst Enterprise platform (SEP) cluster is allowed to query the data. If your S3 data is publicly available, you do not need to do anything. However, typically the data is not publicly available, and you need to grant SEP access. For this, you need to create an instance profile and select it when creating a cluster. You can then validate the selected instance profile is sufficient for SEP to read S3 data by opening an SSH connection to the coordinator and issuing the following commands. For more information on creating Instance profiles, refer to the Deploying SEP section.
aws s3 ls s3://your-bucket/path/to/dataset/ aws s3 cp s3://your-bucket/path/to/dataset/data-file - > /dev/null
Another requirement is that you have enabled and configured a Hive catalog. The Hive Metastore should also have an instance profile granting access to S3.
Configuring Hive catalog
If you have not yet done so, you need to configure catalog using the Hive connector in SEP. The connector uses a Hive Metastore for data discovery and is not limited to data residing on HDFS. Metastore should be viewed as a database of metadata about your data. You can use an existing Hive Metastore or start a new one.
Existing Hive Metastore:
To use an existing Hive Metastore instance with a SEP cluster, you need to set the MetastoreType parameter to
External Hive Metastore Serviceand ExternalMetastoreHost to IP address of your Hive Metastore Service (see Configuring Hive Metastore section). If the Metastore uses authentication, refer to the Configurations section of this documentation along with the following link.
New Hive Metastore:
You can create a new metastore for SEP via a CloudFormation template. See Configuring Hive Metastore.
Accessing S3 Data from SEP for the first time#
Accessing data already mapped in the Hive Metastore#
If you chose an existing Hive Metastore instance when configuring the Hive catalog, chances are that your S3 data is already mapped to SQL tables in the Metastore. In that case, you should be able to query it.
Accessing data not yet mapped in the Hive Metastore#
To access S3 data that is not yet mapped in the Hive Metastore you need to
provide the schema of the data, the file format, and the data location. For
example, if you have ORC or Parquet files in an S3 bucket,
need to execute a command similar to the following.
trino> -- select schema in which the table will be defined, must already exist trino> USE hive.default; USE trino> -- create table trino> CREATE TABLE orders ( -> orderkey bigint, -> custkey bigint, -> orderstatus varchar(1), -> totalprice double, -> orderdate date, -> orderpriority varchar(15), -> clerk varchar(15), -> shippriority integer, -> comment varchar(79) -> ) WITH ( -> external_location = 's3://my_bucket/path/to/folder', -> format = 'ORC' -- or 'PARQUET' -> ); CREATE TABLE
Now you should be able to query the newly mapped table:
trino> SELECT * FROM orders;
It’s very easy to query data stored on S3, as all you need to do is define the mapping between the S3 data and SQL tables. However, if your queries are complex and include joining large data sets, you may run into performance issues. This is because SEP does not know the statistical properties of the data and the such properties are a base for the cost-based optimizer’s decisions.
To gather table statistics for tables in Hive connector, you need to execute the following command:
trino> ANALYZE orders;
Writing S3 data with SEP#
Before you attempt to write data to S3, make sure you have configured everything necessary to read data from S3.
Create export schema#
If the Hive Metastore contains schema(s) mapped to S3 locations, you can use them to export data to S3. If you don’t want to use existing schemas (or there are no appropriate schemas in the Hive Metastore), you need to create a new one.
trino> CREATE SCHEMA hive.s3_export WITH (location = 's3://my_bucket/some/path');
Export data to S3#
Once you have a schema pointing to a location where you want to export the data,
you can issue the export using a
CREATE TABLE AS statement and select your
desired file format.
trino> CREATE TABLE hive.s3_export.my_table WITH (format = 'ORC') AS <source query>;
The data is written to one or more files within the
s3://my_bucket/some/path/my_table namespace. The number of files depends on
the size of the data being exported and possible parallelization of the source
of the data.
trino> CREATE TABLE hive.s3_export.orders_export WITH (format = 'ORC') AS SELECT * FROM tpch.sf1.orders;