Starburst catalog#

Starburst Enterprise automatically creates the starburst catalog during system startup.
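
After startup, you can confirm the catalog is available with standard SQL; the exact schemas returned depend on your deployment:

SHOW SCHEMAS FROM starburst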

Note

You cannot configure the starburst catalog using properties files. If you have an existing starburst.properties file, or if you previously created a starburst catalog using dynamic catalogs, you must remove these configurations or the server fails to start.

You must migrate all configuration properties set in starburst.properties to config.properties on both coordinator and worker nodes.
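
For example, a property previously set in starburst.properties moves unchanged into config.properties on every node. The property name shown here is purely illustrative:

# starburst.properties (remove this file)
example.feature.enabled=true

# config.properties on the coordinator and all workers
example.feature.enabled=true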

Storage functions#

Use storage functions to read from or write to external object storage locations directly within your SQL queries.

Storage functions support Amazon S3, Google Cloud Storage, and Azure Data Lake Storage.

Note

Storage functions are available as a public preview feature. Contact your Starburst account team with questions or feedback.

Configuration#

Storage functions require credentials to authenticate with object storage systems. You must specify exactly one of the following properties in config.properties:

Property name          Description
io.credentials-key     Base64-encoded credential key
io.credentials-file    Path to a JSON credential file

Warning

You must specify either io.credentials-key or io.credentials-file, but not both. If you set both properties or neither property, the system fails to start.
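
For example, a minimal config.properties entry using the file-based option looks like the following; the path is illustrative:

io.credentials-file=/etc/starburst/credentials.json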

Credential file format#

The credential file uses JSON format to define storage configurations. The connector matches locations based on prefix comparison and selects the most specific match.

Note

The connector throws an exception if a specified location matches no configuration or matches multiple configurations. Ensure your location prefixes are specific enough to avoid conflicts. For example, given entries for s3://my-bucket-east/ and s3://my-bucket-east/finance/, the location s3://my-bucket-east/finance/2024/ resolves to the more specific s3://my-bucket-east/finance/ entry.

{
    "configurations": [
        {
            "id": "s3-us-east-1",
            "location": "s3://my-bucket-east/",
            "configuration": {
                "fs.native-s3.enabled": "true",
                "s3.aws-access-key": "YOUR_ACCESS_KEY",
                "s3.aws-secret-key": "YOUR_SECRET_KEY",
                "s3.region": "us-east-1"
            }
        },
        {
            "id": "gcs-bucket",
            "location": "gs://my-gcs-bucket/",
            "configuration": {
                "fs.native-gcs.enabled": "true",
                "gcs.json-key": "YOUR_JSON_KEY"
            }
        },
        {
            "id": "azure-container",
            "location": "abfs://container@account.dfs.core.windows.net/",
            "configuration": {
                "fs.native-azure.enabled": "true",
                "azure.auth-type": "ACCESS_KEY",
                "azure.access-key": "YOUR_ACCESS_KEY"
            }
        }
    ]
}

To generate the base64-encoded key from a credential file, use the following command:

cat credentials.json | base64
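
Set the resulting output as the value of io.credentials-key in config.properties. A sketch with a placeholder value:

io.credentials-key=<base64-encoded-credentials>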

UNLOAD#

UNLOAD is a table function that exports query results directly to external storage locations.

The function is available in the io schema of the starburst catalog.

Warning

Fault-tolerant clusters do not support the UNLOAD function.

Syntax#

SELECT * FROM TABLE(io.unload(
    input => TABLE(...) [PARTITION BY col [, ...]],
    location => 'storage_path',
    format => 'file_format'
    [, compression => 'compression_type']
    [, separator => 'delimiter']
    [, header => true|false]
  )
)

Parameters#

The following parameters are required:

  • input: The source data

  • location: The target storage path

  • format: The output file format

The following parameters are optional:

  • compression: The compression algorithm to apply to files. Choose from the following options:

    • NONE (default)

    • SNAPPY

    • LZ4

    • ZSTD

    • GZIP

  • separator: Field delimiter for CSV and TEXTFILE formats

  • header: Boolean that specifies whether to include column headers in the output file when using CSV and TEXTFILE formats. Defaults to false.

Supported formats#

The UNLOAD function supports the following file formats:

  • ORC

  • PARQUET

  • AVRO

  • CSV

  • JSON

  • TEXTFILE

  • RCBINARY

  • RCTEXT

  • SEQUENCEFILE

  • OPENX_JSON

For more information, see Object storage file formats.

Examples#

The following example unloads the orderkey, custkey, and orderstatus columns from the tpch.sf1.orders table, partitioned by orderstatus, to the s3://mybucket/my/unload/location folder in ORC file format:

SELECT
  *
FROM
  TABLE (
    io.unload (
      input => TABLE (
        SELECT
          orderkey,
          custkey,
          orderstatus
        FROM
          tpch.sf1.orders
      )
      PARTITION BY
        (orderstatus),
      location => 's3://mybucket/my/unload/location',
      format => 'ORC'
    )
  )

The following example unloads the entire tpch.sf1.orders table, partitioned by orderstatus:

SELECT
  *
FROM
  TABLE (
    io.unload (
      input => TABLE (tpch.sf1.orders)
      PARTITION BY
        (orderstatus),
      location => 's3://mybucket/my/unload/location',
      format => 'ORC'
    )
  )

The following example unloads the table into multiple partitions:

SELECT
  *
FROM
  TABLE (
    io.unload (
      input => TABLE (
        SELECT
          orderkey,
          custkey,
          orderdate,
          orderstatus
        FROM
          tpch.sf1.orders
      )
      PARTITION BY
        (orderdate, orderstatus),
      location => 's3://mybucket/my/unload/location',
      format => 'TEXTFILE',
      compression => 'GZIP',
      separator => '|'
    )
  )

The following example unloads the table as a compressed, delimited text file with column headers:

SELECT
  *
FROM
  TABLE (
    io.unload (
      input => TABLE (
        SELECT
          orderkey,
          custkey,
          orderstatus
        FROM
          tpch.sf1.orders
      )
      PARTITION BY
        (orderstatus),
      location => 's3://mybucket/my/unload/location',
      format => 'TEXTFILE',
      compression => 'GZIP',
      separator => '|',
      header => true
    )
  )

LOAD#

LOAD is a table function that reads data from external storage locations directly into queries.

The function is available in the io schema of the starburst catalog.

Warning

Fault-tolerant clusters do not support the LOAD function.

Syntax#

SELECT * FROM TABLE(io.load(
    location => 'storage_path'
    [, format => 'file_format']
    [, columns => DESCRIPTOR(column_name type, ...)]
  )
)

Parameters#

The following parameter is required:

  • location: The source storage path

The following parameters are optional:

  • format: The input file format

  • columns: Defines column names and types

Note

You must specify both the format and columns parameters together, or omit both. When you omit these parameters, the LOAD function automatically discovers the format and schema.

Supported formats#

The LOAD function supports the following file formats:

  • ORC

  • PARQUET

  • AVRO

  • CSV

  • JSON

  • TEXTFILE

  • RCBINARY

  • RCTEXT

  • SEQUENCEFILE

  • OPENX_JSON

For more information, see Object storage file formats.

Examples#

The following example loads Parquet files and relies on automatic discovery of both the file format and the schema:

SELECT * FROM TABLE(
    io.load(
        location => 's3://mybucket/data/'
    )
)

The following example loads ORC files from a directory and defines an explicit schema:

SELECT * FROM TABLE(
    io.load(
        location => 's3://mybucket/data/orc_files/',
        format => 'ORC',
        columns => DESCRIPTOR(
            "customer_id" BIGINT,
            "customer_name" VARCHAR,
            "order_date" DATE
        )
    )
)
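
The two functions compose, so you can read back data that UNLOAD wrote. The following sketch assumes s3://mybucket/roundtrip/ is an empty, writable location: it exports two columns without partitioning, then loads the resulting ORC files with an explicit schema:

-- export
SELECT * FROM TABLE(
    io.unload(
        input => TABLE(SELECT orderkey, orderstatus FROM tpch.sf1.orders),
        location => 's3://mybucket/roundtrip/',
        format => 'ORC'
    )
)

-- read the exported files back
SELECT * FROM TABLE(
    io.load(
        location => 's3://mybucket/roundtrip/',
        format => 'ORC',
        columns => DESCRIPTOR(
            "orderkey" BIGINT,
            "orderstatus" VARCHAR
        )
    )
)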