Storage functions#

Use storage functions to read from or write to external object storage locations directly within your SQL queries.

Storage functions support Amazon S3, Google Cloud Storage, Azure Storage, and local file systems.

Note

Storage functions are available as a public preview feature. Contact your Starburst account team with questions or feedback.

Configuration#

To use storage functions, you must configure file system access and credentials.

File system access#

You must enable and configure file system access for each storage system you intend to use. Storage functions support the following file systems:

  • Amazon S3

  • Google Cloud Storage

  • Azure Storage

  • Local file system

Credentials#

Storage functions require credentials to authenticate with object storage systems. You must specify exactly one of the following properties in config.properties:

  • io.credentials-key: Base64-encoded credential key

  • io.credentials-file: Path to a JSON credential file

Warning

You must specify either io.credentials-key or io.credentials-file, but not both. If you set both properties or neither property, the system fails to start.
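For example, to reference a credential file, set only that property in config.properties (the path shown is illustrative):

```properties
# Set exactly one of the two credential properties, never both.
io.credentials-file=/etc/starburst/credentials.json
```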

Credential file format#

The credential file uses JSON format to define storage configurations. The connector matches locations based on prefix comparison and selects the most specific match.

Note

The connector throws an exception if a specified location matches no configuration or matches multiple configurations. Ensure your location prefixes are specific enough to avoid conflicts.

{
    "configurations": [
        {
            "id": "s3-us-east-1",
            "location": "s3://my-bucket-east/",
            "configuration": {
                "fs.native-s3.enabled": "true",
                "s3.aws-access-key": "YOUR_ACCESS_KEY",
                "s3.aws-secret-key": "YOUR_SECRET_KEY",
                "s3.region": "us-east-1"
            }
        },
        {
            "id": "gcs-bucket",
            "location": "gs://my-gcs-bucket/",
            "configuration": {
                "fs.native-gcs.enabled": "true",
                "gcs.json-key": "YOUR_JSON_KEY"
            }
        },
        {
            "id": "azure-container",
            "location": "abfs://container@account.dfs.core.windows.net/",
            "configuration": {
                "fs.native-azure.enabled": "true",
                "azure.auth-type": "ACCESS_KEY",
                "azure.access-key": "YOUR_ACCESS_KEY"
            }
        }
    ]
}
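The prefix-matching behavior described above can be sketched as follows. This is an illustrative Python sketch, not Starburst's implementation: it selects the configuration with the longest matching location prefix, and raises an error when no configuration matches or when multiple configurations tie, mirroring the documented exception behavior.

```python
# Illustrative sketch of most-specific-prefix matching for storage
# configurations. Not the actual Starburst implementation.

def select_configuration(location, configurations):
    """Return the configuration whose location prefix is the longest
    match for the given storage location."""
    matches = [c for c in configurations if location.startswith(c["location"])]
    if not matches:
        raise ValueError(f"No configuration matches location: {location}")
    longest = max(len(c["location"]) for c in matches)
    best = [c for c in matches if len(c["location"]) == longest]
    if len(best) > 1:
        raise ValueError(f"Multiple configurations match location: {location}")
    return best[0]

# Two overlapping prefixes: the more specific one wins for paths under logs/.
configurations = [
    {"id": "s3-us-east-1", "location": "s3://my-bucket-east/"},
    {"id": "s3-us-east-1-logs", "location": "s3://my-bucket-east/logs/"},
]

print(select_configuration("s3://my-bucket-east/logs/2024/", configurations)["id"])
```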

To generate the base64-encoded key from a credential file, use the following command:

cat credentials.json | base64
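As an alternative to the shell command, the following Python sketch produces the same base64 value for io.credentials-key, and additionally validates that the input is well-formed JSON before encoding. The inline document is illustrative; in practice, read your credentials.json as bytes.

```python
# Validate a JSON credential document and base64-encode it for
# use as the io.credentials-key value.
import base64
import json

def encode_credentials(raw: bytes) -> str:
    """Return the base64 encoding of a JSON credential document."""
    json.loads(raw)  # fail fast on malformed JSON
    return base64.b64encode(raw).decode("ascii")

# Inline example; in practice: open("credentials.json", "rb").read()
encoded = encode_credentials(b'{"configurations": []}')
```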

starburst.io.unload#

The starburst.io.unload function is a table function that exports query results directly to external storage locations.

Warning

Fault-tolerant clusters do not support the starburst.io.unload function.

Syntax#

SELECT * FROM TABLE(starburst.io.unload(
    input => TABLE(...) [PARTITION BY col [, ...]],
    location => 'storage_path',
    format => 'file_format'
    [, compression => 'compression_type']
    [, separator => 'delimiter']
    [, header => true|false]
    )
)

Parameters#

The following parameters are required:

  • input: The source data

  • location: The target storage path

  • format: The output file format

The following parameters are optional:

  • compression: The compression algorithm to apply to files. Choose from the following options:

    • NONE (default)

    • SNAPPY

    • LZ4

    • ZSTD

    • GZIP

  • separator: Field delimiter for CSV and TEXTFILE formats

  • header: Boolean that specifies whether to include column headers in the output file when using CSV and TEXTFILE formats. Defaults to false.

Supported formats#

The starburst.io.unload function supports the following file formats:

  • ORC

  • PARQUET

  • AVRO

  • CSV

  • JSON

  • TEXTFILE

  • RCBINARY

  • RCTEXT

  • SEQUENCEFILE

  • OPENX_JSON

For more information, see Object storage file formats.

Examples#

The following example unloads the orderkey, custkey, and orderstatus columns from the tpch.sf1.orders table to the s3://mybucket/my/unload/location folder in ORC file format:

SELECT * FROM TABLE(
    starburst.io.unload(
        input => TABLE(
            SELECT orderkey, custkey, orderstatus
            FROM tpch.sf1.orders
        ) PARTITION BY (orderstatus),
        location => 's3://mybucket/my/unload/location',
        format => 'ORC'
    )
)

The following example unloads the entire tpch.sf1.orders table:

SELECT * FROM TABLE(
    starburst.io.unload(
        input => TABLE(tpch.sf1.orders)
        PARTITION BY (orderstatus),
        location => 's3://mybucket/my/unload/location',
        format => 'ORC'
    )
)

The following example unloads the table into multiple partitions:

SELECT * FROM TABLE(
    starburst.io.unload(
        input => TABLE(
            SELECT orderkey, custkey, orderdate, orderstatus
            FROM tpch.sf1.orders
        ) PARTITION BY (orderdate, orderstatus),
        location => 's3://mybucket/my/unload/location',
        format => 'TEXTFILE',
        compression => 'GZIP',
        separator => '|'
    )
)

The following example unloads the table as a compressed, delimited text file with column headers:

SELECT * FROM TABLE(
    starburst.io.unload(
        input => TABLE(
            SELECT orderkey, custkey, orderstatus
            FROM tpch.sf1.orders
        ) PARTITION BY (orderstatus),
        location => 's3://mybucket/my/unload/location',
        format => 'TEXTFILE',
        compression => 'GZIP',
        separator => '|',
        header => true
    )
)

starburst.io.load#

The starburst.io.load function is a table function that reads data from external storage locations directly into queries.

Warning

Fault-tolerant clusters do not support the starburst.io.load function.

Syntax#

SELECT * FROM TABLE(starburst.io.load(
    location => 'storage_path'
    [, format => 'file_format']
    [, columns => DESCRIPTOR(column_name type, ...)]
    )
)

Parameters#

The following parameter is required:

  • location: The source storage path

The following parameters are optional:

  • format: The input file format

  • columns: Defines column names and types

Note

You must specify both the format and columns parameters together, or omit both. When you omit these parameters, the starburst.io.load function automatically discovers the format and schema.

Supported formats#

The starburst.io.load function supports the following file formats:

  • ORC

  • PARQUET

  • AVRO

  • CSV

  • JSON

  • TEXTFILE

  • RCBINARY

  • RCTEXT

  • SEQUENCEFILE

  • OPENX_JSON

For more information, see Object storage file formats.

Examples#

The following example loads Parquet files from a directory and relies on automatic format and schema discovery:

SELECT * FROM TABLE(
    starburst.io.load(
        location => 's3://mybucket/data/'
    )
)

The following example loads ORC files from a directory and defines an explicit schema:

SELECT * FROM TABLE(
    starburst.io.load(
        location => 's3://mybucket/data/orc_files/',
        format => 'ORC',
        columns => DESCRIPTOR(
            "customer_id" BIGINT,
            "customer_name" VARCHAR,
            "order_date" DATE
        )
    )
)