Starburst Galaxy

File ingestion #

    Starburst Galaxy’s file ingestion service lets you continuously ingest data from JSON files in an AWS S3 bucket location into a managed Iceberg table, also known as a live table. Live tables are stored in your AWS S3 bucket.

    Live tables can be queried as part of a Starburst Galaxy cluster or by any query engine that can read Iceberg tables.

    The Data ingest option is only available to users assigned to roles that have the Manage ingest streams account level privilege.

    Configure Starburst Galaxy’s file ingest by clicking Data > Data ingest in the navigation menu.

    Before you begin #

    Galaxy’s file ingest is supported on Amazon Web Services. You must provide:

    • An Amazon S3 Standard tier storage location for which you have read and write access.
    • AWS credentials such as an IAM role or API key and API secret to allow access to the S3 bucket.
    • A Galaxy S3 catalog configured to use an AWS Glue metastore or a Starburst Galaxy metastore.
    • A Galaxy cluster located in the AWS us-east-1 region. To inquire about support for other regions, contact Starburst Support.

    Galaxy ingests uncompressed JSON files, newline-delimited JSON (NDJSON) files, and files compressed with ZSTD, LZ4, SNAPPY, GZIP, or DEFLATE.

    The files are subject to the following constraints:

    • Maximum 10GB for a single file
    • Maximum 500GB total data per ingest interval
    • Maximum 500,000 files in the source path
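The limits above can be checked before files land in the source path. The following is a minimal sketch, not part of Galaxy: the function name `check_ingest_limits` is hypothetical, treating 1 GB as 10**9 bytes is an assumption (the limits do not say whether GB or GiB is meant), and the sketch assumes every listed file is pending within a single ingest interval.

```python
# Limits quoted from the constraints above. Treating 1 GB as 10**9 bytes is an
# assumption; the documentation does not say whether GB or GiB is meant.
GB = 10**9
MAX_FILE_BYTES = 10 * GB
MAX_INTERVAL_BYTES = 500 * GB
MAX_FILES_IN_PATH = 500_000


def check_ingest_limits(file_sizes):
    """Return a list of limit violations for the files in one source path.

    file_sizes maps an object key to its size in bytes. All files are
    assumed to be pending for the same ingest interval.
    """
    problems = []
    for key, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{key}: {size} bytes exceeds the 10GB single-file limit")
    total = sum(file_sizes.values())
    if total > MAX_INTERVAL_BYTES:
        problems.append(f"total of {total} bytes exceeds the 500GB per-interval limit")
    if len(file_sizes) > MAX_FILES_IN_PATH:
        problems.append(f"{len(file_sizes)} files exceeds the 500,000 file limit")
    return problems


# Hypothetical object keys and sizes, for illustration only.
sizes = {"events/a.json": 2 * GB, "events/b.json": 11 * GB}
for problem in check_ingest_limits(sizes):
    print(problem)
```

The sizes would typically come from an S3 listing; how you obtain them is outside the scope of this sketch.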

    Getting started #

    To begin file ingestion, create an ingest source and live table.

    The following sections walk you through the configuration process:

    Connect to a source #

    From the navigation menu, go to Data > Data ingest, then click Connect new source.

    Select the Amazon S3 source, then click Next to go to the Create new source dialog:

    • In the File ingest source details section, enter a name for the source and a description.

    • In the Configure file ingest source section, enter the name of the S3 bucket in the S3 bucket field, and the S3 file prefix in the S3 file prefix field.

      Live tables created using this source specify an exact location under the prefix. If you are creating multiple file ingest live tables, we recommend choosing a prefix that points to the root of all the files you want to ingest.

    • In the Configure auth for file ingest section, authenticate with an API key and API secret, or select Cross account IAM role and choose a role from the drop-down menu.

    • Click Test connection to confirm that you have access to the source. If the test fails, check your entries, correct any errors, and try again.

    • If the connection is successful, click Save new source.

    • You can create a live table now or at a later time.

      In the dialog, click No, do this later to create a live table at a later time and exit the configuration. To create a live table now, click Yes, select topic.

    Select a target and sub-path #

    To create a live table at any time, select a data source from the Connected sources section, and click Create live table.

    • In the Source section, you can select a different data source using the Select existing connection drop-down menu or click Connect new source to create a new data source.

    • In the Live table target section, select a Catalog and Schema from the respective drop-down menus, and provide a Table name.

    • Under Enter source S3 suffix, enter the suffix of the S3 source location.

    • In the Choose the file format section, choose a file format.

    • In the Select the polling interval section, select a polling frequency to specify the duration between scans: 30 minutes, 60 minutes, 90 minutes, or 120 minutes. If files are not processed during the specified interval, Galaxy attempts to ingest them at the next polling interval.

    • Click Test connection to confirm that you have access to the data. If the test fails, check your entries, correct any errors, and try again.

    • If the connection is successful, click Map columns.
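The source prefix and the live table suffix together identify where a live table reads from. The sketch below shows one plausible way the two combine, assuming the suffix is simply appended to the prefix under the bucket. That joining rule, the function name `live_table_location`, and the bucket and path values are all assumptions for illustration, not documented Galaxy behavior.

```python
def live_table_location(bucket, prefix, suffix):
    """Join a bucket, source prefix, and live table suffix into one S3 URI.

    Assumes the suffix is a path relative to the prefix. Empty parts and
    redundant slashes are dropped.
    """
    parts = [p.strip("/") for p in (prefix, suffix) if p and p.strip("/")]
    return "s3://" + "/".join([bucket.strip("/")] + parts) + "/"


# Hypothetical bucket, prefix, and suffix values.
print(live_table_location("my-bucket", "landing/", "/orders/2024"))
# prints s3://my-bucket/landing/orders/2024/
```

Choosing a prefix at the root of all files, as recommended above, lets several live tables share one source while each uses its own suffix.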

    Map to columns #

    Configure the mapping from the JSON-formatted data source files to the target live Iceberg table.

    Starburst Galaxy automatically suggests a schema by inferring from the source S3 location. If the S3 source location does not contain supported JSON format files, you must load a record sample or add columns manually.

    • Use the Edit column mapping panel to map using the following columns:

      • Source path: The location of the record information within the JSON row.
      • Column name: Provide a column name for the live table.
      • Data type: Specify a data type for the live table column.

    The options menu in the header section includes a Reload detected columns option that lets you restore any altered field entries to the original inferred values.

    Use the options menu at the end of each row to add or remove columns.

    Click the visibility icon to show or hide any nested columns.

    • The Record sample panel shows the JSON message sample used to generate the columns. You can manually enter a new JSON sample by clicking the upload icon. Type or paste the new JSON sample in the text area, then click Load sample.

    • The Table sample panel previews the mapped table.

    • To complete the configuration, click Create schema.

    Data ingest begins in approximately 1-2 minutes. You can run SELECT queries on the live table like you would on any other table.
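To make the column mapping concrete, the sketch below applies a mapping of source path, column name, and data type to one JSON record. It is an illustration only: the dot-separated path syntax, the `MAPPING` entries, and the helper names are assumptions, and the data types are carried along without being enforced.

```python
import json

# Hypothetical mapping rows: (source path, column name, data type).
# The dot-separated path syntax is an assumption for illustration only.
MAPPING = [
    ("id", "order_id", "bigint"),
    ("customer.name", "customer_name", "varchar"),
    ("customer.address.city", "city", "varchar"),
]


def extract(record, source_path):
    """Walk a nested dict along a dot-separated path; return None if missing."""
    value = record
    for key in source_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def map_record(line, mapping=MAPPING):
    """Turn one JSON line into a {column name: value} row."""
    record = json.loads(line)
    return {column: extract(record, path) for path, column, _type in mapping}


sample = '{"id": 7, "customer": {"name": "Ada", "address": {"city": "Oslo"}}}'
print(map_record(sample))
# prints {'order_id': 7, 'customer_name': 'Ada', 'city': 'Oslo'}
```

Nested fields become flat columns here, which mirrors the idea of pointing each column at a location within the JSON row.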

    Connected sources #

    All connected data sources appear in this section in the order they were created. Create new live table lets you create a new live table.

    Delete an ingest source #

    Deleting an ingest source decommissions all associated live tables and converts them into unmanaged Iceberg tables.

    To delete a data ingest source, follow these steps:

    • In the Connected sources section, locate the data source of interest.
    • Click the options menu.
    • Select Delete ingest source.
    • In the Delete ingest source dialog, click Yes, delete.
    • If a live table is associated with the data ingest source, you must delete it first.

    Live table management #

    Manage live tables from the list of live tables, which shows the following columns:

    • Table name: The name of the table.
    • Catalog name: The name of the catalog.
    • Schema name: The name of the schema.
    • Source: The icon representing the data source.
    • Status: The status of the live table.

    The total number of tables appears next to the search field. By default, Starburst Galaxy limits an account to five live tables. To increase your quota limit, contact Starburst Support.

    The following sections detail important management and operational information about live tables.

    Connect a live table #

    To connect a live table, click Connect new live table, then choose one of the following options in the Source section:

    • To select an existing connection, select a source from the Select existing connection drop-down menu, then follow the steps in Select a target and sub-path.
    • To connect a new source, click Create new source. Follow the steps in Connect to a source.

    Start and stop ingestion #

    You can start and stop data ingestion from the live tables list. Stop prevents data from being ingested into the live table. Start resumes ingestion from where it left off.

    DDL and DML #

    You cannot directly modify the live table definition or delete or update data with SQL statements.

    You cannot set a data retention period or purge data from a live table connected to a file ingest data source. To filter out unwanted data, create a view on the live table with a predicate to filter out the data you do not want, or perform the filtering as part of a data pipeline for the data in the source location.
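The view-based filtering described above can be sketched as a small helper that builds the SQL statement. The function name `filtered_view_sql`, the catalog, schema, table, and column names, and the predicate are all hypothetical. The helper does no escaping, so pass only trusted identifiers and predicates.

```python
def filtered_view_sql(view_name, live_table, predicate):
    """Build a CREATE VIEW statement that keeps only rows matching predicate.

    No quoting or escaping is performed; inputs must be trusted.
    """
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT * FROM {live_table}\n"
        f"WHERE {predicate}"
    )


# Hypothetical names: a live table "orders" with an "event_date" column.
print(
    filtered_view_sql(
        "lake.sales.recent_orders",
        "lake.sales.orders",
        "event_date >= DATE '2024-01-01'",
    )
)
```

Queries then target the view instead of the live table, so the unwanted rows stay in storage but never appear in results.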

    If you need to run DDL or DML operations as you would on any other Iceberg table, you can decommission the live table.

    Data management #

    Galaxy performs the following data management operations on live tables automatically:

    • Compaction: Improves performance by optimizing your data file size.
    • Snapshot expiration: Reduces storage by deleting older data snapshots. This action runs several times per day and expires snapshots older than 30 days.

    Errors table #

    Every live table is associated with an errors table that serves as the dead letter table. If Galaxy encounters an issue while ingesting a file, such as an unrecognized file format, a file that exceeds the allowed limit, or a syntax error in a JSON record, it adds an entry for that file to the errors table and skips the file.

    You can query the errors table the same way you query a live table. The table is hidden and does not show up when running SHOW TABLES. The table name follows the convention: "table_name__raw$errors". You must enclose the name in quotes or the query engine fails to parse the table name.
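Because the errors table name contains a `$` character, the quoting is easy to get wrong. The sketch below builds a query from the naming convention above. The function name `errors_table_query` and the catalog, schema, and table names are hypothetical, and `SELECT *` is used because the columns of the errors table are not listed here.

```python
def errors_table_query(catalog, schema, table, limit=10):
    """Build a SELECT against the hidden errors table of a live table."""
    # Naming convention from the text: <table_name>__raw$errors, in double quotes.
    errors_table = f'"{table}__raw$errors"'
    return f"SELECT * FROM {catalog}.{schema}.{errors_table} LIMIT {int(limit)}"


# Hypothetical catalog, schema, and live table names.
print(errors_table_query("lake", "sales", "orders"))
# prints SELECT * FROM lake.sales."orders__raw$errors" LIMIT 10
```

Only the table name is quoted; the catalog and schema are left as plain identifiers in this sketch.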

    Decommission a live table #

    Decommissioning a live table deletes it from file ingestion and stores it as an unmanaged Iceberg table.

    To decommission a live table, follow these steps:

    • Locate the table of interest.
    • Click the options menu, and select Delete live table.
    • In the Delete ingest source dialog, click Yes, delete.

    Best practices #

    Adhere to the following recommendations to ensure the best results when ingesting file data.

    • Confirm that the source JSON files contain newline delimited records before attempting to create a live table schema mapping or before attempting to ingest file data into a live table.

    • Verify that the data in the source files was successfully ingested into the live table before purging the source files. If Galaxy has not ingested a file after several polling intervals have passed, check the errors table for any detected errors.

    • Exceeding the 500,000 file limit stops the ingestion process. Ensure that the total number of files in the S3 source path does not exceed the maximum limit by deleting files from the S3 source location after they have been successfully ingested.
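The first recommendation above, confirming that files contain newline-delimited records, can be checked locally before upload. This is a minimal sketch using only the Python standard library. The function name `find_ndjson_errors` is hypothetical, and two choices are assumptions of the sketch rather than documented Galaxy rules: blank lines are skipped, and each record is required to be a JSON object.

```python
import json


def find_ndjson_errors(text):
    """Return (line number, message) pairs for lines that are not valid records."""
    errors = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # skipping blank lines is a choice made by this sketch
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            errors.append((number, "record is not a JSON object"))
    return errors


good = '{"id": 1}\n{"id": 2}\n'
pretty_printed = '{\n  "id": 1\n}'
print(find_ndjson_errors(good))            # prints []
print(len(find_ndjson_errors(pretty_printed)))  # prints 3
```

A pretty-printed JSON document fails on every line, which is the most common reason a file that parses as JSON is still not valid NDJSON.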

    Security #

    Any modifications made to the data or metadata files may corrupt the Iceberg table. Starburst Galaxy cannot ingest to or manage data in a corrupted table.

    Recommended: Apply the principle of least privilege to users who are granted permissions to perform operations on data in the S3 bucket where Iceberg tables are stored.

    Recommended: Place Iceberg managed tables in a separate bucket with tighter AWS governance.

    Limitations #

    File ingest throughput and sizing cannot be adjusted.