Kafka streaming ingestion #

    Starburst Galaxy’s streaming ingestion service lets you continuously ingest data from a Kafka topic into a managed Iceberg table, also known as a live table. Live tables are stored in your AWS S3 bucket.

    Live tables can be queried by Galaxy clusters or by any other engine that can read Iceberg tables.

    To ingest data using Kafka, your role must have the Manage ingest streams account level privilege.

    Configure Starburst Galaxy’s data ingest by clicking Data > Data ingest in the navigation menu. The Data ingest option is only present when your username is a member of a role that has the Manage ingest streams privilege.

    Before you begin #

    Galaxy streaming ingest is supported on Amazon Web Services. You must provide:

    • An AWS S3 bucket location to which you have read and write access.
    • AWS credentials such as an IAM role to allow access to the S3 bucket.
    • A known working Kafka stream with at least one active topic.
    • A Kafka topic with a minimum data retention period of seven days.
    • API key and API secret credentials to connect to the Kafka stream.

    Starburst Galaxy supports streaming data from Kafka streaming services including Apache Kafka, Confluent Cloud, and Amazon MSK.

    AWS PrivateLink is available for Apache Kafka and Confluent Cloud, and multi-VPC private connectivity is supported for Amazon MSK. To configure Galaxy to connect to your Kafka service with private connectivity, contact Starburst Support.

    Getting started #

    To begin ingesting stream data, create an ingest source and live table.

    The following sections walk you through the configuration process:

    Connect to a stream source #

    From the Data > Data ingest page in the navigation menu:

    • If this is the first ingest source for this Galaxy account, click Connect new source.
    • If you are adding an additional ingest source, click Create new source.

    Select a Kafka data source type, then click Next.

    In the Connect new source dialog:

    • In the Source details section, enter a name for this stream source and a description.

    • In the Configure connection to Apache Kafka section:

      • Enter one or more Kafka brokers as host:port. Separate multiple brokers with a comma; an example broker list follows these steps.

      • Select Username/password to authenticate with an API key and API secret, or select Cross account IAM role (MSK) and choose a cross account IAM role from the drop-down menu. For MSK streams, you must select the IAM role option.

    • Click Test connection to confirm that you have access to the stream. If the test fails, check your entries, correct any errors, and try again.

    • If the connection is successful, click Save new source.

    • You can create a live table now or postpone that step.

      In the dialog, click No, do this later to create a live table at a later time and exit the configuration. To create a live table now, click Yes, select topic.
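
    The following is a hypothetical broker list; the hostnames and port are placeholders for your own brokers:

      broker-1.example.com:9092,broker-2.example.com:9092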

    Select a target and topic #

    You can create a live table at any time by clicking Connect a new live table.

    • In the Set up live table section, select a Catalog and Schema from the respective drop-down menus, and provide a Table name.

    • In the Select Kafka topic section, select the Topic name from the drop-down menu.

    • In the Data retention section:
      • Set a data retention threshold. By default, Retain forever is preselected to specify that all data is to be stored in the live table indefinitely. Select a different value to specify how long data is retained before it is automatically purged: 1 day, 7 days, 14 days, 30 days, or 90 days.
      • Set a throughput limit: 1 MB/s, 2 MB/s, 4 MB/s, 8 MB/s, or 16 MB/s. By default, 8 MB/s is preselected.
    • In the Select streaming ingest start point section, select Start from latest message to begin streaming new data or Start from earliest messages to ingest existing data plus new data. New data is delayed until existing data is ingested.

    • Click Test connection to confirm that you have access to the data. If the test fails, check your entries, correct any errors, and try again.

    • If the connection is successful, click Create target.

    Map to columns #

    Configure the mapping from the JSON-formatted Kafka message to the target live Iceberg table.

    Starburst Galaxy automatically suggests a schema inferred from the Kafka messages on the topic. Modify the inferred schema by changing field entries and adding or removing columns; a hypothetical mapping example follows these steps.

    • Use the Edit column mapping panel to define the mapping with the following fields:

      • Source path: The location of the record information within the JSON row.
      • Column name: Provide a column name for the live table.
      • Data type: Specify a data type for the live table column.
      • Varchar type: For a VARCHAR type, specify a SCALAR or JSON VARCHAR type. For TIMESTAMP and TIMESTAMP WITH TIME ZONE types, specify an iso8601 or unixtime type.

    The options menu in the header section includes a Reload detected columns option that lets you restore any altered field entries to the original inferred values.

    Use the options menu at the end of each row to add or remove columns.

    Click the visibility icon to show or hide nested columns.

    • The Message sample panel shows the JSON message sample used to generate the columns. If your Kafka topic is new and does not have any messages for Galaxy to infer from, you can manually enter a new JSON sample by clicking the upload icon. Type or paste the new JSON sample in the text area, then click Load sample.

    • The Table sample panel previews the mapped table.

    • To complete the configuration, click Map stream to columns.
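
    As a hypothetical illustration only (the field names, paths, and types below are not taken from your topic), a Kafka message such as:

      {"order_id": 1042, "status": "shipped", "updated_at": "2024-05-01T12:00:00Z"}

    might be mapped to the live table as follows, assuming top-level fields whose source path matches the field name:

    • Source path order_id, column name order_id, data type BIGINT.
    • Source path status, column name status, data type VARCHAR with the SCALAR type.
    • Source path updated_at, column name updated_at, data type TIMESTAMP with the iso8601 type.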

    Data ingest begins in approximately 1-2 minutes. You can run SELECT queries on the live table like you would on any other table.
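
    For example, assuming a hypothetical live table named orders_live in catalog my_catalog and schema my_schema, a first query might look like:

      SELECT *
      FROM my_catalog.my_schema.orders_live
      LIMIT 10;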

    Connected ingest sources #

    All connected data sources appear in this section in the order they were created. Load topic lets you create a new live table.

    Delete an ingest source #

    Deleting an ingest source decommissions all associated live tables and converts them into unmanaged Iceberg tables.

    To delete a data ingest source, follow these steps:

    • In the Connected sources section, locate the data source of interest.
    • Click the options menu.
    • Select Delete ingest source.
    • In the Delete ingest source dialog, click Yes, delete.
    • If a live table is associated with the data ingest source, you must delete the live table first.

    Live table management #

    Manage live tables from the list of live tables, which shows the following columns:

    • Table name: The name of the table.
    • Schema name: The name of the schema.
    • Status: The status of the live table.

    The total number of tables appears next to the search field. By default, Starburst Galaxy limits an account to five live tables and 40 MB/s of total throughput. To increase your quota limit, contact Starburst Support.

    The following sections detail important management and operational information about live tables.

    Create a live table #

    To create a new live table, click Create new live table, then choose one of the following options in the Connect to external source section:

    • To load a topic from an existing Kafka connection, select a source from the Select existing Kafka connection drop-down menu, then click Next. Follow the steps in Select a target and topic.
    • To connect a new source, click Create new Kafka connection. Follow the steps in Connect to a stream source.

    Start and stop ingestion #

    You can start and stop data ingestion from the live tables list. Stop prevents data from being ingested into the live table. Start resumes ingestion from where it left off.

    Stopping data ingestion for longer than your Kafka topic's data retention period results in ingestion restarting from the earliest point still available in the topic to include missed messages. To prevent missed messages, resume data ingestion before you reach your Kafka retention threshold, or choose a longer data retention period.

    DDL and DML #

    You cannot directly modify the live table definition or delete or update data with SQL statements. See Change columns to learn how to alter columns through the UI.

    To delete data, set a data retention threshold to purge data from the live table at a specified interval. To filter out unwanted data, create a view on the live table with a predicate to filter out the data you do not want, or perform the filtering as part of a data pipeline for the table.
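
    For example, the following sketch assumes a hypothetical live table my_catalog.my_schema.orders_live with a status column, and filters out unwanted rows through a view:

      CREATE VIEW my_catalog.my_schema.orders_filtered AS
      SELECT *
      FROM my_catalog.my_schema.orders_live
      WHERE status <> 'test';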

    If you need to perform DDL and DML operations as you would on any other Iceberg table, you can decommission the live table. Once you decommission the table, you can no longer use it with streaming ingestion.

    Change columns #

    To edit the columns in the live table, click the options menu and select Edit column mapping to go to Map to columns.

    Once you click Save changes, Galaxy automatically performs the Iceberg DDL operations on your live table to add or remove columns. Rows present in the table prior to the column changes have NULL values in the newly added columns. Removed columns are no longer accessible to query.

    Column changes may take 1-2 minutes to become active.

    Data management #

    Galaxy performs the following data management operations on live tables automatically:

    • Compaction: Improves performance by optimizing your data file size.
    • Snapshot expiration: Reduces storage by deleting older data snapshots. This action runs several times per day and expires snapshots older than 30 days.
    • Data retention: Reduces storage by deleting data snapshots according to the retention period you specify.
    • Vacuuming: Reduces storage by deleting orphaned data files.

    Errors table #

    Every live table is associated with an errors table that serves as a dead letter table. When Galaxy cannot parse a message according to the schema, or cannot read a message because it is too large, a row for that message is added to the errors table.

    You can query the errors table the same way you query a live table. The table is hidden and does not show up when running SHOW TABLES. The table name follows the convention: "table_name__raw$errors". You must enclose the name in quotes or the query engine fails to parse the table name.
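
    For example, assuming a hypothetical live table named orders_live in catalog my_catalog and schema my_schema, you can inspect its errors table as follows; the double quotes are required because of the $ character in the name:

      SELECT *
      FROM my_catalog.my_schema."orders_live__raw$errors"
      LIMIT 10;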

    Decommission a live table #

    Decommissioning a live table deletes it from streaming ingestion and stores it as an unmanaged Iceberg table.

    To decommission a live table, follow these steps:

    • Locate the table of interest.
    • Click the options menu, and select Delete live table.
    • In the Delete stream dialog, click Yes, delete.

    Best practices #

    Adhere to the following recommendations to ensure the best results when ingesting data with Kafka.

    Throughput and sizing #

    Galaxy can automatically scale with your throughput requirements. The number of Kafka partitions for a topic determines the number of pipes Starburst uses to ingest the data.

    Pipes are a unit of compute used to ingest the data, and are mapped one-to-one with Kafka partitions.

    You can increase ingestion throughput by adding more Kafka partitions. As more partitions are added, Galaxy automatically increases the number of pipes required to map to the Kafka partitions.

    By default, a pipe has a Kafka read throughput limit of up to 8 MB/s per partition. The read throughput limit lets you adjust the total throughput for a single partition. You can set the read throughput limit per partition to 1 MB/s, 2 MB/s, 4 MB/s, 8 MB/s, or 16 MB/s. To increase the total throughput, raise the read throughput limit or increase the number of Kafka partitions.

    When determining sizing, Starburst recommends keeping the per-partition Kafka throughput at no more than 50% of the pipe read throughput limit. This headroom is not required, but it helps handle spikes and allows backfilling to prevent ingest lag.

    For example, Starburst can ingest data from a topic with 20 partitions at a maximum rate of 160 MB/s when using the default read throughput limit of 8 MB/s per partition. In this case, we recommend configuring your producers so that each partition receives data at no more than 4 MB/s.
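
    The arithmetic behind this example:

      maximum ingest rate            = 20 partitions x 8 MB/s = 160 MB/s
      recommended rate per partition = 50% of 8 MB/s          = 4 MB/s
      recommended topic total        = 20 partitions x 4 MB/s = 80 MB/s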

    Security #

    Any modifications made to the data or metadata files may corrupt the Iceberg table. Starburst Galaxy cannot ingest to or manage data in a corrupted table.

    Recommended: Apply the principle of least privilege to users who are granted permissions to perform operations on data in the S3 bucket where Iceberg tables are stored.

    Recommended: Place Iceberg managed tables in a separate bucket with tighter AWS governance.