Ingest data from JSON files or CSV files in an AWS S3 bucket location into a managed Iceberg table, also known as a live table.
Live tables are stored in your AWS S3 bucket and can be queried as part of a Starburst Galaxy cluster, or by any query engine that can read Iceberg tables.
This page walks you through configuring a live table in three steps:
Review the guidelines on key configurations and see the prerequisites section for requirements.
Galaxy’s file ingest is supported on Amazon Web Services. You must provide:
A Galaxy cluster located in one of the following AWS regions:
To inquire about support for other regions, Starburst Support.
Galaxy ingests compressed or uncompressed newline deliminated JSON (NDJSON) files.
The files are subject to the following constraints:
To adjust file ingest limits, Starburst Support.
To begin ingesting file data, click Data > Data ingest in the navigation menu.
The Data ingest pane provides an overview of streaming resources, organized into Data sources, Live tables, and Schema registries tabs.
Click Connect new source, then select the Amazon S3 source.
In the Connect new Amazon S3 dialog:
In the Source details section, enter a name for the source and a description.
In the Connection to Amazon S3 section, enter the name of the S3 bucket, and the S3 file prefix.
Live tables created using this source specify an exact location under the prefix. If you are creating multiple file ingest live tables, we recommended choosing a prefix which points to the root of all the files you want to ingest.
To authenticate with an API key/API secret pairing, enter an API key and an API secret in the respective fields.
To authenticate with a cross-account IAM role:
Click Test connection to confirm that you have access to the source. If the test fails, check your entries, correct any errors, and try again.
If the connection is successful, click Save new source
In the Enter source S3 subdirectory section, you have the option to limit ingestion to a subdirectory with your S3 bucket. To do this, enter the source S3 directory.
In the Raw table target section:
DROP statement to remove the existing
table.In the File format section, select either the JSON file format or the CSV file format.
In the Table configuration section:
Set a data Retention threshold: By default, Retain forever is preselected to specify that all data is to be stored in the live table indefinitely. Select a different value to specify how long data is retained before it is automatically purged: 1 day, 7 days, 14 days, 30 days, or 90 days.
Set a Polling frequency to detect new files every 30, 60, 90, or 120 minutes.
Click Test connection to confirm that you have access to the data. If the test fails, check your entries, correct any errors, and try again.
If the connection is successful, click Save raw table.
For the CSV file format selection:
Choose an Escape character: Double quote uses the RFC 4180 standard. Select Backslash if your data was generated with backslash escapes.
In the Table configuration section:
Set a Polling frequency to detect new files every 30, 60, 90, or 120 minutes.
Click Test connection to confirm that you have access to the data. If the test fails, check your entries, correct any errors, and try again.
If the connection is successful, click Save raw table.
Transform your raw data into structured data, then define a schema for your data.
You can create a transform table now or later through the raw table options in Live Table Management.
To create a transform table now, proceed to Transform and schematize data.
In the Transform table target section:
DROP statement to remove the existing
table.In the Table configuration section:
Set a data Retention threshold: By default, Retain forever is preselected to specify that all data is to be stored in the live table indefinitely. Select a different value to specify how long data is retained before it is automatically purged: 1 day, 7 days, 14 days, 30 days, or 90 days.
Choose an Error handling policy: Continue continues the ingestion process and writes any records that fail to parse to the errors table. Pause and notify stops the ingestion process if any parsing errors occur.
Click Validate data. If the test fails, check your entries, correct any errors, and try again.
Click Next.
Starburst Galaxy automatically suggests a schema by inferring from the JSON file. Modify the inferred schema by changing field entries and adding and removing columns.
Use the Edit column mapping panel to map the following columns:
VARCHAR type, specify a SCALAR or JSON VARCHAR
type. For TIMESTAMP and TIMESTAMP WITH TIMEZONE types, specify a
iso601 or unixtime type.Use themore_vertoptions menu at the end of each row to add or remove columns.
Perform the following actions from the header:
refresh Reload detected columns: Restore any altered field entries to the original inferred values.
dashboard_customize Set default type: Apply a
data type to columns without a detected type or override the inferred types
for all columns.
graph_2 Partition and sort columns: Customize partitions and sort columns.
In the Partition columns section, select a Column and Transform type from the respective drop-down menus. Transform types change depending on the column. For the bucket transform type, the number of buckets must be between 2 and 512.
Partition changes are applied to new data immediately, while existing data is updated during the next compaction run.
Partition evolution is not supported.
The Record sample panel shows the message sample used to generate the columns. If your file has no data for Galaxy to infer, you can manually enter a new sample by clicking upload. Type or paste the new sample in the text area then click Load sample.
The Table sample panel previews the mapped table.
To complete the configuration, click Create transform table.
Data ingest begins in approximately 1-2 minutes. You can run SELECT queries on
the live table like you would on any other table.
Starburst Galaxy automatically suggests a schema by inferring from the CSV file.
The View column mapping panel displays mapped data in the following columns:
Use themore_vertoptions menu at the end of each row to add or remove columns.
Perform the following actions from the header:
refresh Reload detected columns: Restore any altered field entries to the original inferred values.
graph_2 Partition and sort columns: Customize partitions and sort columns.
In the Partition columns section, select a Column and Transform type from the respective drop-down menus. Transform types change depending on the column. For the bucket transform type, the number of buckets must be between 2 and 512.
Partitioning of DOUBLE and REAL columns is not supported. Change the
data type to a supported column type.
Partition changes are applied to new data immediately, while existing data is updated during the next compaction run.
Partition evolution is not supported.
The Record sample panel shows the message sample used to generate the columns.
The Table sample panel previews the mapped table.
To complete the configuration, click Create transform table.
Data ingest begins in approximately 1-2 minutes. You can run SELECT queries on
the live table like you would on any other table.
Adhere to the following recommendations to ensure the best results when ingesting file data.
Confirm that the source JSON files contain newline delimited (NDJSON) records before attempting to create a live table schema mapping or before attempting to ingest file data into a live table.
Place each source file in its own folder to ensure proper identification during ingestion.
Verify that the data in the source files was successfully ingested into the live table before purging it. If Galaxy has not ingested a file after several polling intervals have transpired, check the errors table for any detected errors.
Exceeding the 20,000,000 file limit stops the ingestion process. Ensure that the total number of files in the S3 source path does not exceed the maximum limit by deleting files from the S3 source location after they have been successfully ingested.
Any modifications made to the data or metadata files may corrupt the Iceberg table. Starburst Galaxy cannot ingest to or manage data in a corrupted table.
Is the information on this page helpful?
Yes
No