Starburst Galaxy

    Discover new object storage files

    The Schema discovery pane on the catalog level of the catalog explorer lets you examine the metadata of a specified object storage location. Schema discovery is available only for catalogs that connect to object storage data sources.

    Use schema discovery to identify and register tables or views that are newly added to a known schema location. For example, a logging process might drop a new log file every hour, rolling over from the previous hour’s log file. The purpose of schema discovery is to find the newly added files to make sure Starburst Galaxy knows how to query them.
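To make the hourly log example concrete, the following Python sketch shows the general idea of picking out files that arrived after the previous scan. It is only an illustration and not how Starburst Galaxy implements discovery: the `StoredObject` class, the `added_since` function, the paths, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredObject:
    """One entry in an object storage listing (hypothetical shape)."""
    path: str
    modified: datetime


def added_since(listing, last_run):
    """Return paths of objects modified strictly after last_run, sorted by path."""
    return sorted(obj.path for obj in listing if obj.modified > last_run)


# Hypothetical hourly log files; assume the last scan finished at 01:30 UTC.
last_run = datetime(2024, 5, 1, 1, 30, tzinfo=timezone.utc)
listing = [
    StoredObject("logs/hour=00/app.json", datetime(2024, 5, 1, 1, 0, tzinfo=timezone.utc)),
    StoredObject("logs/hour=01/app.json", datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)),
    StoredObject("logs/hour=02/app.json", datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)),
]

print(added_since(listing, last_run))
# ['logs/hour=01/app.json', 'logs/hour=02/app.json']
```

Only the two files written after the assumed last scan are returned; the file from the first hour is skipped because it was already present.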

    Schema discovery requires the catalog’s metastore to have Allow creating external tables enabled.

    Schema discovery is only available to catalogs that support write operations.

    Run schema discovery

    If you are running schema discovery for the first time, click Run schema discovery to analyze a root object in an object storage location and return the structure of any discovered tables. If you have previously performed schema discovery for the specified location, click Run discovery:

    1. In the Catalog location URL field, enter the URL of the bucket and directory to scan.

      A role in your current active role set must have the location privilege for the specified location. For this reason, Add location privilege is pre-selected to automatically grant the location privilege if not already present.

    2. Enter the name of a schema in the Set default schema field. This is a backup schema name in which to place any discovered tables that are not already part of a schema.

    3. Advanced settings include the following scan types for schema discovery:
      • Incremental discovery from last run scans for tables created in the specified schemas since the last schema discovery run. This is the default selection.
      • Full discovery runs a full discovery scan on the specified location, potentially finding tables already registered with Starburst Galaxy.
    4. Optionally, under Advanced settings, select the maximum sample file lines, and the maximum files per table.

    5. Click Run discovery.

    Results for an Incremental discovery from last run populate a list with the following information:

    • Schema: The name of the schema that contains the tables.
    • Tables: The tables added to the schema since the last scan.
    • Partitions: The partitions in the table, if any.
    • Path: The URL of the schema and table.

    Select the tables you would like to register, then click Create selected tables to go to the log events pane.

    Results for Full discovery populate a list with useful information for your discoveries:

    • Source: The source URL for the bucket used for discovery. Click the source to navigate to the discovery results pane.
    • Timestamp: The timestamp when the discovery was run.
    • Status: The current status of the discovery, such as when the discovery was completed, or whether the discovery is in progress.
    • Changes: Displays a summary of the changes made during the discovery run, such as the number of tables created.
    • Log: The Log column shows an entry when schema discovery both succeeds and is applied with Create schemas, Create tables, or Update tables. Click an entry in this column to open the log events pane for that event.
    • Rerun: Click Rerun to run schema discovery on the source again. This option performs a diff on the location and returns any changes found.
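The diff that Rerun performs can be pictured as a comparison of two snapshots of the same location. The Python sketch below is a conceptual illustration under assumed data shapes, not Starburst Galaxy's actual logic: `diff_tables`, the snapshot dictionaries, and the table and column names are hypothetical.

```python
def diff_tables(previous, current):
    """Compare two snapshots that map table name -> tuple of column names."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(t for t in shared if previous[t] != current[t]),
    }


# Hypothetical snapshots from an earlier run and from a rerun.
previous = {"orders": ("id", "total"), "users": ("id", "name")}
current = {
    "orders": ("id", "total", "currency"),
    "users": ("id", "name"),
    "events": ("id", "ts"),
}

print(diff_tables(previous, current))
# {'added': ['events'], 'removed': [], 'changed': ['orders']}
```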

    Log events

    The log events pane lists the log entries for each discovery-related event. The Summary dialog shows the number of successful query executions and the number of errors that occurred during the discovery run.

    The list of log events includes the following information:

    • Status: The outcome of the event. A green checkmark indicates a successful query execution, and a red exclamation mark indicates an error.
    • Timestamp: The timestamp when the event occurred.
    • Query text: The SQL query execution text, such as CREATE TABLE, or CREATE SCHEMA. Click the text to view the full query.
    • Message: A message detailing the log event, such as the successful creation of a schema, or an error message.
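As a rough illustration of how the Summary counts relate to the individual log events, the Python sketch below tallies events by status. The `LogEvent` class mirrors the four columns listed above, but its field names, the `"success"` and `"error"` status values, and the sample entries are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """One row of the log events list (hypothetical field names)."""
    status: str      # assumed values: "success" or "error"
    timestamp: str
    query_text: str
    message: str


def summarize(events):
    """Count successful query executions and errors across the events."""
    counts = Counter(event.status for event in events)
    return {"successful": counts["success"], "errors": counts["error"]}


# Hypothetical events from a single discovery run.
events = [
    LogEvent("success", "2024-05-01T02:00:00Z", "CREATE SCHEMA logs", "Schema created"),
    LogEvent("success", "2024-05-01T02:00:01Z", "CREATE TABLE logs.app (...)", "Table created"),
    LogEvent("error", "2024-05-01T02:00:02Z", "CREATE TABLE logs.audit (...)", "Access denied"),
]

print(summarize(events))
# {'successful': 2, 'errors': 1}
```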

    Discovery results

    The discovery results pane lists tables found from the source during discovery:

    • Schema: The name of the schema that contains the table.
    • Table name: The name of the table.
    • Format: The table’s file format.
    • Changes: A summary of changes made from the discovery run, such as the number of tables created.
    • Results: Click Preview to see a dialog that describes the columns of the table and its configuration options.

    Click Create all tables to navigate to the log events pane and to see each table being created. You can view your discovered schema in the schemas pane.

    Supported formats

    Schema discovery identifies the Iceberg, Delta Lake, and Hive table formats supported by Starburst Galaxy’s Great Lakes connectivity. Schema discovery does not identify Hudi tables.

    Schema discovery identifies tables and views that are saved in the following file formats:

    • JSON
    • CSV
    • ORC
    • PARQUET

    Schema discovery identifies tables and views that use the following compression codecs:

    • ZSTD
    • LZ4
    • SNAPPY
    • GZIP
    • DEFLATE
    • BZIP2
    • LZO
    • LZOP

    Schema discovery locates certain file formats as described on the file formats page.
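For a quick check before running discovery, the two lists above can be encoded as lookup sets. The Python sketch below copies the format and codec names from this page; the `is_listed` helper and its treatment of an unspecified codec as acceptable are assumptions for illustration, not part of Starburst Galaxy.

```python
# Names copied from the file format and compression codec lists on this page.
SUPPORTED_FORMATS = frozenset({"JSON", "CSV", "ORC", "PARQUET"})
SUPPORTED_CODECS = frozenset(
    {"ZSTD", "LZ4", "SNAPPY", "GZIP", "DEFLATE", "BZIP2", "LZO", "LZOP"}
)


def is_listed(file_format, codec=None):
    """Return True if the format, and the codec when given, appear in the lists."""
    if file_format.upper() not in SUPPORTED_FORMATS:
        return False
    # Assumption: a file with no stated codec is treated as acceptable.
    return codec is None or codec.upper() in SUPPORTED_CODECS


print(is_listed("parquet", "snappy"))  # True
print(is_listed("avro"))               # False
print(is_listed("csv", "brotli"))      # False
```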