The Schema discovery pane at the catalog level of the catalog explorer lets you examine the metadata of a specified object storage location. Schema discovery is available only for catalogs that connect to object storage data sources.
Use schema discovery to identify and register tables or views that are newly added to a known schema location. For example, a logging process might drop a new log file every hour, rolling over from the previous hour’s log file. The purpose of schema discovery is to find the newly added files to make sure Starburst Galaxy knows how to query them.
To use schema discovery successfully, keep the following in mind:
For catalogs that have run discovery before, the Schema discovery tab shows a list of previous runs with the following columns:
View log. Click this link to open the log events pane for that event.
Rerun to run schema discovery on the source again. This option performs a diff on the location and returns any changes found.
To discover newly added tables, click Run schema discovery. This lets you analyze a specified root object in an object store location and return the structure of any discovered tables.
In the Catalog location URL field, enter the URL of the bucket and directory to scan. You can typically find a schema’s URL as the location property in the Definition tab of the schema level of the catalog explorer.
schema/table/<files/partition>. It cannot run on a file. For example, s3://my-s3-bucket/my_csv_file.csv does not work.
A role in your current active role set must have the location privilege for the specified location. This privilege is automatically added by the discovery process if not already present.
Enter the name of a schema in the Set default schema field. This schema is used for newly discovered tables that are not part of an existing schema.
Optionally open the Advanced settings section.
a. Specify one of the following scan types for this schema discovery run:
b. Specify the maximum number of lines to show in sample files, and/or specify the maximum number of files per table.
Click Run discovery.
A successful discovery run opens the Select schemas pane, which shows a list of schemas with the following columns:
The next step is to register the discovered tables with Galaxy. Select one or more schemas, or select a set of tables within a schema that you would like to register. Then click Create all tables or Create selected tables. The table registration process runs for a few moments, then opens the Log events pane to show progress.
When the registration process completes, click Close to return to the main Schema discovery tab.
The Log events pane shows a list of log entries for each discovery-related event. The Summary section shows the number of successful query executions and the number of errors that occurred during the discovery run.
The list of log events includes the following information:
CREATE TABLE, or CREATE SCHEMA. Click the text to view the full query.
Schema discovery identifies the Iceberg, Delta Lake, and Hive table formats supported by Starburst Galaxy’s Great Lakes connectivity. Schema discovery does not identify Hudi tables.
register_table procedure. For Hive tables, schema discovery registers tables using the table metadata.
Schema discovery identifies tables and views that are saved in the following file formats:
JSON
CSV
ORC
PARQUET
Schema discovery identifies tables and views that use the following compression codecs:
ZSTD
LZ4
SNAPPY
GZIP
DEFLATE
BZIP2
LZO
LZOP
Schema discovery locates certain file formats as described on the file formats page.
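To get a rough idea of which files in a location fall under the formats and codecs listed above, you could classify file names locally before running discovery. The sketch below is an illustration only: the function name `classify`, both lookup tables, and the extension conventions are assumptions, and it does not describe how schema discovery itself detects formats. LZO and LZOP are left out because their file extensions vary between tools, and columnar formats such as ORC and PARQUET usually carry their compression internally rather than in the file name.

```python
import posixpath

# Assumed extension conventions, for illustration only.
FORMAT_BY_EXT = {
    ".json": "JSON",
    ".csv": "CSV",
    ".orc": "ORC",
    ".parquet": "PARQUET",
}
CODEC_BY_EXT = {
    ".zst": "ZSTD",
    ".lz4": "LZ4",
    ".snappy": "SNAPPY",
    ".gz": "GZIP",
    ".deflate": "DEFLATE",
    ".bz2": "BZIP2",
}


def classify(file_name: str):
    """Return (file_format, codec) guessed from the file name; either may be None."""
    base, ext = posixpath.splitext(file_name.lower())
    codec = CODEC_BY_EXT.get(ext)
    if codec is not None:
        # Strip the codec suffix and look at the extension underneath it.
        base, ext = posixpath.splitext(base)
    return FORMAT_BY_EXT.get(ext), codec


if __name__ == "__main__":
    for name in ("events.csv.gz", "part-0001.parquet", "notes.txt"):
        print(name, classify(name))
```

A file that yields `None` for the format here is not necessarily unsupported; the guess depends entirely on the naming assumptions above.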