Schema discovery#

Schema discovery identifies and registers tables and views that are newly added to a known schema location. For example, when a logging process creates a new log file every hour, rolling over from the previous hour’s log file, schema discovery locates the newly added files so that SEP can query them.

Configuration#

Schema discovery requires SEP’s built-in access control authorization. The feature is enabled by default when BIAC is configured, but can be disabled by setting starburst.schema-discovery.enabled to false.

If using Iceberg or Delta Lake catalogs, view the Iceberg and Delta Lake catalog configuration section.

Warning

Registering a table in different catalogs can lead to inconsistencies in the table.

Iceberg and Delta Lake catalog configuration#

The following tables list additional catalog configuration properties for the Iceberg and Delta Lake connectors. The iceberg.register-table-procedure.enabled and delta.register-table-procedure.enabled properties are required to enable schema discovery for the respective catalogs:

Iceberg schema discovery configuration properties#
Property	Description
`iceberg.register-table-procedure.enabled`	Set to `true` to allow user to call the `register_table` procedure. Required to enable schema discovery on Iceberg tables.
`iceberg.schema-discovery.concurrency`	Maximum number of parallel jobs used for scanning files by schema discovery. Default is `8`.
`iceberg.schema-discovery.max-buckets`	Maximum number of scanned buckets. Default is `10`.

Delta Lake schema discovery configuration properties#
Property	Description
`delta.register-table-procedure.enabled`	Set to `true` to allow user to call the `register_table` procedure. Required to enable schema discovery on Delta Lake tables.
`delta.schema-discovery.concurrency`	Maximum number of parallel jobs used for scanning files by schema discovery. Default is `8`.
`delta.schema-discovery.max-buckets`	Maximum number of scanned buckets. Default is `10`.

Built-in access control configuration#

If you are using SEP’s built-in access control, there are some additional configuration requirements.

The user executing the schema discovery run must have:

A user interface grant for the Schema Discovery UI feature.
A CREATE grant on the target catalog.
SELECT grants on the schema_discovery.discovery and schema_discovery.shallow_discovery tables. Alternatively, you can grant the entire catalog or just the schema_discovery schema.
A location grant for the source file system.
For Iceberg and Delta Lake catalogs there must be an EXECUTE privilege on the <catalog>.system.register_table procedure. The CREATE privilege is still needed in order to create the target schema.

Identifier constraints#

You can use the schema-discovery.identifier-constraint configuration property to adjust how table names are read by SEP during schema discovery. The property can be set to one of the following values:

ENFORCED_ALPHANUMERIC: Converts input to a lowercase, alphanumeric identifier using only basic ASCII Latin characters. A leading underscore is permitted. Normalization Form Compatibility Decomposition is applied to any character that is not a valid basic ASCII Latin character. Spaces and dashes are converted to underscores. The result matches a regular expression [a-z_][a-z0-9_]*. For example ấ ê ŏ õ ô ì produces the a_e_o_o_o_i identifier.
VALID_IN_TRINO: If a lowercase name matches a valid Trino identifier, it is used as the identifier during schema discovery.
VALID_IN_HIVE_AND_TRINO: Normalizes table names to match Hive metastore with the metastore.support.special.characters.tablename flag set to true.

The schema-discovery.identifier-constraint configuration property is set to ENFORCED_ALPHANUMERIC by default.

List of discoveries#

For catalogs that have run discovery before, the Query > Schema discovery page shows a list of previous runs with the following columns:

Source: The source URL for the bucket used for discovery. Click the source name to open the discovery results pane.
Timestamp The timestamp when the discovery was run.
Status: The current status of each discovery instance, such as the last successful discovery run, or whether the discovery is in progress.
Changes: Displays a summary of the changes made during the discovery run, such as the number of tables created.
Logs: Click to open the log events pane for that event.
Rerun: Click to run schema discovery on the source again. This option performs a diff on the location and returns any changes found.

Run schema discovery#

Run schema discovery to analyze a schema in an object storage location and return the structure of any discovered tables using a system table or a table function in that location.

Run in the UI#

To run schema discovery using the Starburst Enterprise web UI:

Click Query > Query editor in the left-hand navigation.
Use the ︙ options menu for the target entity in the cluster explorer and click Run discovery.
In the Catalog location URL field, enter the URL of the bucket and directory to scan.

Note

Schema discovery follows the pattern of schema/table/<files/partition>. It cannot run on a file. For example, s3://my-s3-bucket/my_csv_file.csv does not work.

Enter the name of a schema in the Default schema field. This is designated for newly discovered tables that are not part of an existing schema.
Optionally open the Advanced settings section:

Specify one of the following scan types for this schema discovery run:
- Incremental discovery from last run scans for tables created in the specified schemas since the last schema discovery run. This is the default selection.
- Full discovery runs a full discovery scan on the specified location, potentially finding tables already registered with SEP.
Specify the maximum number of lines to show in sample files, and/or specify the maximum number of files per table.
Click Run discovery.

Run with SQL#

To run schema discovery using a system table, specify the system table in the FROM clause, and the URI and schema in the WHERE clause, along with any options as shown in the following statement:

SELECT sql FROM hive.schema_discovery.discovery
WHERE uri = 'hdfs://hadoop-master:9000/user/hive/warehouse/customer_orders'
AND schema = 'starburst_demo'
AND options = 'sampleFilesPerTableModulo=1,excludePatterns=**/_SUCCESS'

To run schema discovery using a table function, you must call the schema_discovery.discovery function in the FROM clause, and specify the URI and schema along with any options as follows:

SELECT sql FROM TABLE(schema_discovery.discovery(uri => 's3://test-bucket/path/to/discoverable/table',
schema => 'demo_schema', options => 'sampleFilesPerTableModulo=1,excludePatterns=**/_SUCCESS'))

The following sections describe the available options.

Incremental discovery#

To run incremental discovery in SQL mode, include the previous_metadata_json parameter in your SQL statement. Use the metadata_json value from a previous discovery run as a value for this parameter.

SELECT sql FROM hive.schema_discovery.discovery
WHERE uri = 'hdfs://hadoop-master:9000/user/hive/warehouse/customer_orders'
AND schema = 'starburst_demo'
AND previous_metadata_json = '[metadata_json from previous run]'

Note

You may overwrite any custom modifications you made to schemas or tables since the last discovery run. Your modifications may also cause incremental discovery to fail if they conflict with the table structures from previous discovery runs.

Multiple tables#

When a directory contains multiple tables, schema discovery:

Scans the directory structure
Locates folders with metadata that indicates Iceberg and Delta Lake tables
Locates directory partitions that indicate Hive tables, considering the first non-partition directory up the tree as the table root
Scans files
Creates queryable tables

Schema discovery detects:

Multiple tables under a single schema
Multiple schemas, when you specify a scan location above the level of individual schemas in the directory hierarchy

Column type conversion#

When combining schemas, schema discovery attempts to match column values to the most appropriate Trino data type. If discovery does not locate a compatible type, it defaults to VARCHAR.

Default options#

Schema discovery is run by default with the following options and their default values. Specify a different value as shown in the previous examples to override the default values.

Option	Default value
`dateFormat`	yyyy-MM-dd
`positiveInf`	Inf
`generatedHeadersFormat`	COL%d
`nanValue`	NaN
`locale`	US
`lineSeparator`
`nullValue`
`complexHadoop`	false
`sampleLinesModulo`	3
`maxSampleFilesPerTable`	5
`skipFileExtensionsCheck`	false
`quote`	“
`includePatterns`	**
`delimiter`	,
`escape`	\
`headers`	true
`ignoreTrailingWhiteSpace`	false
`discoveryMode`	NORMAL
`maxSampleLines`	10
`excludePatterns`	.*
`sampleFilesPerTableModulo`	3
`partitionProjection`	false
`negativeInf`	-Inf
`encoding`	UTF-8
`inferJsonStringDecimal`	false
`supportBuckets`	false
`ignoreLeadingWhiteSpace`	false
`maxSampleTables`	2147483647
`hiveSkipHiddenFiles`	false
`timestampFormat`	yyyy-MM-dd HH:mm:ss[.SSSSSSS]
`comment`	`#`

Standard options#

The following options are set at runtime, and apply to all schema discovery queries:

Option	Description
`encoding`	Charset to use when reading files.
`locale`	Locale for date parsing.
`dateFormat`	Date format pattern.
`timestampFormat`	Time format pattern.
`nanValue`	Value to use for not-a-number.
`positiveInf`	Value to use for positive infinity.
`negativeInf`	Value to use for negative infinity.
`includePatterns`	HDFS include GLOB, split by.
`excludePatterns`	HDFS exclude GLOB, split by.
`maxSampleLines`	Max lines to sample from each file.
`sampleLinesModulo`	Each SAMPLE_LINES_MODULO line is sampled. i.e. if 3, every 3rd line is sampled.
`sampleFilesPerTableModulo`	Each SAMPLE_FILES_PER_TABLE_MODULO file is sampled. i.e. if 3, every 3rd file is sampled.
`maxSampleTables`	Max Tables to discovery.
`maxSampleFilesPerTable`	Max files per table to sample.
`supportBuckets`	If “true” attempt to infer buckets.
`forceTableFormat`	Optional = force the table format [JSON, CSV, ORC, PARQUET, ICEBERG, DELTA_LAKE, ERROR].
`discoveryMode`	Discovery mode, NORMAL - default discovery mode, directories in top level tables [RECURSIVE_DIRECTORIES, NORMAL].
`partitionProjection`	If “true” attempt to infer partition projection.
`skipFileExtensionsCheck`	Whether to skip looking at file extension when discovering file format. Use in case of mismatched file format/extension.
`inferJsonStringDecimal`	Try to parse string json values into Trino’s DECIMAL type.
`hiveSkipHiddenFiles`	If true, skips files and directories whose names begin with `_` or `.`

Note

Some options do not apply to file formats with embedded schema information, such as Parquet. These options are primarily useful for JSON and CSV files.

CSV and text file Options#

The following options are set at runtime, and apply only to schema discovery queries on CSV and text files:

Option	Description
`headers`	If “true” treat first line as columns names.
`generatedHeadersFormat`	Pattern to use if column names are auto-generated.
`delimiter`	Text file delimiter.
`quote`	Text file quote.
`escape`	Text file escape.
`comment`	Text file comment.
`nullValue`	Text file null value.
`ignoreLeadingWhiteSpace`	If “true” ignore leading white space.
`ignoreTrailingWhiteSpace`	If “true” ignore trailing white space.
`complexHadoop`	If “true” try to discover arrays, structs and maps.
`lineSeparator`	Text line separator.

Examine schema discovery run#

A successful discovery run opens the Select schemas pane, which shows a list of schemas with the following columns:

Schema: The name of the schema that contains the discovered tables.
Tables: The number of tables added to the schema since the last scan.
Partitions: The number of partitions in the table, if any.
Path: The URL of the schema and table.

The next step is to register the discovered tables with SEP. Select one or more schemas, or select a set of tables within a schema that you would like to register. Then click Create all tables or Create selected tables. The table registration process runs for a few moments, then opens the log events pane to show progress.

When the registration process completes, click Close to return to the Schema discovery page.

Log events#

The Log events pane shows a list of log entries for each discovery-related event. The Summary section shows the number of successful query executions and the number of errors that occurred during the discovery run.

The list of log events includes the following information:

Status: The outcome of the event. A green checkmark indicates a successful query execution. A red exclamation mark indicates an error.
Timestamp: The timestamp when the event occurred.
Query text: The SQL query execution text, such as CREATE TABLE or CREATE SCHEMA. Click the text to view the full query.
Message: A message detailing the log event, such as the successful creation of a schema, or an error message.

Supported formats#

Schema discovery supports creating tables in the Hive, Iceberg, and Delta Lake table formats. It does not support Hudi tables.

Caution

Schema discovery can detect all supported table formats, but it only allows creating tables that are supported by the underlining catalog connector. If a table format is not supported, execution of statements fail.

Schema discovery identifies tables and views that are stored in the following file formats:

JSON in Hive-JSON-SerDe format with one JSON object per line and no line termination characters separating the objects.
CSV
ORC
PARQUET

Schema discovery identifies tables and views that use the following compression codecs:

ZSTD
LZ4
SNAPPY
GZIP
DEFLATE
BZIP2
LZO
LZOP

Note

Schema discovery identifies compression codecs by file extension only. Ensure that your compression codec ends with the proper file extension.

Limitations and warnings#

Only JSON, CSV, ORC, and PARQUET file formats are supported.
Only ZSTD, LZ4, SNAPPY, GZIP, DEFLATE, BZIP2, LZO, and LZOP compression codes are supported.