Data sources and catalogs #

Trino, Starburst Enterprise platform (SEP), and Starburst Galaxy have a common architecture and share core concepts.

The following sections explain the main concepts about the data sources and how you can query them as catalogs using SQL.

Data sources #

Your organization likely has a multitude of data sources. A data source is a system where data is retrieved from, such as distributed object storage using HDFS/Hive, RDBMSs like PostgreSQL or Oracle, NoSQL databases and document databases like Elasticsearch, and systems such as Redshift or Kafka and many others. Without Trino, you have to connect to each separately to query the data.

Catalogs #

A catalog is the configuration that enables access to a specific data sources. Every cluster can have numerous catalogs configured, and therefore allow access to many data sources.

List all configured and available catalogs with the SQL statement SHOW CATALOGS in the Trino CLI or any other client:

SHOW CATALOGS;
 Catalog
---------
 hive_sales
 mysql_crm
(2 rows)

The query editor and other client tools also display a list of catalogs.

Typically data engineer creates catalogs.

Connectors #

A connector is specific to the data source it supports. It transforms the underlying data into the SQL concepts of schemas, tables, columns, rows, and data types.

Connectors provide the following between a data source and Starburst Enterprise or Starburst Galaxy:

  • Secure communications link
  • Translation of data types
  • Handling of variances in the SQL implementation, adaption to a provided API, or translation of data in raw files

Every catalog uses a specific connector. Connectors are built-in features.

Schemas #

Every catalog includes one or more schemas. They group together objects. Schemas are often equivalent to a specific database or schema in the underlying data source.

List all available schema in a specific catalog with the SQL statement SHOW SCHEMAS in the Trino CLI or any other client:

SHOW SCHEMA FROM exmaplecatalog;

Objects #

Every schema includes one or more objects. Typically these objects are tables.

List all available tables in a specific schema with the SQL statement SHOW TABLES in the Trino CLI or any other client:

SHOW TABLES FROM examplecatalog.exampleschema;

Some catalogs also support views and materialized views as objects.

More information about a table is available with the SQL statement SHOW COLUMNS:

SHOW COLUMNS FROM examplecatalog.exampleschema.exampletable;

This information includes the columns in the table, the data type of the columns and other information.

Context #

The default context for any SQL statement is the catalog level. As a result any query to access a table needs to specific the catalog, schema and table establishing a fully-qualified name.

SELECT * FROM <catalog>.<schema>.<object>

This allows identical table names in the underlying data sources to be addressed specifically. The following to queries access tables of the same name in completely separate data sources:

SELECT * FROM sales.apac.customer;
SELECT * FROM marketing.americas.users;

You can quote the catalog, schema, and table names if any names require escaping. For example, the dash character is not allowed in SQL, but some underlying data sources allow the character for objects that are exposed as schemas or objects.

SELECT * FROM "sales"."web-traffic"."page-views";

Set a specific schema or catalog as context with the USE COLUMNS statement. This allows you to omit the catalog and schema from subsequent queries:

USE sales.default;
SELECT * FROM customer;

Querying from multiple catalogs #

Starburst Enterprise and Starburst Galaxy let data consumers query anything, anywhere, and get the data they need in a single query. Specifically, they support queries that combine data from many different data sources at the same time.

Fully-qualified object names are critical when querying from multiple sources:

SELECT * FROM <catalog>.<schema>.<object>;

Here’s an example of data from two different sources, Hive and MySQL, combined into a single query:

SELECT
    sfm.account_number
FROM
    hive_sales.order_entries.orders oeo
JOIN
    mysql_crm.sf_history.customer_master sfm
ON sfm.account_number = oeo.customer_id
WHERE sfm.sf_industry = `medical` AND oeo.order_total > 300
LIMIT 2;

This query uses data from the following sources:

  • The orders table in the order_entries schema, which is defined in the hive_sales catalog
  • The customer_master table in the sf_history schema, which is defined in the mysql_crm catalog

Catalog session properties #

Users can customize the behavior of queries from a catalog using catalog session properties. A session is defined by a specific user accessing a cluster with a specific tool such as the CLI. Catalog session properties can control resource usage, enable or disable features, and change query processing.

Most of the session properties are named similar to their configuration properties counterparts in a catalog file, mostly differing by the use of underscores (_) in the name instead of dashes (-) to be SQL-compliant. These configuration properties are found in the catalog’s connector documentation. Session properties override catalog properties.

You can view current session properties using the SHOW SESSION command. Then implement your session properties using SET SESSION.