Configure and define catalogs #

You need to understand data sources and how they are connected to Starburst Enterprise platform (SEP), to take advantage of the query performance available to your data consumers. The following content provides an overview of data sources and catalogs and how they work with connectors in SEP.

Definitions #

Before you begin, here are definitions and explanations the key concepts of for data sources, catalogs, and connectors.

Data sources #

A data source is a system where data is retrieved from. You can query systems such as distributed object storage, RBDMSs, NoSQL databases, document databases, and many others.

In SEP, you must connect to a data source so you can query from that source. To query those sources, you need to create a catalog properties file to define a catalog.

Catalogs #

Catalogs define and name the configuration to connect to and query a data source. They are a key concept in SEP.

Without catalogs there is nothing to query.

The catalog name is defined by the name of the catalog properties file, located in etc/catalog. You can have as many catalogs as you want, and there are no restrictions for naming outside of valid character types. For example, a filename of etc/catalog/mydatabase.properties results in the catalog name mydatabase.

The catalog properties file content defines the access configuration to the data source. Properties are key=value pairs.

Each catalog defines one and only one connector using the required connector.name property. You must select the correct connector for your data source. For example, the PostgreSQL connector is defined by using connector.name=postgresql, and enables the catalog to access a PostgreSQL database. Another example is the Hive connector defined by connector.name=hive-hadoop2. It can enable a catalog to access Hadoop, Amazon S3 and many other object storage systems.

Access a list of catalogs with the SHOW CATALOGS command.

Configuration properties #

Each connector has a small set of required properties. While properties can vary, they minimally define the connection to the data source. Depending on your connector and data source, there are a number of different configuration properties available.

Optional properties enable further configuration of the catalog in areas such as security, performance, and query behavior. These connector-specific properties are defined in the documentation for each connector. For more information on connector-specific configuration properties, start with the list of all connectors.

Catalog session properties #

You can further customize the behavior of the connector in your catalog using catalog session properties. A session is defined by a specific user accessing SEP with a specific tool such as the CLI. Catalog session properties can control resource usage, enable or disable features, and change query processing.

Most of the session properties are similarly named to their config properties counterparts in a catalog file, mostly differing by the use of underscores (_) in the name to be SQL compliant. Session properties override catalog properties in certain circumstances.

You can view current session properties using the SHOW SESSION command. Then implement your session properties using SET SESSION.

Connectors #

A connector is specific to the data source it supports. It transforms the underlying data into the SEP concepts of schemas, tables, columns, rows, and data types.

Connectors provide the following between a data source and SEP:

  • Secure communications link
  • Translation of data types
  • Handling of variances in the SQL implementation, adaption to a provided API, or translation of data in raw files

Setup of connector #

Most SEP connectors include their configurations and anything else you might need by default, and therefore no setup is required.

If you’re using a connector that requires additional set up, such as the addition of a proprietary JDBC driver, you find that documented with the specific connector.

Create a catalog properties file #

You can create catalog to access a data source with a few simple steps:

  • Create the catalog properties file in etc/catalog/, for example etc/catalog/mydatabase.properties
  • Specify the required connector with connector.name= in file, for example connector.name=postgresql
  • Add any other properties required by the connector
  • Add any optional properties as desired
  • Copy the file onto the coordinator and all worker nodes
  • Restart the coordinator and all workers
  • Confirm the catalog is available with SHOW CATALOGS;
  • List the available schemas with SHOW SCHEMAS FROM mydatabase;
  • Start writing and running the desired queries

Many connectors use the similar properties in catalog properties files.

Most JDBC-based connectors require these minimum properties for their catalog files:

connector.name=[connectorname]
connection-url=[connectorprotocol]//<host>:<port>;database=<database>
connection-user=root
connection-password=secret

Examples of connector protocols include:

# JDBC-based connector
connection-url=jdbc:postgresql://example.net:5432/database

# Endpoint-driven connector
hive.cos.service-config=etc/catalog/cos-service.properties