Configure and define catalogs #
To take advantage of the query performance available to your data consumers, you need to understand data sources and how they are connected to Starburst Enterprise platform (SEP). The following content provides an overview of data sources and catalogs, and how they work with connectors in SEP.
Definitions #
Before you begin, here are definitions and explanations of the key concepts for data sources, catalogs, and connectors.
Data sources #
A data source is a system from which data is retrieved. You can query systems such as distributed object storage, RDBMSs, NoSQL databases, document databases, and many others.
In SEP, you must connect to a data source before you can query it. To query a source, you create a catalog properties file that defines a catalog.
Catalogs #
Catalogs define and name the configuration to connect to and query a data source. They are a key concept in SEP: without catalogs, there is nothing to query.
The catalog name is defined by the name of the catalog properties file, located in etc/catalog. You can have as many catalogs as you want, and there are no restrictions on naming other than the use of valid characters. For example, a filename of etc/catalog/mydatabase.properties results in the catalog name mydatabase.
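The catalog name is then used to fully qualify schema and table names in queries. As a sketch, assuming the mydatabase catalog contains a schema named public with a table named customers (both hypothetical names), a query references them as follows:
SELECT * FROM mydatabase.public.customers;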
The catalog properties file content defines the access configuration to the data source. Properties are key=value pairs.
Each catalog defines one and only one connector using the required connector.name property. You must select the correct connector for your data source. For example, the PostgreSQL connector is defined by using connector.name=postgresql, and enables the catalog to access a PostgreSQL database. Another example is the Hive connector, defined by connector.name=hive-hadoop2. It can enable a catalog to access Hadoop, Amazon S3, and many other object storage systems.
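For example, a minimal sketch of a Hive catalog file, assuming a hypothetical catalog named datalake and a placeholder metastore address, might look like the following:
# etc/catalog/datalake.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083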
Access a list of catalogs with the SHOW CATALOGS command.
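For example, with the catalogs described above configured, the command returns output similar to the following, including the built-in system catalog:
SHOW CATALOGS;

 Catalog
------------
 datalake
 mydatabase
 system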
Configuration properties #
Each connector has a small set of required properties. While these properties vary by connector, they minimally define the connection to the data source. Depending on your connector and data source, a number of additional configuration properties are available.
Optional properties enable further configuration of the catalog in areas such as security, performance, and query behavior. These connector-specific properties are defined in the documentation for each connector. For more information, start with the list of all connectors.
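Optional properties are added as further key=value lines in the same catalog file. The following sketch assumes a JDBC-based catalog and the case-insensitive-name-matching and metadata caching properties offered by many JDBC-based connectors; confirm their availability in the documentation for your connector:
# optional properties, added alongside the required connection properties
case-insensitive-name-matching=true
metadata.cache-ttl=10m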
Catalog session properties #
You can further customize the behavior of the connector in your catalog using catalog session properties. A session is defined by a specific user accessing SEP with a specific tool such as the CLI. Catalog session properties can control resource usage, enable or disable features, and change query processing.
Most session properties are named similarly to their configuration property counterparts in a catalog file, differing mostly by the use of underscores (_) in the name to be SQL compliant. In certain circumstances, session properties override catalog properties.
You can view current session properties using the SHOW SESSION command. Then implement your session properties using SET SESSION.
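For example, the following sketch assumes the hypothetical datalake Hive catalog from above, and that its connector exposes the insert_existing_partitions_behavior catalog session property:
-- list all current session properties and their values
SHOW SESSION;

-- change connector behavior for the current session only
SET SESSION datalake.insert_existing_partitions_behavior = 'OVERWRITE';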
Connectors #
A connector is specific to the data source it supports. It transforms the underlying data into the SEP concepts of schemas, tables, columns, rows, and data types.
Connectors provide the following between a data source and SEP:
- Secure communications link
- Translation of data types
- Handling of variances in the SQL implementation, adaptation to a provided API, or translation of data in raw files
Connector setup #
Most SEP connectors include their required configuration and dependencies by default, so no additional setup is required.
If you’re using a connector that requires additional setup, such as the addition of a proprietary JDBC driver, you can find that documented with the specific connector.
Create a catalog properties file #
You can create a catalog to access a data source with a few simple steps:
- Create the catalog properties file in etc/catalog/, for example etc/catalog/mydatabase.properties
- Specify the required connector with connector.name= in the file, for example connector.name=postgresql
- Add any other properties required by the connector
- Add any optional properties as desired
- Copy the file onto the coordinator and all worker nodes
- Restart the coordinator and all workers
- Confirm the catalog is available with SHOW CATALOGS;
- List the available schemas with SHOW SCHEMAS FROM mydatabase;
- Start writing and running the desired queries
Many connectors use similar properties in their catalog properties files.
Most JDBC-based connectors require these minimum properties for their catalog files:
connector.name=<connectorname>
connection-url=<connectorprotocol>//<host>:<port>;database=<database>
connection-user=root
connection-password=secret
Examples of connection configuration for different types of connectors include:
# JDBC-based connector
connection-url=jdbc:postgresql://example.net:5432/database
# Endpoint-driven connector
hive.cos.service-config=etc/catalog/cos-service.properties
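Putting these pieces together, a complete catalog file for a hypothetical PostgreSQL database might look like the following; replace the host, port, database name, and credentials with your own values:
# etc/catalog/mydatabase.properties
connector.name=postgresql
connection-url=jdbc:postgresql://example.net:5432/database
connection-user=root
connection-password=secret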