Catalogs overview #
A catalog contains the configuration that allows Starburst Galaxy to access a data source.
To query a data source in Galaxy, configure a catalog for it, and include that catalog in a cluster. Once the catalog is defined and used in a cluster, you can query the data source by accessing the catalog and its nested schemas and tables.
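As a minimal sketch of how the catalog, schema, and table nesting shows up in a query, the following Python snippet builds a fully qualified table name and a simple `SELECT` statement. The catalog, schema, and table names (`sales_lake`, `web`, `orders`) and the helper functions are hypothetical and used only for illustration. Substitute the names from your own catalog.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema, and table into one fully qualified name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


def select_all(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a simple SELECT statement against a table in a catalog."""
    return f"SELECT * FROM {qualified_table(catalog, schema, table)} LIMIT {limit}"


# Hypothetical names, for illustration only.
query = select_all("sales_lake", "web", "orders")
print(query)
```

The resulting statement can be run from any SQL client connected to a cluster that includes the catalog.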
Data sources and clusters must be located in the same cloud provider and region to enable optimal performance and avoid unnecessary data transfer costs.
Access to create and manage catalogs is provided through the Catalogs item in the navigation menu.
Data sources #
Starburst Galaxy provides access to many different data sources. Configuration for object storage systems, data warehouses, relational databases, and other systems varies by cloud and hosting provider. If your data source has secured or locked-down network access, you may need to configure its network to admit one or more of Starburst Galaxy’s outgoing IP blocks as shown on the IP allow list.
The following sections provide links to the configuration pages for the data source catalogs supported by Starburst Galaxy.
Object storage #
- Amazon S3, object storage on Amazon Web Services, combined with AWS Glue, your own Hive Metastore Service, or the Starburst Galaxy metastore. Starburst Warp Speed is available for S3 catalogs to improve performance.
- Azure Data Lake Storage, object storage on Microsoft Azure, combined with your own Hive Metastore Service, or the Starburst Galaxy metastore.
- Google Cloud Storage, object storage on Google Cloud, combined with your own Hive Metastore Service, or the Starburst Galaxy metastore.
- Tabular, a central table store based on Apache Iceberg tables, for AWS S3 data.
Additional data sources #
- Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service.
- Apache Druid, a high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.
- Azure Synapse, an analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
- Elasticsearch, a fast and scalable search and analytics engine.
- Google BigQuery, a serverless, scalable, cost-effective multicloud data warehouse.
- Google Sheets, read-only access to spreadsheets stored on your Google Drive account.
- MariaDB, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
- Microsoft SQL Server, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
- MongoDB or MongoDB Atlas data platform.
- MySQL, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
- PostgreSQL, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
- Salesforce, a cloud-based customer relationship management system.
- Snowflake, a cloud-based data platform.
Coming soon #
Support for the following data sources is expected in upcoming releases of Starburst Galaxy.
Sample datasets #
Starburst Galaxy also provides access to several full datasets. You can create a catalog for any of these datasets and use it for purposes such as the following:
- Easy demonstration of Starburst Galaxy features without the need to configure an external data source.
- Availability of a full dataset to query, learn SQL, and experiment with different clients.
- Performance and other benchmark tests with well-known data and standardized queries.
The following dataset catalogs are available:
- AWS COVID-19 data lake. See the Introductory project tutorials for examples of using this dataset.
- Sample dataset. Provides data in two tables that represent space mission data.
- TPC-H. Provides a set of schemas to support the TPC Benchmark™ H database, which is a benchmark used to measure the performance of highly complex decision support databases.
- TPC-DS. Provides a set of schemas to support the TPC Benchmark™ DS database, which is a benchmark used to measure the performance of complex decision support databases.
Manage catalogs #
You can create, view, and manage catalogs in the View catalogs pane.
This pane also provides the starting point to access the catalog explorer, which allows you to browse through any catalog’s metadata. Click the name of a catalog to enter the explorer for that catalog.
List of catalogs #
The list of catalogs displays the following information about each catalog:
- Name: The name of the catalog.
- Kind: The data source type, such as Amazon S3.
- Description: The description provided for the catalog, if any.
- Cloud: The cloud service provider.
- Region: The cloud service provider region.
- Tags: Any tags assigned to this catalog. Click the plus sign in the Tags column to add or remove tags.
- The options menu, containing further actions.
The default sort order is by Name, alphabetically ascending. Click any column heading to sort the list; click the heading again to reverse the sort order. The up or down arrows show ascending or descending sort order.
Create a catalog #
To create a new catalog, click Create catalog. The new catalog appears in the list of catalogs.
Search catalogs #
Use the search field to narrow the list of catalogs to those that match a search string in the name, kind, or description columns.
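The matching described above can be modeled with a short Python sketch. This is an illustration only, not the product's implementation: the `CatalogRow` type and `search_catalogs` function are hypothetical, and case-insensitive substring matching is an assumption that the source does not confirm.

```python
from dataclasses import dataclass


@dataclass
class CatalogRow:
    """One row of the catalog list, limited to the searchable columns."""
    name: str
    kind: str
    description: str = ""


def search_catalogs(rows: list[CatalogRow], text: str) -> list[CatalogRow]:
    """Keep rows whose name, kind, or description contains the search string.

    Assumption: matching is a case-insensitive substring test.
    """
    needle = text.casefold()
    return [
        row
        for row in rows
        if any(
            needle in field.casefold()
            for field in (row.name, row.kind, row.description)
        )
    ]


# Hypothetical catalog rows, for illustration only.
rows = [
    CatalogRow("sales_lake", "Amazon S3", "Raw sales events"),
    CatalogRow("crm", "Salesforce"),
    CatalogRow("finance_pg", "PostgreSQL", "Finance reporting, sales targets"),
]
matches = [row.name for row in search_catalogs(rows, "sales")]
print(matches)
```

In this sketch, the string `sales` matches the first row by name, the second by kind, and the third by description.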
Edit a catalog #
You can edit a catalog’s configuration details on the Edit catalog pane. To access the editing pane, click the options menu, then Edit configuration. The same editing options are available during the catalog creation process.
Change owner #
To change the owner of the catalog to a different role, click the options menu, then Change owner.
Delete a catalog #
To delete a catalog, click the options menu, then Delete catalog.
Add a catalog to a cluster #
Once you have configured a catalog, add it to a cluster to query the data source. You have the option to add a catalog to a cluster during the catalog configuration process or later using the Edit cluster option.