Catalogs overview

Each catalog contains the configuration that Starburst Galaxy needs to access a data source. To query a data source in Galaxy, configure a catalog for it and add that catalog to a cluster.

Locate data sources and clusters in the same cloud provider and region to get the best performance and avoid unnecessary data transfer costs.

Once the catalog is defined and used in a cluster, you can query the data source by accessing the catalog and its nested schemas and tables.
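As a minimal sketch of this catalog, schema, and table nesting, the following Python helper builds a fully qualified table name and a simple query string. The catalog name `tpch`, the schema `tiny`, and the table `nation` are assumptions chosen for illustration; the helper functions are hypothetical and not part of any Starburst Galaxy client. Substitute the names defined in your own catalog.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a quoted three-part name: "catalog"."schema"."table"."""
    parts = (catalog, schema, table)
    # Double any embedded double quote, the standard SQL identifier escape.
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


def count_rows_query(catalog: str, schema: str, table: str) -> str:
    """Build a row-count query against a fully qualified table."""
    return f"SELECT count(*) FROM {qualified_name(catalog, schema, table)}"


# Assumed names for illustration only; use your own catalog, schema, and table.
example = count_rows_query("tpch", "tiny", "nation")
print(example)
```

You can paste the resulting statement into any SQL client connected to a cluster that includes the catalog.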

To manage catalogs, select the Catalogs item in the main navigation. This pane also provides access to the features of the catalog explorer.

Data sources

Starburst Galaxy provides access to many data sources. Configuration for object storage systems, data warehouses, relational databases, and other systems varies by cloud and hosting provider. The following sections provide links to the configuration pages for the data source catalogs supported by Starburst Galaxy.

Object storage

  • Amazon S3, object storage on Amazon Web Services, combined with AWS Glue, your own Hive Metastore Service, or the Starburst Galaxy metastore
  • Azure Data Lake Storage, object storage on Microsoft Azure, combined with your own Hive Metastore Service or the Starburst Galaxy metastore
  • Google Cloud Storage, object storage on Google Cloud, combined with your own Hive Metastore Service or the Starburst Galaxy metastore


Warehouses and databases

  • MySQL, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure
  • PostgreSQL, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure
  • SQL Server, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure
  • Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service
  • Azure Synapse, an analytics service that brings together data integration, enterprise data warehousing, and big data analytics
  • Google BigQuery, a serverless, scalable, cost-effective multicloud data warehouse
  • Snowflake, a cloud-based data platform
  • MongoDB or the MongoDB Atlas data platform


Sample datasets

Starburst Galaxy also provides access to several complete datasets. You can create a catalog for any of these datasets and use it for purposes such as the following:

  • Easy demonstration of Starburst Galaxy features without the need to configure an external data source.
  • Availability of a full dataset to query, learn SQL, and experiment with different clients.
  • Performance and other benchmark tests with well-known data and standardized queries.

The following dataset catalogs are available:

AWS COVID-19 data lake

See the Introductory project tutorials for examples of using this dataset.

Sample dataset

Provides two tables of space mission data.

TPC-H

Provides a set of schemas to support the TPC Benchmark™ H database, which is a benchmark used to measure the performance of highly complex decision support databases.

TPC-DS

Provides a set of schemas to support the TPC Benchmark™ DS database, which is a benchmark used to measure the performance of complex decision support databases.
