Catalogs overview

Each catalog contains the configuration that Starburst Galaxy needs to access a data source. Configure catalogs and use them in clusters to query those data sources.

Data sources and clusters need to be located in the same cloud provider and region to enable optimal performance and avoid unnecessary data transfer costs.

Once the catalog is defined and used in a cluster, you can query the data source by accessing the catalog and the nested schemas and tables.
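The nesting of catalog, schema, and table can be illustrated with a minimal Python sketch that assembles a fully qualified table name and a query string. The catalog, schema, and table names (`sales_pg`, `public`, `orders`) and both helper functions are hypothetical and used for illustration only. The sketch assumes standard SQL double-quoted identifiers, and it only builds the query text; it does not connect to a cluster.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Return the three-part name in the form catalog.schema.table."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Hypothetical names, not taken from any real configuration.
table = qualified_table("sales_pg", "public", "orders")
query = f"SELECT * FROM {table} LIMIT 10"
print(query)  # SELECT * FROM "sales_pg"."public"."orders" LIMIT 10
```

The resulting statement can then be run from whichever SQL client is connected to the cluster that uses the catalog.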

Data sources

Starburst Galaxy provides access to many different data sources. Configuration for these object storage systems, relational databases, and other systems varies by cloud and hosting provider. The following sections describe how to configure these data sources so they can be used in catalogs.

  • Amazon S3, object storage on Amazon Web Services, combined with AWS Glue, your own Hive Metastore Service, or the Starburst Galaxy metastore
  • Azure Data Lake Storage, object storage on Microsoft Azure, combined with your own Hive Metastore Service or the Starburst Galaxy metastore
  • Google Cloud Storage, object storage on Google Cloud, combined with your own Hive Metastore Service or the Starburst Galaxy metastore
  • MySQL, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure
  • PostgreSQL, relational database in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure

Datasets

Starburst Galaxy also provides access to several full datasets. You can create a catalog from a dataset and use it for several purposes:

  • Easy demonstration of Starburst Galaxy features without the need to configure an external data source.
  • Availability of a full dataset to query, learn SQL, and experiment with different clients.
  • Performance and other benchmark tests with well-known data and standardized queries.

The following dataset catalogs are available: