Using object storage systems

Starburst products can query data from object storage systems.

This guide describes object storage fundamentals, requirements, and configuration steps to connect Starburst products to an object storage system.

Object storage fundamentals

Unlike a traditional relational database, an object storage system stores data as a series of files in nested directories. These files can then be queried through data warehouse systems such as Hive. Examples of object storage systems include the following:

  • Apache Hadoop Distributed File System (HDFS)
  • Amazon S3
  • Azure Data Lake Storage (ADLS)
  • Google Cloud Storage (GCS)
  • S3-compatible systems such as MinIO

The data files typically use one of the following supported binary formats:

  • ORC
  • Parquet
  • Avro

Text-based formats such as CSV and JSON can also be used for object storage.

A metastore holds the information about the object storage directory structure, the file formats, and other metadata about the stored data. The most commonly used metastore is the Hive Metastore Service (HMS).

Requirements

All Starburst products use catalogs to connect to data sources. Like relational data sources, object stores require a configured catalog.

A metastore is required to query object storage systems. Without a metastore, a query engine has no way of knowing where and how data is stored within the object storage. The metastore configuration depends on which Starburst product you are using, as detailed in the following sections.

Starburst Enterprise

Starburst Enterprise platform (SEP) supports the following metastores:

  • Hive (HMS)
  • AWS Glue

In the catalog configuration file for an object storage system, you must add configuration properties that describe how to connect to the metastore for that system. If you have multiple catalogs that connect to object storage, each catalog configuration file requires metastore configuration properties.

The following example configuration properties are for an Iceberg catalog that connects to a Hive metastore with the Thrift protocol:
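
A minimal sketch of such a catalog file is given here. The file path and the host `metastore.example.com` are hypothetical placeholders, and 9083 is assumed as the Thrift port. The property names follow the open source Trino Iceberg connector, so confirm them against the SEP Iceberg connector documentation before use.

```properties
# Hypothetical catalog file, for example etc/catalog/example_iceberg.properties
connector.name=iceberg
# Use a Hive metastore as the Iceberg catalog (assumed property name)
iceberg.catalog.type=hive_metastore
# Placeholder host and port of the Hive Metastore Service Thrift endpoint
hive.metastore.uri=thrift://metastore.example.com:9083
```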

Each object storage connector supports different configuration properties for connecting to a metastore. Refer to the connector documentation for more information.
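
To illustrate how the properties can differ, a Hive catalog backed by AWS Glue might be sketched as follows. The file path and region are placeholders, and the property names are assumed from the open source Trino Hive connector, so check the SEP Hive connector documentation for the properties that apply to your version.

```properties
# Hypothetical catalog file, for example etc/catalog/example_hive.properties
connector.name=hive
# Use AWS Glue as the metastore instead of a Thrift Hive metastore
hive.metastore=glue
# Placeholder region; replace with the region of your Glue catalog
hive.metastore.glue.region=us-east-1
```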

Starburst Galaxy

Starburst Galaxy supports the following metastores:

  • For AWS S3, Azure ADLS, and GCS:
    • Starburst Galaxy (built-in metastore)
    • Hive (HMS)
  • Additional support for AWS S3:
    • AWS Glue
    • Databricks Unity Catalog

The connection to a metastore is handled within the Starburst Galaxy interface when you create a catalog that connects to an object storage system. Configuration steps and supported metastores differ by object storage system, so refer to the documentation for each object storage catalog for configuration instructions: