Data mesh with Starburst #

Starburst Enterprise is the world’s fastest distributed SQL query engine. It lets data consumers query anything, anywhere, and get the data they need in a single query, no matter where it lives. This idea of combining data from disparate sources is called a data mesh. Specifically Starburst Enterprise supports query federation where a query access many different datasources at the same time. These data sources are defined as catalogs and expose data in schemas and tables.

It allows you to combine, for instance, historical data from HDFS or objects stores with the most recent incoming data from Kafka one query.

Federating data is simple. You just need to use the fully-qualified name of the tables in your FROM clause. Table names are fully-qualified when they include the catalog and schema name:

<catalog>.<schema>.<table>

A catalog defines the schemas in a data source such as Snowflake, Oracle and Hive.

Here’s an example of data from two different sources, Hive and MySQL, federated into a single query:

SELECT
    sfm.account_number
FROM
    hive_sales.order_entries.orders oeo
JOIN
    mysql_crm.sf_history.customer_master sfm
ON sfm.account_number = oeo.customer_id
WHERE sfm.sf_industry = `medical` AND oeo.order_total > 300
LIMIT 2;

This query uses data from the following sources:

  • The orders table in the order_entries schema defined in the hive_sales catalog
  • The customer_master table in the sf_history schema defined in the mysql_crm catalog

To help you learn how Starburst uses federated queries in popular analytics tools, here is a handy walk-through of federating queries for rapid visualization in Looker with Starburst Enterprise.