Materialized views #

Starburst Enterprise platform (SEP) supports materialized views with the Hive and Iceberg connectors. You can use to take a federated query and store the results local to SEP

Like regular views, materialized views standardize complex, analytical queries. With (SEP you can use these materialized with any query, including a federated query that accesses data in multiple catalogs and schemas. They provide a pre-computed result set to query against in physical storage, increasing query performance.

Materialized views created in a Hive catalog may be automatically refreshed, as discussed in this document.

Prerequisites #

Your SEP platform administrator or a data engineer with administrative access to SEP has to first enable materialized views in one or more Hive or Iceberg catalogs. Check with them to learn which catalogs are enabled, and which schemas may be used.

You also must have the necessary access privileges to create data in the schema designated in each enabled catalog.

Create and use a materialized view #

Any federated query that runs successfully in SEP can be used to create a materialized view. Materialized views in SEP are created in the same way as in other data platforms, with a CREATE MATERIALIZED VIEW statement. In the following example, the mysalescatalog has been configured to allow materialized views:

CREATE MATERIALIZED VIEW mysalescatalog.mysalesschema.mv_cust_tot_return AS
    SELECT
      sr_customer_sk ctr_customer_sk,
      sr_store_sk ctr_store_sk,
      sum(sr_return_amt) ctr_total_return
    FROM
    tpcds.sf1.store_returns,
    tpcds.sf1.date_dim
    WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
    GROUP BY sr_customer_sk, sr_store_sk
;

Once a materialized view exists, you can query it like any regular table:

SELECT * FROM mycatalog.mysalesschema.mv_cust_tot_return;

Automatically refresh materialized views in Hive #

Hive catalogs in SEP configured to allow materialized views provide several WITH clause properties to configure refresh schedules and how new data is imported:

  • refresh_interval and cron: Choose one method to specify a refresh frequency.
  • max_import_duration: Specifies how long to allow a refresh to complete before failing.
  • grace_period: Specifies the amount of time in-flight queries can run against an expiring snapshot.
  • incremental_column: Specifies the column to be used to identify new data since the last refresh. If you do not use this field, SEP performs a full refresh.

We suggest that you review our reference documentation, which has more information about these properties.

In the following example, the refresh_interval property is used to automatically refresh the data every 24 hours from the time the CREATE statement initially runs:

CREATE MATERIALIZED VIEW myhive.mysalesschema.mv_cust_tot_return
WITH (
  refresh_interval = '24.0h',
  grace_period = '5.00m',
  max_import_duration = '30.00m'
) AS
    SELECT
      sr_customer_sk ctr_customer_sk,
      sr_store_sk ctr_store_sk,
      sum(sr_return_amt) ctr_total_return
    FROM
    tpcds.sf1.store_returns,
    tpcds.sf1.date_dim
    WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
    GROUP BY sr_customer_sk, sr_store_sk
;

In this example, the refresh runs for a maximum of 30 minutes. Unless the cron property is specified, the time at which data in a materialized view is refreshed is based on the moment the CREATE MATERIALIZED VIEW statement first runs, plus the refresh_interval.

You can run refreshes on a set schedule by using the cron property instead. The cron property uses normal cron expressions. Here is the same materialized view, created with a cron schedule and an incremental column:

CREATE MATERIALIZED VIEW myhive.mysalesschema.mv_cust_tot_return
WITH (
  cron = '30 2 * * *'
  grace_period = '5.00m',
  max_import_duration = '30.00m',
  incremental_column = 'sr_returned_date_sk'
) AS
    SELECT
      sr_customer_sk ctr_customer_sk,
      sr_store_sk ctr_store_sk,
      sum(sr_return_amt) ctr_total_return
    FROM
    tpcds.sf1.store_returns,
    tpcds.sf1.date_dim
    WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year > 2000) )
    GROUP BY sr_customer_sk, sr_store_sk
;

This causes the refresh to execute at 2:30 AM daily, and loads only new data as determined by the sr_returned_date_sk date column.