CTE reuse#

CTE reuse optimizes queries by identifying common table expressions and subqueries that appear multiple times within a query and executing them only once.

When you enable CTE reuse, Starburst Enterprise platform (SEP) executes matching query fragments once, saves the intermediate results, and reuses them at different points in the query plan. CTE reuse also detects similar but non-identical query fragments.

CTE reuse operates within the scope of a single SQL query. Results are not reused across different queries.

Note

CTE reuse is available as a public preview feature. Contact your Starburst account team with questions or feedback.

Configuration#

To enable CTE reuse globally, add the following property to your coordinator’s config.properties file:

optimizer.reuse-common-subqueries=true

To enable CTE reuse on a per-query basis, use the reuse_common_subqueries session property:

SET SESSION reuse_common_subqueries = true;

The following object storage connectors support CTE reuse:

The following non-object storage connectors support CTE reuse:

CTE supports most basic query operators, including:

  • Table scans, including filters and projections

  • Aggregations

  • Joins

  • Unions

Exchange manager#

CTE reuse requires an exchange manager to transfer intermediate results between query stages. Use one of the following options:

Embedded buffer service#

The embedded buffer service runs within your SEP cluster and temporarily stores intermediate query results in memory. When the buffer service runs out of memory, it writes data to object storage.

Enable the embedded buffer service#

To enable the embedded buffer service, add the following property to the config.properties file on all nodes, including the coordinator:

embedded-buffer-service-enabled=true

Configure the exchange manager to use the embedded buffer service by adding the following properties to the exchange-manager.properties file on all nodes:

exchange-manager.name=buffer
exchange.use-embedded-buffer-service=true

Configure buffer service memory#

To control how much memory the buffer service uses, configure the following properties in the config.properties files on worker nodes:

Embedded buffer service memory configuration properties#

Property name

Description

Default

buffer.memory.base

The total memory assigned to the buffer service subprocess as a fraction of heap memory. Do not set higher than 30%.

10%

buffer.memory.chunks

The memory reserved for storing intermediate result data as a fraction of heap memory. Set this value approximately 5 percentage points lower than buffer.memory.base.

8%

Configure chunks spooling#

When the embedded buffer service runs out of memory, it writes data to an external object storage location. Configure this location in the config.properties on worker nodes using the buffer.spooling.directory and the appropriate storage-specific properties.

Note

Ensure workers have direct read access to the spooling location. Workers inherit the object storage configuration from the embedded buffer service.

AWS S3#

To configure AWS S3 as the spooling location, use the following configuration properties:

S3 spooling configuration properties#

Property name

Description

buffer.spooling.directory

The object storage location for spooled data. Use an s3:// URI.

buffer.spooling.s3.aws-access-key

The access key for the S3 API.

buffer.spooling.s3.aws-secret-key

The secret key for the S3 API.

buffer.spooling.s3.region

The AWS region of the S3 spooling bucket.

buffer.spooling.s3.endpoint

The S3 API endpoint. Required when using an S3-compatible service such as MinIO.

Google Cloud Storage#

To configure Google Cloud Storage as the spooling location, use the following configuration properties:

GCS spooling configuration properties#

Property name

Description

buffer.spooling.directory

The object storage location for spooled data. Use a gs:// URI.

buffer.spooling.s3.endpoint

The S3 API endpoint. Set to https://storage.googleapis.com.

buffer.spooling.s3.aws-access-key

The GCS HMAC access key.

buffer.spooling.s3.aws-secret-key

The GCS HMAC secret key.

buffer.spooling.gcs.json-key

The GCS service account credentials in JSON format.

Azure Blob Storage#

To configure Azure Blob Storage as the spooling location, use the following configuration properties:

Azure Blob Storage spooling configuration properties#

Property name

Description

buffer.spooling.directory

The object storage location for spooled data. Use an abfs:// URI.

buffer.spooling.azure.connection-string

The Azure storage connection string, including credentials.

Examples#

The following query defines a CTE named emea and references it twice: once in the main query and once in a subquery. With CTE reuse enabled, SEP computes the emea result set once and reuses it for both references.

WITH emea AS (
    SELECT country, price
    FROM transactions
    WHERE region = 'EMEA'
)
SELECT country, COUNT(*), SUM(price)
FROM emea
WHERE price > (SELECT avg(price) FROM emea)
GROUP BY country;

CTE reuse also optimizes repeated subqueries that are not explicitly defined as CTEs. In the following query, the same subquery appears in both branches of the UNION ALL:

SELECT min(name)
FROM nation
WHERE regionkey > 2
GROUP BY regionkey
UNION ALL
SELECT name
FROM region
WHERE name > ALL (
    SELECT max(name)
    FROM nation
    WHERE regionkey > 2
    GROUP BY regionkey
);