CTE reuse#

CTE reuse optimizes queries by identifying common table expressions and subqueries that appear multiple times within a query and executing them only once.

When you enable CTE reuse, Starburst Enterprise platform (SEP) executes matching query fragments once, saves the intermediate results, and reuses them at different points in the query plan. CTE reuse also detects similar but non-identical query fragments.

CTE reuse operates within the scope of a single SQL query. Results are not reused across different queries.

Note

CTE reuse is a best effort optimization and Starburst does not guarantee that it will merge and reuse all table scans, common blocks, or subqueries for every query.

Supported connectors and operators#

The following object storage connectors support CTE reuse:

The following non-object storage connectors support CTE reuse:

CTE reuse supports most basic query operators, including:

  • Table scans, including filters and projections

  • Aggregations

  • Joins

  • Unions

Configuration#

To enable CTE reuse globally, add the following property to your coordinator’s config.properties file:

optimizer.reuse-common-subqueries=true

To enable CTE reuse on a per-query basis, use the reuse_common_subqueries session property:

SET SESSION reuse_common_subqueries = true;

Exchange manager#

CTE reuse requires an exchange manager to transfer intermediate results between query stages. Use one of the following options:

Embedded buffer service#

The embedded buffer service runs within your SEP cluster and temporarily stores intermediate query results in memory. When the buffer service runs out of memory, it writes data to object storage.

Enable the embedded buffer service#

To enable the embedded buffer service, add the following property to the config.properties file on all nodes, including the coordinator:

embedded-buffer-service-enabled=true

Configure the exchange manager to use the embedded buffer service by adding the following properties to the exchange-manager.properties file on all nodes:

exchange-manager.name=buffer
exchange.use-embedded-buffer-service=true

Configure buffer service memory#

To control how much memory the buffer service uses, configure the following property in the config.properties files on worker nodes:

Embedded buffer service memory configuration properties#

Property name

Description

Default

buffer.memory.base

The total memory assigned to the buffer service subprocess as a fraction of heap memory. Do not set higher than 30%.

10%

Configure chunks spooling#

When the embedded buffer service runs out of memory, it writes data to an external object storage location. Configure this location in the config.properties on coordinator and worker nodes using the buffer.spooling.directory and the appropriate storage-specific properties.

Note

Ensure all nodes (both coordinator and workers) have direct read access to the spooling location. Workers inherit the object storage configuration from the embedded buffer service.

AWS S3#

To configure AWS S3 as the spooling location, use the following configuration properties:

S3 spooling configuration properties#

Property name

Description

buffer.spooling.directory

The object storage location for spooled data. Use an s3:// URI.

buffer.spooling.s3.aws-access-key

The access key for the S3 API.

buffer.spooling.s3.aws-secret-key

The secret key for the S3 API.

buffer.spooling.s3.region

The AWS region of the S3 spooling bucket.

buffer.spooling.s3.endpoint

The S3 API endpoint. Required when using an S3-compatible service such as MinIO.

Google Cloud Storage#

To configure Google Cloud Storage as the spooling location, use the following configuration properties:

GCS spooling configuration properties#

Property name

Description

buffer.spooling.directory

The object storage location for spooled data. Use a gs:// URI.

buffer.spooling.s3.aws-access-key

The GCS HMAC access key.

buffer.spooling.s3.aws-secret-key

The GCS HMAC secret key.

buffer.spooling.s3.endpoint

The S3 API endpoint. Set to https://storage.googleapis.com.

buffer.spooling.gcs.json-key

The GCS service account credentials in JSON format.

buffer.spooling.gcs.json-key-file-path

Path to a JSON file containing the GCS service account credentials.

Azure Blob Storage#

To configure Azure Blob Storage as the spooling location, use the following configuration properties:

buffer.spooling.directory=abfs://<FILESYSTEM_CONTAINER>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<SPOOL_PATH>
buffer.spooling.azure.connection-string=DefaultEndpointsProtocol=https;AccountName=<STORAGE_ACCOUNT_NAME>;AccountKey=<STORAGE_ACCOUNT_KEY>
Azure Blob Storage spooling configuration properties#

Property name

Description

buffer.spooling.directory

The object storage location for spooled data. Use an abfs:// URI.

buffer.spooling.azure.connection-string

The Azure storage connection string, including credentials.

Examples#

The following query defines a CTE named urgent that aggregates order counts and total prices for urgent orders, grouped by customer and clerk. The main query references urgent twice: once to filter results using a subquery that computes the average price from urgent, and once to aggregate the filtered results by clerk. With CTE reuse enabled, SEP computes the urgent result set once and reuses it for both references.

WITH urgent AS (
     SELECT custkey, clerk, COUNT(*) AS order_count, SUM(totalprice) AS price
     FROM orders
     WHERE orderpriority = '1-URGENT'
     GROUP BY custkey, clerk
 )
 SELECT clerk, COUNT(*), SUM(price)
 FROM urgent
 WHERE price > (SELECT avg(price) FROM urgent)
 GROUP BY clerk;

CTE reuse also optimizes repeated subqueries that are not explicitly defined as CTEs. In the following query, the same subquery appears in both branches of the UNION ALL:

SELECT max(name)
FROM nation
WHERE regionkey > 2
GROUP BY regionkey
UNION ALL
SELECT name
FROM region
WHERE name > ALL (
    SELECT min(name)
    FROM nation
    WHERE regionkey > 2
    GROUP BY regionkey
);