CTE reuse#
CTE reuse optimizes queries by identifying common table expressions and subqueries that appear multiple times within a query and executing them only once.
When you enable CTE reuse, Starburst Enterprise platform (SEP) executes matching query fragments once, saves the intermediate results, and reuses them at different points in the query plan. CTE reuse also detects similar but non-identical query fragments.
CTE reuse operates within the scope of a single SQL query. Results are not reused across different queries.
Note
CTE reuse is available as a public preview feature. Contact your Starburst account team with questions or feedback.
Configuration#
To enable CTE reuse globally, add the following property to your coordinator’s
config.properties file:
optimizer.reuse-common-subqueries=true
To enable CTE reuse on a per-query basis, use the reuse_common_subqueries
session property:
SET SESSION reuse_common_subqueries = true;
The following object storage connectors support CTE reuse:
The following non-object storage connectors support CTE reuse:
CTE supports most basic query operators, including:
Table scans, including filters and projections
Aggregations
Joins
Unions
Exchange manager#
CTE reuse requires an exchange manager to transfer intermediate results between query stages. Use one of the following options:
Embedded buffer service#
The embedded buffer service runs within your SEP cluster and temporarily stores intermediate query results in memory. When the buffer service runs out of memory, it writes data to object storage.
Enable the embedded buffer service#
To enable the embedded buffer service, add the following property to the
config.properties file on all nodes, including the coordinator:
embedded-buffer-service-enabled=true
Configure the exchange manager to use the embedded buffer service by adding the
following properties to the exchange-manager.properties file on all nodes:
exchange-manager.name=buffer
exchange.use-embedded-buffer-service=true
Configure buffer service memory#
To control how much memory the buffer service uses, configure the following
properties in the config.properties files on worker nodes:
Property name |
Description |
Default |
|---|---|---|
|
The total memory assigned to the buffer service subprocess as a
fraction of heap memory. Do not set higher than |
|
|
The memory reserved for storing intermediate result data as a fraction of
heap memory. Set this value approximately 5 percentage points lower than
|
|
Configure chunks spooling#
When the embedded buffer service runs out of memory, it writes data to an
external object storage location. Configure this location in the
config.properties on worker nodes using the buffer.spooling.directory and
the appropriate storage-specific properties.
Note
Ensure workers have direct read access to the spooling location. Workers inherit the object storage configuration from the embedded buffer service.
AWS S3#
To configure AWS S3 as the spooling location, use the following configuration properties:
Property name |
Description |
|---|---|
|
The object storage location for spooled data. Use an |
|
The access key for the S3 API. |
|
The secret key for the S3 API. |
|
The AWS region of the S3 spooling bucket. |
|
The S3 API endpoint. Required when using an S3-compatible service such as MinIO. |
Google Cloud Storage#
To configure Google Cloud Storage as the spooling location, use the following configuration properties:
Property name |
Description |
|---|---|
|
The object storage location for spooled data. Use a |
|
The S3 API endpoint. Set to |
|
The GCS HMAC access key. |
|
The GCS HMAC secret key. |
|
The GCS service account credentials in JSON format. |
Azure Blob Storage#
To configure Azure Blob Storage as the spooling location, use the following configuration properties:
Property name |
Description |
|---|---|
|
The object storage location for spooled data. Use an |
|
The Azure storage connection string, including credentials. |
Examples#
The following query defines a CTE named emea and references it twice: once in
the main query and once in a subquery. With CTE reuse enabled, SEP
computes the emea result set once and reuses it for both references.
WITH emea AS (
SELECT country, price
FROM transactions
WHERE region = 'EMEA'
)
SELECT country, COUNT(*), SUM(price)
FROM emea
WHERE price > (SELECT avg(price) FROM emea)
GROUP BY country;
CTE reuse also optimizes repeated subqueries that are not explicitly defined as
CTEs. In the following query, the same subquery appears in both branches of the
UNION ALL:
SELECT min(name)
FROM nation
WHERE regionkey > 2
GROUP BY regionkey
UNION ALL
SELECT name
FROM region
WHERE name > ALL (
SELECT max(name)
FROM nation
WHERE regionkey > 2
GROUP BY regionkey
);