CTE reuse#
CTE reuse optimizes queries by identifying common table expressions and subqueries that appear multiple times within a query and executing them only once.
When you enable CTE reuse, Starburst Enterprise platform (SEP) executes matching query fragments once, saves the intermediate results, and reuses them at different points in the query plan. CTE reuse also detects similar but non-identical query fragments.
CTE reuse operates within the scope of a single SQL query. Results are not reused across different queries.
Note
CTE reuse is a best effort optimization and Starburst does not guarantee that it will merge and reuse all table scans, common blocks, or subqueries for every query.
Supported connectors and operators#
The following object storage connectors support CTE reuse:
The following non-object storage connectors support CTE reuse:
CTE reuse supports most basic query operators, including:
Table scans, including filters and projections
Aggregations
Joins
Unions
Configuration#
To enable CTE reuse globally, add the following property to your coordinator’s
config.properties file:
optimizer.reuse-common-subqueries=true
To enable CTE reuse on a per-query basis, use the reuse_common_subqueries
session property:
SET SESSION reuse_common_subqueries = true;
Exchange manager#
CTE reuse requires an exchange manager to transfer intermediate results between query stages. Use one of the following options:
Filesystem exchange (only when used with FTE mode)
Embedded buffer service#
The embedded buffer service runs within your SEP cluster and temporarily stores intermediate query results in memory. When the buffer service runs out of memory, it writes data to object storage.
Enable the embedded buffer service#
To enable the embedded buffer service, add the following property to the
config.properties file on all nodes, including the coordinator:
embedded-buffer-service-enabled=true
Configure the exchange manager to use the embedded buffer service by adding the
following properties to the exchange-manager.properties file on all nodes:
exchange-manager.name=buffer
exchange.use-embedded-buffer-service=true
Configure buffer service memory#
To control how much memory the buffer service uses, configure the following
property in the config.properties files on worker nodes:
Property name |
Description |
Default |
|---|---|---|
|
The total memory assigned to the buffer service subprocess as a
fraction of heap memory. Do not set higher than |
|
Configure chunks spooling#
When the embedded buffer service runs out of memory, it writes data to an
external object storage location. Configure this location in the
config.properties on coordinator and worker nodes using the
buffer.spooling.directory and the appropriate storage-specific properties.
Note
Ensure all nodes (both coordinator and workers) have direct read access to the spooling location. Workers inherit the object storage configuration from the embedded buffer service.
AWS S3#
To configure AWS S3 as the spooling location, use the following configuration properties:
Property name |
Description |
|---|---|
|
The object storage location for spooled data. Use an |
|
The access key for the S3 API. |
|
The secret key for the S3 API. |
|
The AWS region of the S3 spooling bucket. |
|
The S3 API endpoint. Required when using an S3-compatible service such as MinIO. |
Google Cloud Storage#
To configure Google Cloud Storage as the spooling location, use the following configuration properties:
Property name |
Description |
|---|---|
|
The object storage location for spooled data. Use a |
|
The GCS HMAC access key. |
|
The GCS HMAC secret key. |
|
The S3 API endpoint. Set to |
|
The GCS service account credentials in JSON format. |
|
Path to a JSON file containing the GCS service account credentials. |
Azure Blob Storage#
To configure Azure Blob Storage as the spooling location, use the following configuration properties:
buffer.spooling.directory=abfs://<FILESYSTEM_CONTAINER>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<SPOOL_PATH>
buffer.spooling.azure.connection-string=DefaultEndpointsProtocol=https;AccountName=<STORAGE_ACCOUNT_NAME>;AccountKey=<STORAGE_ACCOUNT_KEY>
Property name |
Description |
|---|---|
|
The object storage location for spooled data. Use an |
|
The Azure storage connection string, including credentials. |
Examples#
The following query defines a CTE named urgent that aggregates order counts
and total prices for urgent orders, grouped by customer and clerk. The main
query references urgent twice: once to filter results using a subquery that
computes the average price from urgent, and once to aggregate the filtered
results by clerk. With CTE reuse enabled, SEP computes the urgent result
set once and reuses it for both references.
WITH urgent AS (
SELECT custkey, clerk, COUNT(*) AS order_count, SUM(totalprice) AS price
FROM orders
WHERE orderpriority = '1-URGENT'
GROUP BY custkey, clerk
)
SELECT clerk, COUNT(*), SUM(price)
FROM urgent
WHERE price > (SELECT avg(price) FROM urgent)
GROUP BY clerk;
CTE reuse also optimizes repeated subqueries that are not explicitly defined as
CTEs. In the following query, the same subquery appears in both branches of the
UNION ALL:
SELECT max(name)
FROM nation
WHERE regionkey > 2
GROUP BY regionkey
UNION ALL
SELECT name
FROM region
WHERE name > ALL (
SELECT min(name)
FROM nation
WHERE regionkey > 2
GROUP BY regionkey
);