Starburst shared connector features#

A number of Starburst Enterprise platform (SEP) connectors share features and behavior. For example, some use JDBC connections to the data source and support identical configuration options, performance improvements, or security features. These shared features are documented in the following sections and linked from the connectors that implement them.

Configuration#

Standard JDBC catalog properties#

A common JDBC connector implementation includes a catalog file with the following properties:

# Connector name (required property)
connector.name=generic-jdbc

# Connection URL (required for JDBC connectors)
connection-url=[connectorprotocol]://<host>:<port>;database=<database>

# Connection credentials
connection-user=USERNAME
connection-password=PASSWORD

While connector.name is the only property required in every catalog file, JDBC-based connectors also require a connection-url. Typically, username and password credentials are required as well.

Once the catalog properties file is established, you can add any additional configuration properties as needed.

Additional configuration properties#

After establishing a base catalog properties file, you can add other configuration properties. Connector-specific properties are defined in the documentation for each connector.

Available config properties#

case-insensitive-name-matching: Support case-insensitive matching of database and collection names. Default: false.

case-insensitive-name-matching.cache-ttl: Duration for which the results of case-insensitive name matching are cached. Default: 1m.

jdbc-types-mapped-to-varchar: Comma-separated list of data types that are forcibly mapped to unbounded VARCHAR. Default: no types are mapped.

metadata.cache-ttl: Duration for which metadata, including table and column statistics, is cached. Default: 0, which disables caching.

metadata.cache-missing: Cache the fact that metadata, including table and column statistics, is not available. Default: false.
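
For example, a complete catalog file for a hypothetical JDBC-based connector can combine the required properties with some of the shared properties above. The connector name, host, port, database, and credential values are placeholders:

# Required properties
connector.name=generic-jdbc
connection-url=[connectorprotocol]://example.net:1433;database=sales
connection-user=USERNAME
connection-password=PASSWORD

# Optional shared properties
case-insensitive-name-matching=true
metadata.cache-ttl=10m
metadata.cache-missing=true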

Performance#

The JDBC connector base includes a number of features that improve connector performance. These features must be extended in each connector to become available as user-facing features.

Table statistics#

Connectors support table and column statistics. The statistics are used for cost-based optimizations that improve query processing performance based on the actual data in the data source.

Collection of statistics varies for each data source.

Table and column statistics are enabled by default. You can disable them for a catalog by setting statistics.enabled to false. The metadata properties from the preceding section further control statistics caching.
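
For example, add the following line to the catalog properties file to turn off table and column statistics for that catalog:

statistics.enabled=false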

The statistics can be viewed in SEP using SHOW STATS.
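
For example, assuming a catalog named mycatalog with a table sales.orders, you can inspect the available statistics as follows; the catalog, schema, and table names are placeholders:

SHOW STATS FOR mycatalog.sales.orders;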

Join pushdown#

Join pushdown allows the connector to delegate the table join operation to the underlying data source. This can result in performance gains, and allows Trino to perform the remaining query processing on a smaller amount of data.

The specifics of the supported pushdown of table joins vary for each data source and therefore for each connector.
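
You can typically confirm whether a join is pushed down by inspecting the query plan with EXPLAIN: a pushed-down join is represented by a single table scan against the remote data source instead of a join operation in Trino. The catalog, schema, and table names below are placeholders:

EXPLAIN
SELECT c.name, o.total
FROM mycatalog.sales.orders o
JOIN mycatalog.sales.customers c ON o.customer_id = c.id;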

Limit pushdown#

A LIMIT N clause reduces the number of returned records for a statement to N rows. LIMIT pushdown enables a connector to push processing of such queries to the underlying data source.

A pushdown of this clause to the underlying data source can improve the performance of the query and significantly reduce the amount of data transferred from the data source to Trino.

Implementation and support are connector-specific, since different data sources support different SQL syntax.
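
For example, with LIMIT pushdown the following query retrieves only ten rows from the data source instead of fetching all rows and discarding most of them in Trino; the catalog, schema, and table names are placeholders:

SELECT * FROM mycatalog.sales.orders LIMIT 10;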

Order by/limit pushdown#

An ORDER BY ... LIMIT N clause reduces the number of returned records for a statement to N rows. It relies on the ordering to determine which records must be returned, and is therefore quite different to optimize compared to LIMIT pushdown. ORDER BY ... LIMIT pushdown enables a connector to push processing of such queries to the underlying data source.

A pushdown of this clause to the underlying data source can improve the performance of the query and significantly reduce the amount of data transferred from the data source to Trino.

Implementation and support are connector-specific, since different data sources support different SQL syntax.
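
For example, the following top-N query can be processed by a data source that supports this pushdown, so that only the ten rows with the highest total are transferred to Trino; the catalog, schema, table, and column names are placeholders:

SELECT order_id, total
FROM mycatalog.sales.orders
ORDER BY total DESC
LIMIT 10;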

Aggregation pushdown#

Some aggregation functions can be pushed down to the data source for processing. This can result in significant performance gains. Each data source supports different aggregation functions, and pushdown therefore differs for each connector.
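
For example, if the data source supports pushdown of count and sum, the following query is aggregated in the data source and only one row per nation is transferred to Trino; the catalog, schema, table, and column names are placeholders:

SELECT nation, count(*) AS order_count, sum(total) AS revenue
FROM mycatalog.sales.orders
GROUP BY nation;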

Dynamic filtering#

Dynamic filtering is enabled by default. It causes the connector to wait for dynamic filtering to complete before starting a JDBC query.

You can disable dynamic filtering by setting the property dynamic-filtering.enabled in your catalog properties file to false.
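
For example, add the following line to the catalog properties file:

dynamic-filtering.enabled=false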

Wait timeout#

By default, table scans on the connector are delayed up to 20 seconds until dynamic filters are collected from the build side of joins. Using a large timeout can potentially result in more detailed dynamic filters. However, it can also increase latency for some queries.

You can configure the dynamic-filtering.wait-timeout property in your catalog properties file:

dynamic-filtering.wait-timeout=1m

You can use the dynamic_filtering_wait_timeout catalog session property in a specific session:

SET SESSION mycatalog.dynamic_filtering_wait_timeout = 1s;

Compaction#

The maximum size of the dynamic filter predicate that is pushed down to the connector during a table scan for a column is configured using the domain-compaction-threshold property in the catalog properties file:

domain-compaction-threshold=100

You can use the domain_compaction_threshold catalog session property:

SET SESSION mycatalog.domain_compaction_threshold = 10;

By default, domain-compaction-threshold is set to 32. When the dynamic predicate for a column exceeds this threshold, it is compacted into a single range predicate.

For example, if the dynamic filter collected for a date column dt on the fact table selects more than 32 days, the filtering condition is simplified from dt IN ('2020-01-10', '2020-01-12',..., '2020-05-30') to dt BETWEEN '2020-01-10' AND '2020-05-30'. Using a large threshold can result in increased table scan overhead due to a large IN list getting pushed down to the data source.

Metrics#

Metrics about dynamic filtering are reported in a JMX table for each catalog using any of these connectors, shown here for a catalog named mycatalog:

jmx.current."com.starburstdata.presto.plugin.jdbc.dynamicfiltering:name=mycatalog,type=dynamicfilteringstats"

For the Snowflake connector, the JMX table name additionally includes a catalog attribute:

jmx.current."com.starburstdata.presto.plugin.jdbc.dynamicfiltering:catalog=snowflake,name=snowflake,type=dynamicfilteringstats"

Metrics include information about the total number of dynamic filters, the number of completed dynamic filters, the number of awaitable dynamic filters and the time spent waiting for dynamic filters.
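
Assuming a catalog named jmx is configured with the JMX connector, you can query these metrics with standard SQL; mycatalog is a placeholder:

SELECT *
FROM jmx.current."com.starburstdata.presto.plugin.jdbc.dynamicfiltering:name=mycatalog,type=dynamicfilteringstats";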

JDBC connection pooling#

JDBC connection pooling is disabled by default. When enabled, each worker creates and maintains a connection pool instead of sequentially opening and closing separate connections to the data source. Each connection in the pool is available to connect to the data source and retrieve data. After an operation completes, the connection is returned to the pool and can be reused. This improves performance by a small amount and reduces the load on any authentication system used for establishing the connection.

You can enable JDBC connection pooling by setting connection-pool.enabled=true in your catalog properties file.

Additional configuration properties to tune connection pooling are listed below. All values shown are the defaults, except connection-pool.enabled, which must be set to true to enable pooling.

connection-pool.enabled=true
connection-pool.max-size=10
connection-pool.max-connection-lifetime=30m
connection-pool.pool-cache-ttl=30m
connection-pool.pool-cache-max-size=1000

connection-pool.max-size controls the maximum number of idle and active connections in the pool.

connection-pool.max-connection-lifetime controls the maximum lifetime of a connection. When a connection reaches this limit, it is removed from the pool, even if it was recently used.

connection-pool.pool-cache-max-size controls the maximum size of the JDBC data source cache.

connection-pool.pool-cache-ttl controls how long a cached data source is retained after it was last accessed.

Table scan redirection#

Table scan redirection enables SEP to offload data access to tables accessed in one catalog to equivalent tables accessed in another catalog. This can improve performance by shifting data access to a more performant system. It can also reduce load on a data source.

Redirection is transparent to the user, and therefore provides performance improvements without the need to modify queries.

A typical use case is redirection from a catalog that accesses a relational database to a catalog that uses the Hive connector to access a data lake. That catalog can also take advantage of Hive connector storage caching.

Redirection of table scans is performed by SEP after applying authentication and permission checks from the source catalog.

Table scan redirection for a catalog can be disabled using the table_scan_redirection_enabled catalog session property:

SET SESSION mycatalog.table_scan_redirection_enabled = false;

Configuration#

Table scan redirection is enabled for a specific catalog mycatalog in etc/catalog/mycatalog.properties, using either a local JSON file or the cache service as the source of redirections.

File-based configuration

redirection.config-source=FILE
redirection.config-file=etc/redirection-rules.json

The configuration is located in a JSON file. The following sample redirection-rules.json file configures redirection of table scans from a catalog rdbms to a catalog datalake.

{
  "rdbms": {
    "schemas": {
      "schema_name_a": {
        "table_name_1": {
          "targetCatalog": "datalake",
          "targetSchema": "cache",
          "targetTable": "table_name_a"
        }
      },
      "schema_name_b": {
        "table_name_2": {
          "targetCatalog": "datalake",
          "targetSchema": "cache",
          "targetTable": "table_name_b",
          "columns": [
            "col1",
            "col2",
            "col3"
          ]
        },
        "table_name_3": {
          "targetCatalog": "datalake",
          "targetSchema": "cache",
          "targetTable": "table_name_c"
        }
      }
    }
  }
}

If the target table stores a subset of the columns from the source table, then the columns stored by the target table must be specified as shown above for rdbms.schema_name_b.table_name_2. When columns for a target table are not specified, it is assumed that it contains all the columns from the source table.

The above configuration results in the following behavior:

  • All scans on rdbms.schema_name_a.table_name_1 are redirected to datalake.cache.table_name_a.

  • Scans on rdbms.schema_name_b.table_name_2 using the columns col1, col2, col3 are redirected to datalake.cache.table_name_b.

  • All scans on rdbms.schema_name_b.table_name_3 are redirected to datalake.cache.table_name_c.

Redirections for multiple catalogs can be defined by adding key-value pairs along the lines of "rdbms": {...} to the above configuration file.

By default, SEP reloads rules from the configuration file every minute. This can be controlled using the redirection.refresh-period property in etc/config.properties.
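
For example, to reload the rules every five minutes, add the following to etc/config.properties; the five-minute value is only an illustration:

redirection.refresh-period=5m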

Configuration with the cache service

The cache service can be used as the source of redirections as an alternative to the configuration file, using the following configuration:

redirection.config-source=SERVICE
cache-service.uri=http://starburst-cache-service:8180

By default, the redirections obtained from the cache service are cached within SEP for one minute. This can be changed using the cache-service.cache-ttl property.
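
For example, to cache redirections for five minutes instead, add the following to the catalog properties file; the value is only an illustration:

cache-service.cache-ttl=5m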

When authentication is enabled for the cache service, use a configuration like the following to connect to it:

redirection.config-source=SERVICE
cache-service.uri=https://starburst-cache-service:8180
cache-service.user=test
cache-service.password=test
cache-service.http-client.trust-store-path=localhost.truststore
cache-service.http-client.trust-store-password=changeit
cache-service.http-client.https.hostname-verification=false

Metrics#

Metrics about table scan redirection are reported in a JMX table for each catalog using any of these connectors, shown here for a catalog named mycatalog:

jmx.current."com.starburstdata.presto.plugin.jdbc.redirection:name=mycatalog,type=redirectionstats"

For the Snowflake connector, the JMX table name additionally includes a catalog attribute:

jmx.current."com.starburstdata.presto.plugin.jdbc.redirection:catalog=snowflake,name=snowflake,type=redirectionstats"

Metrics include information about the number of successful redirections and the number of times redirection could not be performed due to the target table missing some columns from the source table.

Limitations#

  • Creation and refresh of target tables is not managed by SEP.

  • Data types of columns stored in the target table must match the data types of those columns in the source table.