Table scan redirection#

Table scan redirection enables SEP to offload data access to tables accessed in one catalog to equivalent tables accessed in another catalog. This can improve performance by shifting data access to a more performant system. It can also reduce load on a data source.

Redirection is transparent to the user, and therefore provides performance improvements without the need to modify queries.

A typical use case is the redirection from a catalog configuring a relational database to a catalog using the Hive connector to access a data lake. That catalog can also take advantage of Hive connector storage caching.

Redirection of table scans is performed by SEP after applying authentication and permission checks from the source catalog.

The origin catalog must use one of the following connectors:

The target catalog can use an identical connector for maximum compatibility, or any other connector. Data types are translated based on the type mapping of the connector used for the source and target catalog. This type mapping can be customized to work around unsupported types by setting an explicit type mapping for the target catalog.

If table properties like partitioning, bucketing, sorting are used, then the target can only be Hive as other connectors don’t support these table properties.

Table scan redirection for a catalog can be disabled using the table_scan_redirection_enabled catalog session property:

SET SESSION mycatalog.table_scan_redirection_enabled = false;

Configuration#

Table scan redirection is enabled for a specific catalog mycatalog in etc/catalog/mycatalog.properties using either the cache service, or a local JSON file as the source of redirections.

Configuration with the cache service

The cache service is the recommended source of redirections for production usage. It is responsible for the automatic creation and refresh of cached tables based on the configured refresh rules. You can customize the type mapping using configured type mapping rules The cache service is configured as the source of redirections using the following properties:

redirection.config-source=SERVICE
cache-service.uri=http://starburst-cache-service:8180

By default, the redirections obtained from the cache service are cached within SEP for 1 minute. This can be changed using cache-service.cache-ttl property.

When authentication is enabled for the cache service, the below configuration can be used to connect to it.

redirection.config-source=SERVICE
cache-service.uri=https://starburst-cache-service:8180
cache-service.user=test
cache-service.password=test
cache-service.http-client.trust-store-path=localhost.truststore
cache-service.http-client.trust-store-password=changeit
cache-service.http-client.https.hostname-verification=false

File-based configuration

File-based configuration is not recommended as the source of redirections for production usage. Creation and refresh of target tables is not managed by SEP with this configuration.

redirection.config-source=FILE
redirection.config-file=etc/redirection-rules.json

The configuration is located in a JSON file. The following sample redirection-rules.json file configures redirection of table scans from a catalog rdbms to a catalog datalake.

{
  "rdbms": {
    "schemas": {
      "schema_name_a": {
        "table_name_1": {
          "targetCatalog": "datalake",
          "targetSchema": "cache",
          "targetTable": "table_name_a"
        }
      },
      "schema_name_b": {
        "table_name_2": {
          "targetCatalog": "datalake",
          "targetSchema": "cache",
          "targetTable": "table_name_b",
          "columns": [
            "col1",
            "col2",
            "col3"
          ]
        },
        "table_name_3": {
          "targetCatalog": "datalake",
          "targetSchema": "cache",
          "targetTable": "table_name_c"
        }
      }
    }
  }
}

If the target table stores a subset of the columns from the source table, then the columns stored by the target table must be specified as shown above for rdbms.schema_name_b.table_name_2. When columns for a target table are not specified, it is assumed that it contains all the columns from the source table.

The above configuration results in the following behavior:

  • All scans on rdbms.schema_name_a.table_name_1 are redirected to datalake.cache.table_name_a.

  • Scans on rdbms.schema_name_b.table_name_2 using the columns col1, col2, col3 are redirected to datalake.cache.table_name_b.

  • All scans on rdbms.schema_name_b.table_name_3 are redirected to datalake.cache.table_name_c.

Redirections for multiple catalogs can be defined by adding key-value pairs along the lines of "rdbms": {...} to the above configuration file.

By default, SEP reloads rules from the configuration file every minute. This can controlled using redirection.refresh-period property in etc/config.properties.

Metrics#

Metrics about table scan redirection are reported in a JMX table for each specific catalog mycatalog using any connector:

jmx.current."com.starburstdata.presto.plugin.jdbc.redirection:name=mycatalog,type=redirectionstats"

Metrics include information about the number of successful redirections and the number of times redirection could not be performed due to the target table missing some columns from the source table.

Limitations#

  • Redirections are supported for Hive tables but not Hive views.