Cache service#
The Starburst Enterprise platform (SEP) cache service provides the ability to configure and automate the management of table scan redirections and materialized views in supported connectors.
The service connects to an existing SEP installation to run queries for copying data from the source catalog to the target catalog. The target catalog is regularly synchronized with the source and used as a cache.
The cache service can be run as a standalone service or within the coordinator process. You can interact with it using its REST API, or the cache service CLI.
Note
Table scan redirection and the cache service requires a valid Starburst Enterprise license.
Requirements#
The cache service has similar requirements to SEP, which are described on the Deploying page.
Linux Operating System#
64-bit required
Newer release preferred, especially when running on containers
Java Runtime Environment#
The cache service requires a 64-bit version of Java 11. Newer major versions such as Java 12 or 13 are not supported – they may work, but are not tested.
Python#
version 2.6.x, 2.7.x, or 3.x
required by the
bin/launcher
script only
Relational database#
The cache service requires an externally managed database for storing table scan redirections data and materialized view definitions. The following RDBMS are supported:
MySQL 8.0.12 or higher
PostgreSQL 9.6 or higher
OracleDB 12.2.0.1 or higher
Table scan redirection#
The cache service requires configured table scan redirection for each source catalog where table scans are desired.
Materialized views#
The cache service handles materialized view refreshes. To support this, each supported catalog where storage tables reside must be configured to allow the creation of these tables.
Installation#
The cache service can be deployed either as a standalone application separate from your SEP cluster or within the existing coordinator process.
Standalone deployment#
A standalone deployment ensures that the service is not affected by coordinator performance, or the deployment of a new release on the SEP cluster.
Deployment of the cache service in Kubernetes can be managed with the available Helm chart .
Manual deployment relies on using a tarball:
To download the cache service binary file, contact Starburst Support
Starburst Support provides access to a file named much like
starburst-cache-service-*.tar.gz
Extract it, for example with
tar xfvz starburst-cache-service-*.tar.gz
The resulting directory starburst-cache-service-nnn
, with nnn replaced by
the release number, is called the installation directory. It contains all
necessary resources.
Move the extracted directory into the desired location, such as /opt/
, and
you are ready to proceed with configuring the service.
Embedded mode#
The cache service can be setup to run within the coordinator process by
providing configuration for the cache service on the coordinator in
etc/cache.properties
. This mode of deployment does not require
installation of any additional packages or running a separate service.
Keep in mind that the coordinator JVM is used, and its JVM and logging configuration is also used for the cache service.
Configuration#
Create an etc
directory inside the installation directory to hold the
following configuration files:
etc/jvm.config
- command line configurations options for starting the Java Virtual Machine (JVM) used by the cache serviceetc/config.properties
- configuration for the cache serviceetc/log.properties
- optional logging configuration for the cache serviceetc/rules.json
- JSON file specifying the source tables and target connector for the cache along with the schedule for refreshing themetc/type-mapping.json
- optional JSON file to specify type mapping between source and target catalogs.
JVM configuration#
The Java Virtual Machine (JVM) config file, etc/jvm.config
, contains a list
of command line options used for launching the JVM running the cache service.
The format of the file is a list of options, one per line. These options are not
interpreted by the shell, so options containing spaces, or other special
characters, should not be quoted.
The following provides a good starting point for creating etc/jvm.config
:
-server
-Xmx512M
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
An OutOfMemoryError
typically leaves the JVM in an inconsistent state. The
above configuration causes the JVM to write a heap dump file for debugging, and
forcibly terminate the process when this occurs.
Configuration properties#
The configuration properties file, etc/config.properties
, contains the
configuration for the cache service, when deployed as a standalone system. Users of the embedded mode use a etc/cache.properties
file on the coordinator
with the same properties.
The following is a minimal configuration for the service:
service-database.user=alice
service-database.password=test123
service-database.jdbc-url=jdbc:mysql://mysql-server:13306/redirections
starburst.user=bob
starburst.jdbc-url=jdbc:trino://coordinator:8080
rules.file=etc/rules.json
The properties to configure the cache service are explained in detail in the following sections.
Property name |
Description |
Default |
---|---|---|
|
Username to connect to the SEP cluster for executing queries to refresh the cached tables. This user must also be able to read the source tables for any materialized views it refreshes. |
|
|
Password to connect to the SEP cluster when password based authentication is enabled on the SEP cluster. |
|
|
JDBC URL of the SEP cluster used for executing queries to refresh the
cached tables. You can use JDBC driver parameters in the connection string to configure details of
the connection. For example, use |
|
|
URI of the cache service, including the hostname and port number. |
|
|
Path to the JSON file containing rules for identifying source tables and target connector for caching. It also specifies a schedule for refreshing cached tables. |
|
|
Frequency at which cache rules are refreshed from the |
1m |
|
Maximum number of table import jobs that can be run in parallel. |
20 |
|
Frequency at which the cache service triggers refresh of cached tables. |
2m |
|
Initial delay for startup of the refresh. |
0s |
|
Frequency at which cache service triggers cleanup of expired tables in the cache. |
5m |
|
Initial delay for startup of the cleanup. |
0s |
The following required properties allow you to configure the connectivity to the service database used for storing redirections.
Property name |
Description |
Default |
---|---|---|
|
Username used to connect to the database storing table redirections |
|
|
Password used to connect to the database storing table redirections |
|
|
JDBC URL of the database storing table redirections, only MySQL, PostgreSQL and Oracle URLs are supported |
|
|
Enables pooling for connections to the service database |
true |
|
Maximum number of connections in the pool |
10 |
|
Maximum time an idle connection is kept in the pool |
10m |
The following optional properties allow you to configure the table import configuration used when running queries on SEP to populate the cached table projection.
Property name |
Description |
Default |
---|---|---|
|
Number of writers per task when writing unpartitioned table |
4 |
|
Scale writers when writing unpartitioned table |
false |
|
Target minimum size of writer output when writing unpartitioned table with writers scaling |
32MB |
|
Use table partitioning to parallelize writes between worker nodes when writing to a partitioned table. This reduces import memory usage and improves cached table file sizes. |
true |
|
The minimum number of written partitions that is required to use connector
|
1 |
|
Number of writers per task when writing partitioned table |
4 |
|
Scale writers when writing partitioned table |
false |
|
Target minimum size of writer output when writing partitioned table with writers scaling |
32MB |
The following optional properties allow configuring type mapping rules:
Property name |
Description |
Default |
---|---|---|
|
Set the kind of type mapping to apply:
|
|
|
Path to the JSON file |
|
|
Frequency to use to refresh the type mapping rules from the
|
1m |
TLS and authentication#
File based password authentication can be configured for the cache service by adding the following properties:
http-server.https.enabled=true
http-server.authentication.type=password
http-server.https.keystore.path=etc/auth/localhost.keystore
http-server.https.keystore.key=changeit
file.password-file=etc/auth/password.db
Property name |
Description |
Default |
---|---|---|
|
HTTP port for the cache service |
8180 |
|
HTTPS port of the cache service |
8543 |
|
Flag to activate HTTPS/TLS |
false |
|
Authentication type used for the cache service, use |
none |
|
Path to the JKS keystore file used for TLS |
|
|
Name of the key in the JKS keystore used for TLS |
|
|
Path to the password file used with the file authentication type |
Log Levels#
The optional log levels file, etc/log.properties
, allows setting the
minimum log level for named logger hierarchies. Every logger has a name,
which is typically the fully qualified name of the class that uses the logger.
Loggers have a hierarchy based on the dots in the name, like Java packages.
For example, consider the following log levels file:
com.starburstdata.cache=INFO
This sets the minimum level to INFO
for both
com.starburstdata.cache.db
and
com.starburstdata.cache.rules
. The default minimum level is INFO
,
thus the above example does not actually change anything. There are four levels:
DEBUG
, INFO
, WARN
and ERROR
.
Cache refresh rules for redirections#
A JSON file, rules.json
, is used to define rules for which tables are cached
by the service, the target catalog, and the schedule for refreshing them. The
following is a sample showing the supported ways of configuring redirections.
{
"defaultGracePeriod": "42m",
"defaultMaxImportDuration": "1m",
"defaultCacheCatalog": "default_cache_catalog",
"defaultCacheSchema": "default_cache_schema",
"defaultUnpartitionedImportConfig": {
"usePreferredWritePartitioning": false,
"preferredWritePartitioningMinNumberOfPartitions": 1,
"writerCount": 128,
"scaleWriters": false,
"writerMinSize": "110MB"
},
"defaultPartitionedImportConfig": {
"usePreferredWritePartitioning": true,
"preferredWritePartitioningMinNumberOfPartitions": 40,
"writerCount": 256,
"scaleWriters": false,
"writerMinSize": "52MB"
},
"rules": [
{
"catalogName": "test_catalog",
"schemaNameLike": "foo",
"tableNameLike": "bar",
"refreshInterval": "10m"
},
{
"catalogName": "test_catalog",
"schemaNameLike": "foo",
"tableNameLike": "bar",
"cronExpression": "* * * 2 *"
},
{
"catalogName": "some_catalog",
"schemaName": "xyz",
"tableName": "ijk",
"columns": [
"column1",
"column2",
"column3"
],
"partitionColumns": [
"column1"
],
"bucketColumns": [
"column2"
],
"bucketCount": 5,
"sortColumns": [
"column3"
],
"refreshInterval": "123h",
"gracePeriod": "80m",
"maxImportDuration": "67h",
"cacheCatalog": "table_catalog",
"cacheSchema": "table_schema",
"importConfig": {
"usePreferredWritePartitioning": true,
"preferredWritePartitioningMinNumberOfPartitions": 4,
"writerCount": 32,
"scaleWriters": false,
"writerMinSize": "100MB"
},
"incrementalImportConfig": {
"usePreferredWritePartitioning": false,
"preferredWritePartitioningMinNumberOfPartitions": 1,
"writerCount": 4,
"scaleWriters": false,
"writerMinSize": "100MB"
}
},
{
"catalogName": "mysql",
"schemaName": "foo",
"tableName": "events",
"refreshInterval": "2m",
"gracePeriod": "15m",
"incrementalColumn": "event_id",
"deletePredicate": "event_date < date_add('day', -31, CURRENT_DATE)"
},
{
"catalogName": "postgresql",
"schemaName": "foo",
"tableName": "events",
"refreshInterval": "2m",
"gracePeriod": "15m",
"incrementalColumn": "event_hour",
"predicate": "event_hour < (SELECT date_add('hour', -1, latest_event_hour) FROM (SELECT max(event_hour) AS latest_event_hour FROM postgresql.foo.events))"
}
]
}
Each rule in the rules
array defines what is cached, where the cached table
is located and a schedule for automated refresh. The source catalog is defined
within each rule using catalogName
. A specific source table can be defined
by explicitly providing schemaName
and tableName
.
tableName
can point at either table or view. One rule can match
multiple source tables using schemaNameLike
and tableNameLike
patterns.
A rule must either specify schemaName
and tableName
or
schemaNameLike
and tableNameLike
.
For each rule, the frequency of refreshing cached table can be defined either by
specifying a time duration in refreshInterval
, or by a cron expression in
cronExpression
.
For bulk loaded cached tables, ttl is the time until the table expires. The
service removes cached tables when they are no longer needed, such as when a
newer redirection is present or ttl expired. The service waits for the
grace-period
before removing the cached table. This allows any running
queries, which started just before the cached table is expired, to finish
gracefully. The ttl is computed so that the service performs a complete source
table import before the current cached table expires.
Incremental refreshes#
By default, each refresh of a cached table results in all the rows from the specified columns of the source table getting bulk loaded into the cached table. When the source table is large and requires frequent refreshes, the refresh can be configured to load data from the source table incrementally.
For incrementally refreshed cached tables, an incrementalColumn
must be
specified. This column is used by the service to apply a incrementalColumn >
(SELECT max(incrementalColumn) FROM cachedTable)
filter when loading data
incrementally from the source table. This facilitates loading only newer data
from the source table instead of the entire table in each refresh iteration.
Depending on the properties of the source table, incremental refresh can be
setup in one of two ways shown in the example rules.json
above:
Use a strictly incrementing column from the source table to detect new rows. The service depends on the property of strictly increasing values to copy only the rows from source table where the value of
incrementalColumn
is greater than the values already present in the cached table. Modifications or deletions of existing rows are ignored.Use a monotonically incrementing column from the source table in combination with a configured
predicate
. Fact tables often contain time/date based columns which have the property of monotonically increasing, but not strictly increasing values. Thepredicate
field can be used to specify a filter condition that delays loading the latest data which could still be modified (e.g. new rows added or existing rows changed) in the source table. This filter is applied in addition to the filter applied onincrementalColumn
by the service to avoid loading the same data again.
The initial refresh for an incrementally cached table is a bulk load.
Modification of columns
, partitionColumns
, bucketColumns
,
bucketCount
, sortColumns
, incrementalColumn
, predicate
or deletePredicate
field results in the creation of a new cached table
which is re-populated from the source table from scratch.
For incrementally cached tables, the cached table is not removed until the
corresponding redirection rule is removed or modified in rules.json
. If the
cached table can not be successfully refreshed according to the specified
refresh frequency, it is not used for table scan redirection so that queries on
stale data are avoided.
The redirection rule for an incrementally cached table can also optionally
specify a deletePredicate
to clean up old data from the cached table. The
deletePredicate
is used to run a DELETE FROM cachedTable WHERE
<deletePredicate>
query at every cleanup-interval
. The connector used to
store the cached table must support the DELETE FROM
statement configured
through deletePredicate
.
The grace period can be specified for all cached tables using
defaultGracePeriod
, and overridden within each rule using gracePeriod
.
The target catalog for all cached tables can be defined using
defaultCacheCatalog
, each rule can override it using cacheCatalog
.
The target schema for all cached tables can be defined using
defaultCacheSchema
, each rule can override it using cacheSchema
.
Max import duration is the maximum allowed execution time for a query used to
populate the cached table. It can be defined globally using
defaultMaxImportDuration
, or for each rule using maxImportDuration
. It
must be smaller than the refresh interval and greater than or equal to 1m.
This field is compulsory when cronExpression
is used to define a schedule
for refresh of cached tables.
Each rule can optionally define partitioning for the target table by providing
a comma separated list of column names in partitionColumns
.
Similarly, bucketing and sorting can also be defined using bucketColumns
and sortColumns
respectively. bucketCount
must be provided when
bucketColumns
are provided. Table bucketing must be provided when
sortColumns
are provided.
The default import configuration for partitioned and non-partitioned cached
tables can be specified in the defaultUnpartitionedImportConfig
and
defaultPartitionedImportConfig
fields. Each rule can optionally specify an
importConfig
. It overrides the default import config when loading the cached
tables defined by the rule. The rule can also include an
incrementalImportConfig
which applies for incremental imports.
Note
The definition of the source table or view must remain compatible with it’s
corresponding redirection rule definition for incrementally cached tables. Any
table and columns changes need be followed with an update of the relevant
names to the caching rules. This includes rules that do not specify columns,
if any columns are added, removed or renamed in the source table. Such a
change must be followed with expiring the cache with the
expire_redirection
command from the cache service CLI to create a new redirection with the updated table
definition.
Cache type mapping rules#
You can configure the type mapping to use
definitions in a JSON file, type-mapping.json
, to define type
casting rules used when the target catalog is created or updated. The following
is a sample showing the supported ways of configuring redirections.
{
"rules": {
"tpch": {
"integer": "long"
},
"hive": {
"timestamp(0)": "timestamp(3)",
"timestamp(1)": "timestamp(3)",
"timestamp(2)": "timestamp(3)"
}
}
}
Each mapping in the rules
map refers to a different target catalog
where the data needs to be cast to a different type. The Trino name of any type
that is supported by the source catalog can be used as a key in the type map.
Map values must be Trino types supported by the target catalog.
Type casting is applied to regular columns, partition columns, and the column used for incremental update.
Columns can only be cast based on their type. It is impossible to cast one column of a given type without casting all columns of that same type.
Note
There are pre-configured type mappings CAST_TIMESTAMPS_TO_MILLISECONDS
,
CAST_TIMESTAMPS_TO_MICROSECONDS
and CAST_TIMESTAMPS_TO_NANOSECONDS
that can only be set using cache properties, not via the JSON type mapping.
These mappings extend the precision of all timestamp-related types, they never
truncate data, and ignore the target catalog name.
Note
Some mappings, like integer
-> varchar
, can change the semantics of
the max()
function used for calculating the contents of an incremental
update. For example "-1" > "7"
because string lengths are compared first,
so in incremental update the row with "-1"
is added even if the table
already contains data up to "7"
.
Note
Using timestamp columns for Hive partitioning is strongly discouraged.
Timestamp partitions are silently changed to timestamp(3)
while being
written by Hive, which can result in loss of precision for some columns.
Additionally, it can lead to a huge amount of partitions, which
negatively impacts performance.
Running the cache service#
The installation directory contains the launcher script in bin/launcher
. It
service can be started as a daemon by running the following:
bin/launcher start
Alternatively, it can be run in the foreground, with the logs and other output
written to stdout/stderr. Both streams should be captured if using a supervision
system like daemontools
:
bin/launcher run
Run the launcher with --help
to see the supported commands and command line
options. In particular, the --verbose
option is very useful for debugging
the installation.
The launcher configures default values for the configuration directory etc
,
configuration files, the data directory var
, and log files in the data
directory. You can change these values to adjust your usage to any requirements,
such as using a directory outside the installation directory, specific mount
points or locations, and even using other file names.
After starting the cache service, you can find log files in the log
directory inside the data directory var
:
launcher.log
: This log is created by the launcher and is connected to the stdout and stderr streams of the server. It contains a few log messages that occur while the server logging is being initialized, and any errors or diagnostics produced by the JVM.server.log
: This is the main log file used by the service. It typically contains the relevant information if the server fails during initialization. It is automatically rotated and compressed.http-request.log
: This is the HTTP request log which contains every HTTP request received by the server. It is automatically rotated and compressed.
JMX metrics#
Metrics about table import are reported in the JMX table
jmx.current."com.starburstdata.cache:name=TableImportService"
.
Metrics about cached table cleanup are reported in the JMX table
jmx.current."com.starburstdata.cache:name=CleanupService"
.
Metrics about redirections requests on the web service resources are reported in
the JMX table
jmx.current."com.starburstdata.cache.resource:name=RedirectionsResource"
.
Metrics about table import and expiration requests on the web service resource are reported in
the JMX table
jmx.current."com.starburstdata.cache.resource:name=CacheResource"
.