Cache service#
The Starburst Enterprise platform (SEP) cache service provides the ability to configure and automate the management of table scan redirections. The service connects to an existing SEP installation to run queries for copying data from the source catalog to the target catalog. The target catalog is regularly synchronized with the source and used as a cache.
The standalone application runs in your cluster. You can interact with it using its REST API or the cache service CLI.
Note
Table scan redirection and the cache service require a valid Starburst Enterprise license.
Requirements#
The cache service has similar requirements to SEP, which are described on the Deploying page.
Linux Operating System#
- 64-bit required
- Newer release preferred, especially when running on containers
Java Runtime Environment#
The cache service requires a 64-bit version of Java 11. Newer major versions such as Java 12 or 13 are not supported. They may work, but are not tested.
Python#
- version 2.6.x, 2.7.x, or 3.x
- required by the bin/launcher script only
Relational database#
The cache service requires an externally managed database for storing table scan redirections data. The following RDBMS are supported:
- MySQL 8.0.12+
- OracleDB 12.2.0.1+
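The service database is configured later with a JDBC URL (see Configuration properties). For example, either supported database can be addressed with a standard JDBC URL; the host names, ports, and the redirections database name here are placeholders for your environment:
jdbc:mysql://mysql-server:13306/redirections
jdbc:oracle:thin:@oracle-server:1521/redirections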
Installation#
The cache service must be deployed separately from SEP in your cluster. This ensures that it is not affected by coordinator performance, or the deployment of a new release on the SEP cluster.
To download the cache service binary file, contact Starburst Customer Support. Support provides access to a file named like starburst-cache-service-*.tar.gz.
Extract it, for example with tar xfvz starburst-cache-service-*.tar.gz.
The resulting directory starburst-cache-service-nnn, with nnn replaced by the release number, is called the installation directory. It contains all necessary resources.
Move the extracted directory into the desired location, such as /opt/, and you are ready to proceed with configuring the service.
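For example, a minimal installation sequence could look like the following, with nnn standing in for the actual release number:
tar xfvz starburst-cache-service-nnn.tar.gz
mv starburst-cache-service-nnn /opt/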
Configuration#
Create an etc directory inside the installation directory to hold the following configuration files:
- etc/jvm.config - command line configuration options for starting the Java Virtual Machine (JVM) used by the cache service
- etc/config.properties - configuration for the cache service
- etc/log.properties - optional logging configuration for the cache service
- etc/rules.json - JSON file specifying the source tables and target connector for the cache, along with the schedule for refreshing them
JVM configuration#
The Java Virtual Machine (JVM) config file, etc/jvm.config, contains a list of command line options used for launching the JVM running the cache service. The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces or other special characters should not be quoted.
The following provides a good starting point for creating etc/jvm.config:
-server
-Xmx512M
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
An OutOfMemoryError typically leaves the JVM in an inconsistent state. The above configuration causes the JVM to write a heap dump file for debugging, and forcibly terminate the process when this occurs.
Configuration properties#
The configuration properties file, etc/config.properties, contains the configuration for the cache service.
The following is a minimal configuration for the service:
service-database.user=alice
service-database.password=test123
service-database.jdbc-url=jdbc:mysql://mysql-server:13306/redirections
starburst.user=bob
starburst.jdbc-url=jdbc:presto://presto-coordinator:8080
rules.file=etc/rules.json
The properties to configure the cache service are explained in detail in the following sections.
| Property name | Description | Default |
| --- | --- | --- |
| starburst.user | Username to connect to the SEP cluster for executing queries to refresh the cached tables | |
| starburst.password | Password to connect to the SEP cluster when password based authentication is enabled on the SEP cluster | |
| starburst.jdbc-url | JDBC URL of the SEP cluster used for executing queries to refresh the cached tables | |
| rules.file | Path to the JSON file containing rules for identifying source tables and the target connector for caching. It also specifies a schedule for refreshing cached tables. | |
| | Frequency at which cache rules are refreshed from the rules file | 1m |
| | Maximum number of table import jobs that can be run in parallel | 20 |
| | Frequency at which the cache service triggers refresh of cached tables | 10m |
| | Initial delay for startup of the refresh | 0s |
| | Frequency at which the cache service triggers cleanup of expired tables in the cache | 10m |
| | Initial delay for startup of the cleanup | 0s |
The following required properties allow you to configure the connectivity to the service database used for storing redirections.
| Property name | Description | Default |
| --- | --- | --- |
| service-database.user | Username used to connect to the database storing table redirections | |
| service-database.password | Password used to connect to the database storing table redirections | |
| service-database.jdbc-url | JDBC URL of the database storing table redirections; only MySQL and Oracle URLs are supported | |
| | Enables pooling for connections to the service database | true |
| | Maximum number of connections in the pool | 10 |
| | Maximum time an idle connection is kept in the pool | 10m |
The following optional properties allow you to configure the table import configuration used when running queries on SEP to populate the cached table.
| Property name | Description | Default |
| --- | --- | --- |
| | Number of writers per task when writing an unpartitioned table | 4 |
| | Scale writers when writing an unpartitioned table | false |
| | Target minimum size of writer output when writing an unpartitioned table with writer scaling | 32MB |
| | Use table partitioning to split data between writers when writing a partitioned table | true |
| | Number of writers per task when writing a partitioned table | 4 |
| | Scale writers when writing a partitioned table | false |
| | Target minimum size of writer output when writing a partitioned table with writer scaling | 32MB |
TLS and authentication#
File-based password authentication can be configured for the cache service by adding the following properties:
http-server.https.enabled=true
http-server.authentication.type=password
http-server.https.keystore.path=etc/auth/localhost.keystore
http-server.https.keystore.key=changeit
file.password-file=etc/auth/password.db
| Property name | Description | Default |
| --- | --- | --- |
| http-server.http.port | HTTP port for the cache service | 8180 |
| http-server.https.port | HTTPS port of the cache service | 8543 |
| http-server.https.enabled | Flag to activate HTTPS/TLS | false |
| http-server.authentication.type | Authentication type used for the cache service; use password for file-based password authentication | none |
| http-server.https.keystore.path | Path to the JKS keystore file used for TLS | |
| http-server.https.keystore.key | Name of the key in the JKS keystore used for TLS | |
| file.password-file | Path to the password file used with the file authentication type | |
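The password file is assumed here to use the same bcrypt-based format as SEP's file-based password authentication. Under that assumption, it can be managed with the htpasswd utility; the file path and user name below are examples:
htpasswd -B -C 10 etc/auth/password.db alice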
Log Levels#
The optional log levels file, etc/log.properties, allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the class that uses the logger. Loggers have a hierarchy based on the dots in the name, like Java packages. For example, consider the following log levels file:
com.starburstdata.presto.cache=INFO
This sets the minimum level to INFO for both com.starburstdata.presto.cache.db and com.starburstdata.presto.cache.rules. The default minimum level is INFO, thus the above example does not actually change anything. There are four levels: DEBUG, INFO, WARN and ERROR.
Cache refresh rules#
A JSON file, rules.json, is used to define rules for which tables are cached by the service, the target catalog, and the schedule for refreshing them. The following is a sample showing the supported ways of configuring redirections.
{
"defaultGracePeriod": "42m",
"defaultMaxImportDuration": "9s",
"defaultCacheCatalog": "default_cache_catalog",
"defaultCacheSchema": "default_cache_schema",
"defaultUnpartitionedImportConfig": {
"usePreferredWritePartitioning": false,
"writerCount": 128,
"scaleWriters": false,
"writerMinSize": "110MB"
},
"defaultPartitionedImportConfig": {
"usePreferredWritePartitioning": true,
"writerCount": 256,
"scaleWriters": false,
"writerMinSize": "52MB"
},
"rules": [
{
"catalogName": "test_catalog",
"schemaNameLike": "foo",
"tableNameLike": "bar",
"refreshInterval": "52s"
},
{
"catalogName": "test_catalog",
"schemaNameLike": "foo",
"tableNameLike": "bar",
"cronExpression": "* * * 2 *"
},
{
"catalogName": "some_catalog",
"schemaName": "xyz",
"tableName": "ijk",
"columns": [
"column1",
"column2"
],
"partitionColumns": [
"column1"
],
"refreshInterval": "123h",
"gracePeriod": "80m",
"maxImportDuration": "67h",
"cacheCatalog": "table_catalog",
"cacheSchema": "table_schema",
"importConfig": {
"usePreferredWritePartitioning": true,
"writerCount": 32,
"scaleWriters": false,
"writerMinSize": "100MB"
}
}
]
}
Each rule in the rules array defines what is cached, where the cached table is located, and a schedule for automated refresh. The source catalog is defined within each rule using catalogName. A specific source table can be defined by explicitly providing schemaName and tableName. One rule can match multiple source tables using schemaNameLike and tableNameLike patterns. A rule must specify either schemaName and tableName, or schemaNameLike and tableNameLike.
For each rule, the frequency of refreshing the cached table can be defined either by specifying a time duration in refreshInterval, or by a cron expression in cronExpression.
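For example, assuming standard five-field cron syntax (minute, hour, day of month, month, day of week), a rule that refreshes its cached tables every day at 02:00 uses:
"cronExpression": "0 2 * * *"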
The TTL (time to live) for a cached table is the time until the table expires. The service removes cached tables when they are no longer needed, such as when a newer redirection is present or the TTL has expired. The service waits for the grace period before removing a cached table. This allows any running queries that started just before the cached table expired to finish gracefully. The TTL is computed so that the service performs a complete source table import before the current cached table expires.
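For illustration, with a refreshInterval of 1h and a gracePeriod of 10m, a copy imported at 12:00 is replaced by a fresh import around 13:00, but is only removed at roughly 13:10, so queries that started against it shortly before 13:00 can still complete.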
The grace period can be specified for all cached tables using defaultGracePeriod, and overridden within each rule using gracePeriod. The target catalog for all cached tables can be defined using defaultCacheCatalog; each rule can override it using cacheCatalog. The target schema for all cached tables can be defined using defaultCacheSchema; each rule can override it using cacheSchema.
Max import duration is the maximum allowed execution time for a query used to populate the cached table. It can be defined globally using defaultMaxImportDuration, or for each rule using maxImportDuration. It must be smaller than the refresh interval. This field is required when cronExpression is used to define a schedule for refreshing cached tables.
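For example, a rule scheduled with a cron expression must therefore include an explicit import bound. A sketch with illustrative values, using only the fields documented above:
{
  "catalogName": "test_catalog",
  "schemaName": "foo",
  "tableName": "bar",
  "cronExpression": "0 2 * * *",
  "maxImportDuration": "1h"
}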
Running the cache service#
The installation directory contains the launcher script in bin/launcher. The service can be started as a daemon by running the following:
bin/launcher start
Alternatively, it can be run in the foreground, with the logs and other output written to stdout/stderr. Both streams should be captured if using a supervision system like daemontools:
bin/launcher run
Run the launcher with --help to see the supported commands and command line options. In particular, the --verbose option is very useful for debugging the installation.
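For example, assuming the launcher supports the same commands as the SEP launcher, the following runs the service in the foreground with debug output, checks the status of a daemonized service, and stops it:
bin/launcher --verbose run
bin/launcher status
bin/launcher stop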
The launcher configures default values for the configuration directory etc, configuration files, the data directory var, and log files in the data directory. You can change these values to adjust your usage to any requirements, such as using a directory outside the installation directory, specific mount points or locations, and even using other file names.
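For example, assuming the launcher accepts the same path options as the SEP launcher, the configuration and data directories can be relocated as follows; the paths are illustrative:
bin/launcher start --etc-dir /etc/starburst-cache-service --data-dir /var/lib/starburst-cache-service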
After starting the cache service, you can find log files in the log directory inside the data directory var:
- launcher.log: This log is created by the launcher and is connected to the stdout and stderr streams of the server. It contains a few log messages that occur while the server logging is being initialized, and any errors or diagnostics produced by the JVM.
- server.log: This is the main log file used by the service. It typically contains the relevant information if the server fails during initialization. It is automatically rotated and compressed.
- http-request.log: This is the HTTP request log which contains every HTTP request received by the server. It is automatically rotated and compressed.
Docker container#
It is possible to run the cache service in a Docker container for initial exploration and testing, as well as for deployments in Kubernetes.
The Docker image is available at starburstdata/starburst-cache-service.
Getting started#
You can run the container locally, making sure you publish port 8180 of the service. For example:
docker run -p 8180:8180 -v ~/starburst-cache-service/etc:/usr/lib/starburst-cache-service/etc --network="host" --rm harbor.starburstdata.net/starburstdata/starburst-cache-service:latest
For administration, direct a locally installed cache service CLI to the same port:
cache-cli current_redirection --server localhost:8180 --source=mysql.test.nation_tmp
JMX metrics#
Metrics about table import are reported in the JMX table jmx.current."com.starburstdata.presto.cache:name=TableImportService".
Metrics about cached table cleanup are reported in the JMX table jmx.current."com.starburstdata.presto.cache:name=CleanupService".
Metrics about redirections requests on the web service resources are reported in the JMX table jmx.current."com.starburstdata.presto.cache:name=RedirectionsResource".
Metrics about table import and expiration requests on the web service resource are reported in the JMX table jmx.current."com.starburstdata.presto.cache:name=CacheResource".