Managed statistics#

You can configure Starburst Enterprise platform (SEP) to collect and store its own managed Table statistics for some data sources that only collect or expose a limited number of statistics. These additional statistics can then enable the query planner to make better-informed, cost-based optimizations.

Note

Managed statistics is a public preview feature. Contact Starburst Support with questions or feedback.

Configuration#

Managed statistics requires the following:

Note

The managed statistics feature is only supported in select connectors. Reference the connectors feature matrix for more information about which connectors support this feature.

Coordinator node configuration#

In order to enable managed statistics for a cluster, add the following configuration properties to config.properties on the coordinator node only:

starburst.managed-statistics.enabled=true

Catalog configuration#

In addition to enabling managed statistics for the cluster, you must also enable managed statistics for each catalog.

To enable managed statistics for a catalog, add the following catalog configuration property to the catalog configuration file under etc/catalog:

managed-statistics.enabled=true

Managed statistics configuration properties#

The following configuration properties are used to control managed statistics on the coordinator:

Managed statistics configuration properties#

Property name

Description

starburst.managed-statistics.enabled

Enables managed statistics on a cluster. Defaults to false.

managed-statistics.cache-ttl

Maximum amount of time for managed statistics to be stored on the in-memory cache on the coordinator before they are retrieved from the database. For more information see statistics storage and management. Default value is 10m.

managed-statistics.cache-maximum-size

Maximum number of objects stored in the in-memory statistics cache. If this value is exceeded, the least-used statistics are discarded and only retrieved again from the database when the query planner needs to reference them. Defaults to 1000.

Collecting statistics#

Managed statistics must be collected manually with the ANALYZE statement or the collect_statistics command in an ALTER TABLE EXECUTE statement depending on the connector. The optional column parameter is used to limit statistics collection to a specified array of column names. See the connector documentation for more information on ALTER TABLE EXECUTE or ANALYZE statement support and syntax for that connector.

The following statistics are collected:

  • For a table:

    • Row count: The total number of rows in the table.

  • For each column in a table:

    • Nulls fraction: The fraction of null values.

    • Distinct value count: The number of distinct values.

    • Range: For data types that have a range, the range of values.

    • Data size: For textual data types, the total data size of all values.

Note

Other methods of collecting statistics continue to be used for a table until you execute the collect_statistics command on that table. Collecting statistics for one table using the collect_statistics command does not prevent SEP from continuing to collect statistics for other tables from the catalog.

Storage and management#

When statistics are collected, they are stored in the backend service database indefinitely until collect_statistics is run again and overwrites any stale statistics. These statistics are pulled from the database into an in-memory cache on the coordinator at a frequency defined by the managed-statistics.cache-ttl configuration property which defaults to 10 minutes. The query planner accesses this in-memory cache for statistics instead of the backend service database, reducing network I/O on the database at the cost of additional memory usage on the coordinator node. To disable the in-memory cache and have the query planner directly access the backend service database for statistics, set the managed-statistics.cache-ttl property to 0.

The in-memory cache retrieves its statistics from the SEP database, not from catalogs directly, so you must still routinely run the collect_statistics command to ensure that the cache does not store stale statistics.