Starburst Splunk connector#

The Starburst Splunk connector allows querying data stored in Splunk.

Requirements#

To connect to Splunk, you need:

  • Splunk username and password credentials.

  • Network access from the coordinator and workers to Splunk. The default port is 8089.

  • A valid Starburst Enterprise license.

Configuration#

To configure the Splunk connector, create a catalog properties file in etc/catalog. For example, name the file mysplunk.properties to make Splunk available in a catalog named mysplunk.

connector.name=splunk
splunk.url=https://example.splunk.com:8089
splunk.user=admin
splunk.password=password
splunk.schema-directory=/path/to/schema/directory

The connector can only access Splunk with the credentials specified in the catalog configuration file:

  1. Set the connector.name property to splunk.

  2. Set splunk.url to the URL of your Splunk server, including the hostname and the management port, default 8089.

  3. Set splunk.user and splunk.password to the credentials for your Splunk account.

If you need to access Splunk with different credentials, configure a separate catalog.
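For example, a second catalog file such as etc/catalog/mysplunk_reporting.properties (the filename and credential values here are illustrative) connects to the same server with different credentials:

```properties
connector.name=splunk
splunk.url=https://example.splunk.com:8089
splunk.user=reporting_user
splunk.password=reporting_password
splunk.schema-directory=/path/to/schema/directory
```

Queries against the mysplunk_reporting catalog then run with the reporting_user credentials, while the mysplunk catalog continues to use its own.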

If the Splunk server is secured with TLS, you must export the server certificate and store it on the SEP coordinator and all workers in the same path on each, relative to the SEP etc directory. Then, configure the splunk.ssl-server-cert property to specify this path to the Splunk server certificate file.
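As a sketch, if the exported server certificate is stored as splunk-server.crt in the etc directory on every node (a hypothetical filename), the catalog properties file would reference it with the relative path:

```properties
splunk.ssl-server-cert=etc/splunk-server.crt
```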

General configuration properties#

The following table describes general catalog configuration properties for the connector:

| Property name | Description | Default value |
| ------------- | ----------- | ------------- |
| case-insensitive-name-matching | Support case insensitive schema and table names. | false |
| case-insensitive-name-matching.cache-ttl | Duration for which case insensitive schema and table name matching results are cached. | 1m |
| case-insensitive-name-matching.config-file | Path to a name mapping configuration file in JSON format that allows Trino to disambiguate between schemas and tables with similar names in different cases. | null |
| case-insensitive-name-matching.refresh-period | Frequency with which Trino checks the name matching configuration file for changes. | 0 (refresh disabled) |
| metadata.cache-ttl | Duration for which metadata, including table and column statistics, is cached. | 0 (caching disabled) |
| metadata.cache-missing | Cache the fact that metadata, including table and column statistics, is not available. | false |
| metadata.cache-maximum-size | Maximum number of objects stored in the metadata cache. | 10000 |
| write.batch-size | Maximum number of statements in a batched execution. Do not change this setting from the default. Non-default values may negatively impact performance. | 1000 |
| join-pushdown.enabled | Enable join pushdown. Equivalent catalog session property is join_pushdown_enabled. Enabling this may negatively impact performance for some queries. | false |
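For example, to enable case insensitive name matching and cache metadata, add properties such as the following to the catalog properties file (the values shown are illustrative, not recommendations):

```properties
case-insensitive-name-matching=true
metadata.cache-ttl=10m
metadata.cache-missing=true
```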

Access Splunk data#

In addition to Splunk system tables, the connector can use previously saved searches to generate reports for configured users. The name of the table in SEP is the name of the report in Splunk. To create a new table, run any search in Splunk, select Save As > Report, and specify a name.

The connector only supports accessing Splunk data via saved reports.
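For example, assuming a catalog named mysplunk and a saved report named orders (both hypothetical names; the schema name depends on your deployment), the report can be queried like any other table:

```sql
SELECT orderkey, totalprice
FROM mysplunk.splunk.orders
LIMIT 10;
```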

Metadata cache#

The connector caches report names that it reads from Splunk. If you create a new report in Splunk, it may not show up right away. You can manually reset the metadata cache by restarting SEP or running the following CALL statement:

CALL <catalog-name>.system.reset_metadata_cache();

Generate a schema file#

A schema file for a Splunk report details the SEP table name, columns, and data types. If a table does not have a schema file, the connector scans up to 50 rows from Splunk to dynamically detect the columns and types and generate an internal schema. Alternatively, you can create a schema file manually or generate one with the create_schema procedure.

When you manually create a schema file you must add it to the schema directory on all nodes to specify the correct data types. We recommend that you use the create_schema procedure to create the initial schema file, provide the table name, then edit the file as necessary. The coordinator generates the file in the configured schema directory with the name of the table and an rsd extension. If you edit the file, you must copy the new version to each host in the cluster.

CALL <catalog-name>.system.create_schema('orders');

The following example schema file for the TPC-H orders table was generated by the connector. The name of the file is orders.rsd. It must be placed in the directory specified by the splunk.schema-directory configuration property on every SEP node.

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.cdata.com/ns/rsbscript/2" xmlns:other="http://apiscript.com/ns?v1">
  <api:info title="orders" other:catalog="CData" other:schema="Splunk" description="null" other:earliest_time="0" other:savedsearch="true" other:tablename="orders" other:search=" from inputlookup:&quot;orders.csv&quot;" other:tablepath="servicesNS/admin/search/search/jobs"  other:version="20">
    <attr   name="clerk"            xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="clerk"            other:filterable="false"   />
    <attr   name="comment"          xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="comment"          other:filterable="false"   />
    <attr   name="custkey"          xs:type="int"      isrequired="false"   columnsize="4"      precision="10"   description=""                                                      other:internalname="custkey"          other:filterable="false"   />
    <attr   name="orderdate"        xs:type="date"     isrequired="false"   columnsize="3"                       description=""                                                      other:internalname="orderdate"        other:filterable="false"   />
    <attr   name="orderkey"         xs:type="int"      isrequired="false"   columnsize="4"      precision="10"   description=""                                                      other:internalname="orderkey"         other:filterable="false"   />
    <attr   name="orderpriority"    xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="orderpriority"    other:filterable="false"   />
    <attr   name="orderstatus"      xs:type="string"   isrequired="false"   columnsize="2000"                    description=""                                                      other:internalname="orderstatus"      other:filterable="false"   />
    <attr   name="shippriority"     xs:type="int"      isrequired="false"   columnsize="4"      precision="10"   description=""                                                      other:internalname="shippriority"     other:filterable="false"   />
    <attr   name="totalprice"       xs:type="double"   isrequired="false"   columnsize="8"      precision="15"   description=""                                                      other:internalname="totalprice"       other:filterable="false"   />
  </api:info>

  <api:script method="GET">
    <api:call op="splunkadoSelect">
      <api:push/>
    </api:call>
  </api:script>
</api:script>

Type mapping#

The connector maps Splunk data types to the following SEP types:

  • BOOLEAN

  • INTEGER

  • BIGINT

  • DOUBLE

  • VARCHAR

  • DATE

  • TIME(3)

  • TIMESTAMP(3)

No other types are supported.

SQL support#

The connector supports globally available and read operation statements to access data and metadata in Splunk.

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Dynamic filtering#

Dynamic filtering is enabled by default. It causes the connector to wait for dynamic filtering to complete before starting a query.

You can disable dynamic filtering by setting the property dynamic-filtering.enabled in your catalog properties file to false.
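For example, the following line in the catalog properties file turns dynamic filtering off for that catalog:

```properties
dynamic-filtering.enabled=false
```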

Starburst Cached Views#

The connector supports table scan redirection to improve performance and reduce load on the data source.