Starburst Enterprise and Kubernetes #

Kubernetes (k8s) support for Starburst Enterprise platform (SEP) allows you to run your SEP clusters and additional components such as your Hive Metastore Service (HMS) or Apache Ranger. The features of k8s allow you to efficiently create, operate, and scale your clusters, and adapt them to your workload requirements. The Kubernetes support for SEP uses Helm charts.

Kubernetes platform services #

The following Kubernetes platform services are tested regularly and supported:

  • Amazon Elastic Kubernetes Service (EKS)
  • Google Kubernetes Engine (GKE)
  • Microsoft Azure Kubernetes Service (AKS)
  • Red Hat OpenShift

This topic focuses on k8s usage and practice that applies to all services.

Other Kubernetes distributions and installations can potentially work if the requirements are fulfilled, but they are not tested and not supported.

Available Helm charts #

The following Helm charts are available as part of the SEP k8s offering:

  • starburst-enterprise for SEP itself
  • starburst-hive for the Hive Metastore Service (HMS)
  • starburst-ranger for Apache Ranger
  • starburst-cache-service for the Starburst cache service

Whether you are new to Kubernetes or not, we strongly suggest that you first read the SEP Kubernetes cluster design section and the Helm chart customization best practices in this topic to learn how SEP uses Helm charts to build the configuration properties files it relies on.

Kubernetes cluster design for Starburst Enterprise #

SEP is built for performance, and by its nature it operates differently from typical applications running in Kubernetes.

Typically, an enterprise application comprises many stateless microservices, each of which can be run on a small instance. SEP's exceptional performance comes from its powerful query optimization engine, which expects all nodes to be identically sized for query planning. It also depends on each node having a large amount of memory, to allow parallel processing within a node as well as processing of large amounts of data per node.

Once work is divided up among worker nodes, it is not redirected if a node dies, as this would negate any performance gains. SEP coordinator and worker nodes are therefore stateful, and clusters by design rely on fewer, larger nodes.

Ideally, SEP runs in a namespace dedicated solely to it. Separate pods can be defined for worker and coordinator nodes in that namespace, and taints and tolerations can be defined for node selection, as sketched below.
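
For example, the following sketch pins the coordinator to a dedicated node pool. It assumes a hypothetical pool labeled and tainted with a starburstpool key; adjust the label and taint to match your cluster, and configure the worker node analogously:

  coordinator:
    # schedule the coordinator only on nodes carrying this label
    nodeSelector:
      starburstpool: coordinator
    # tolerate the matching taint on those nodes
    tolerations:
      - key: "starburstpool"
        operator: "Equal"
        value: "coordinator"
        effect: "NoSchedule"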

You must review the SEP Kubernetes requirements before you begin installing the Helm charts to ensure that you have the correct credentials in place and understand sizing requirements.

Configuring SEP with Helm charts #

SEP uses a number of configuration files internally that determine how it behaves:

  • etc/catalog/<catalog name>.properties
  • etc/config.properties
  • etc/jvm.config
  • etc/log.properties
  • etc/node.properties
  • etc/access-control.properties

With our Kubernetes deployment, these files are built using Helm charts, and are nested in the coordinator.etcFiles YAML node.
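
For example, a trimmed sketch of the etcFiles node, assuming the default chart layout; consult the SEP configuration properties reference for the exact structure in your chart version:

  coordinator:
    etcFiles:
      # contents of etc/jvm.config
      jvm.config: |
        -server
        -XX:+ExitOnOutOfMemoryError
      # each entry becomes an etc/*.properties file
      properties:
        config.properties: |
          query.max-memory=1PB
        node.properties: |
          node.environment=production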

Catalog properties files are also built using Helm charts. They are defined under the top-level catalogs: YAML node. The catalogs and properties you create depend on what data sources you connect to with SEP. You can configure as many as you need.
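
For example, a minimal catalog definition using the TPC-H connector, which also appears in the fuller catalog examples later in this topic:

  catalogs:
    tpch-testdata: |
      connector.name=tpch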

Customization best practices #

Every Helm-based deployment includes a values.yaml file, and SEP is no exception. This file contains the default values, any of which can be overridden.

Along with basic instance configuration for the various cloud platforms, values.yaml also includes the key-value pairs necessary to build the required *.properties files that configure SEP.

Our SEP configuration properties reference for Kubernetes contains the complete list of available configuration options.

A recommended set of customization files is described later in this topic, with guidance and examples for creating specific override files.

Default YAML files #

Default values are provided for a minimal configuration only, not including security or any catalog connector properties, as these vary by customer needs. The configuration in the Helm chart also contains deployment information such as registry credentials and instance size.

Each new release of SEP includes a new chart version, and the default values may change. For this reason, we highly recommend that you follow best practices and leave the values.yaml file in the chart untouched, overriding only very specific values as needed in one or more separate YAML files.

Using version control #

We very strongly recommend that you manage your customizations in a version control system such as a git repository. Manage each cluster and deployment in separate files.
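
For example, a repository layout along the following lines keeps each cluster's files separate; the names follow the recommended customization file set described later in this topic:

  sep-configs/
    registry-access.yaml
    sep-prod-catalogs.yaml
    sep-prod-setup.yaml
    sep-dev-catalogs.yaml
    sep-dev-setup.yaml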

Creating and using YAML files #

The following is a snippet of default values from the SEP values.yaml file embedded in the Helm chart:

  coordinator:
    resources:
      memory: "60Gi"
      requests:
        cpu: 16

The default of 60Gi of required memory is potentially more than any node in your cluster can provide. In that case, the default prevents your deployment from succeeding, because no node is suitable for scheduling the coordinator pod.

To create a customization that overrides the default size for a test cluster, copy only that section into a new file named sep-test-setup.yaml and make your changes, preserving the nesting structure above the values you override. The memory settings for workers have the same default values and need to be overridden as well:

  coordinator:
    resources:
      memory: "10Gi"
      requests:
        cpu: 2
  worker:
    resources:
      memory: "10Gi"
      requests:
        cpu: 2

Store the new file in a path accessible from the helm upgrade --install command.

When you are ready to install, specify the new file using the --values argument as in the following example. Replace 4XX.0.0 with the Helm chart version of the desired SEP release as documented on the versions page:

  helm upgrade my-sep-test-cluster starburstdata/starburst-enterprise \
    --install \
    --version 4XX.0.0 \
    --values ./registry-access.yaml \
    --values ./sep-test-setup.yaml

You can chain as many override files as you need. If a value appears in multiple files, the value in the last file specified takes precedence. Typically it is useful to limit the number of files as well as the size of the individual files. For example, it can be useful to create a separate file that contains all catalog definitions.
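
For example, assuming hypothetical files base.yaml and overrides.yaml that both set worker.replicas, the value from the last file specified wins:

  # base.yaml sets worker.replicas: 2, overrides.yaml sets worker.replicas: 6
  helm upgrade my-sep-test-cluster starburstdata/starburst-enterprise \
    --install \
    --version 4XX.0.0 \
    --values ./base.yaml \
    --values ./overrides.yaml
  # the cluster is deployed with six workers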

To view the built-in configuration of the Helm chart for a specific version of SEP, run the following command:

  helm template starburstdata/starburst-enterprise --version 4XX.0.0

Use this command with different version values to compare the configuration of different SEP releases as part of your upgrade process.
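
For example, a sketch of such a comparison, rendering two chart versions to files and diffing them; 4XX.0.0 and 4YY.0.0 stand in for the two versions you want to compare:

  helm template starburstdata/starburst-enterprise --version 4XX.0.0 > sep-current.yaml
  helm template starburstdata/starburst-enterprise --version 4YY.0.0 > sep-next.yaml
  diff sep-current.yaml sep-next.yaml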

To generate the specific configuration files for your deployment, use the template command with your additional values files:

  helm template starburstdata/starburst-enterprise \
    --version 4XX.0.0 \
    --values ./registry-access.yaml \
    --values ./sep-test-setup.yaml

The following recommended file set consists of a series of focused configuration files. If you have more than one cluster, such as a test cluster and a production cluster, name the files accordingly before you begin. Examples are provided in the sections that follow.

Recommended customization file set:

  • registry-access.yaml: Docker registry access credentials file, typically used to access the Docker registry on the Starburst Harbor instance. Include the registryCredentials: or imagePullSecrets: top-level node in this file to configure access to the Docker registry. This file can be used for all SEP, HMS, and Ranger configuration for all clusters you operate.
  • sep-prod-catalogs.yaml: Catalog configuration for all catalogs configured for SEP on the prod cluster. It is typically useful to separate catalog configuration into its own file to allow reuse across clusters, and to keep the large amount of catalog configuration apart from the rest of the cluster configuration.
  • sep-prod-setup.yaml: Main configuration file for the prod cluster. Include the configuration for all other top-level nodes that configure the coordinator, workers, and all other aspects of the cluster.

If you operate multiple clusters, create and manage additional configuration files while reusing the credentials file. For example, if you run dev and stage clusters, use the following additional files, combined on the command line as shown after this list:

  • sep-dev-catalogs.yaml
  • sep-dev-setup.yaml
  • sep-stage-catalogs.yaml
  • sep-stage-setup.yaml
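
A deployment for the dev cluster then combines these files with the shared credentials file, following the same pattern as the earlier helm upgrade example; the cluster name is illustrative:

  helm upgrade my-sep-dev-cluster starburstdata/starburst-enterprise \
    --install \
    --version 4XX.0.0 \
    --values ./registry-access.yaml \
    --values ./sep-dev-catalogs.yaml \
    --values ./sep-dev-setup.yaml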

There are several supporting services available for use with SEP, each with its own Helm chart:

  • Hive Metastore Service (HMS)
  • Apache Ranger
  • Starburst cache service

If you opt to use these services, you can create a configuration file for each of them per cluster as well:

Production prod cluster:

  • cache-service-prod.yaml
  • hms-prod.yaml
  • ranger-prod.yaml

Development dev cluster:

  • cache-service-dev.yaml
  • hms-dev.yaml
  • ranger-dev.yaml

Staging stage cluster:

  • cache-service-stage.yaml
  • hms-stage.yaml
  • ranger-stage.yaml

registry-access.yaml #

You can get started with a minimal file that only adds your credentials to the Starburst Harbor instance, as shown below:

  registryCredentials:
    enabled: true
    registry: harbor.starburstdata.net/starburstdata
    username: <yourusername>
    password: <yourpassword>

In the examples throughout our documentation, this file is named registry-access.yaml.
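
Alternatively, if you create the registry credentials yourself as a Kubernetes image pull secret, you can reference it with the imagePullSecrets: top-level node instead. This sketch assumes a secret named registry-credentials, created beforehand with kubectl create secret docker-registry:

  imagePullSecrets:
    # reference a pre-created docker-registry secret in the namespace
    - name: registry-credentials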

sep-prod-catalogs.yaml #

A catalog YAML file adds all the configurations for defining the catalogs and their connection details to the underlying data sources. The following snippet contains a few completely configured catalogs that are ready to use:

  • tpch-testdata exposes the TPC-H benchmark data useful for learning SQL and testing.
  • tmpmemory uses the Memory connector to provide a small temporary test ground for users.
  • metrics uses the JMX connector and exposes the internal metrics of SEP for monitoring and troubleshooting.
  • clientdb uses the Starburst PostgreSQL connector to access the clientdb database.
  • datalake and s3 are stubs of catalogs using the Starburst Hive connector, with an HMS and an AWS Glue catalog as metastore, respectively.

  catalogs:
    tpch-testdata: |
      connector.name=tpch
    tmpmemory: |
      connector.name=memory
    metrics: |
      connector.name=jmx
    clientdb: |
      connector.name=postgresql
      connection-url=jdbc:postgresql://postgresql.example.com:5432/clientdb
      connection-password=${ENV:PSQL_PASSWORD}
      connection-user=${ENV:PSQL_USERNAME}
    datalake: |
      connector.name=hive
      hive.metastore.uri=thrift://hive:9083
    s3: |
      connector.name=hive
      hive.metastore=glue
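
The clientdb catalog reads its database credentials from the PSQL_USERNAME and PSQL_PASSWORD environment variables. A sketch of providing these values as a Kubernetes secret, assuming a hypothetical secret name; the secret must then be exposed as environment variables on the coordinator and workers as described in the configuration reference for your chart version:

  kubectl create secret generic psql-credentials \
    --from-literal=PSQL_USERNAME=<yourusername> \
    --from-literal=PSQL_PASSWORD=<yourpassword>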

sep-prod-setup.yaml #

This example provides a minimal starting point as a best practice. It achieves the following:

  • environment: provides the name production for the environment, which becomes visible in the Web UI.
  • sharedSecret: sets a shared random secret string for communications between the coordinator and all workers. NOTE: This is different from the shared secret set for the license file with the kubectl create secret command.
  • replicas: configures the cluster to use four workers.
  • resources: adjusts the memory and CPU requirements for the workers and the coordinator. In this example, it increases the values beyond the defaults for use with more powerful servers.

  environment: production
  sharedSecret: AN0Qhhw9PsZmEgEXAMPLEkIj3AJZ5/Mnyy5iRANDOMceM+SSV+APSTiSTRING

  coordinator:
    resources:
      memory: "256Gi"
      requests:
        cpu: 32

  worker:
    replicas: 4
    resources:
      memory: "256Gi"
      requests:
        cpu: 32
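
With these files in place, deploying the prod cluster follows the same pattern as the earlier examples; the cluster name is illustrative:

  helm upgrade my-sep-prod-cluster starburstdata/starburst-enterprise \
    --install \
    --version 4XX.0.0 \
    --values ./registry-access.yaml \
    --values ./sep-prod-catalogs.yaml \
    --values ./sep-prod-setup.yaml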