Deploy Starburst Enterprise on Kubernetes #

Kubernetes support for Starburst Enterprise platform (SEP) allows you to run your SEP clusters and additional components, such as the Hive Metastore Service (HMS) or Apache Ranger. The features of k8s allow you to efficiently create, operate, and scale your clusters and adapt them to your workload requirements. The Kubernetes support for SEP uses Helm charts.

Kubernetes platform services #

The following Kubernetes platform services are tested regularly and supported:

  • Amazon Elastic Kubernetes Service (EKS)
  • Google Kubernetes Engine (GKE)
  • Microsoft Azure Kubernetes Service (AKS)
  • Red Hat OpenShift

This chapter covers all of the above, focusing on k8s usage that applies to all of these services.

Other Kubernetes distributions and installations can potentially work if the requirements are fulfilled, but they are not tested and not supported.

Available Helm charts #

The following Helm charts are available as part of the SEP k8s offering:

  • starburst-enterprise for the SEP coordinator and workers
  • starburst-hive for the Hive Metastore Service (HMS)
  • starburst-ranger for Apache Ranger

Whether you are new to Kubernetes or not, we strongly suggest that you first read about SEP Kubernetes cluster design and our best practices guide for Helm chart customization to learn how SEP uses Helm charts to build the configuration properties files it relies on.

Kubernetes cluster design for Starburst Enterprise #

SEP by its nature is built for performance. How it operates differs from other typical applications running in Kubernetes.

Typically, an enterprise application comprises many stateless microservices, each of which can run on a small instance. SEP’s exceptional performance comes from its powerful query optimization engine, which expects all nodes to be identically sized for query planning. It also depends on each node having a large amount of memory, to allow parallel processing within a node as well as processing of large amounts of data per node.

Once work is divided among worker nodes, it is not redirected if a node dies, as this would negate any performance gains. SEP coordinator and worker nodes are therefore stateful, and by design rely on fewer, larger nodes.

Ideally SEP runs within a namespace dedicated to it and it alone. Separate pods can be defined for worker nodes and coordinator nodes in that namespace, and taints and tolerations can be defined for node selection in SEP.
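
For example, a minimal sketch of pinning SEP to a dedicated node pool, assuming the chart exposes the standard Kubernetes nodeSelector and tolerations keys for the coordinator and workers, and a hypothetical node pool labeled and tainted with app=sep:

  coordinator:
    nodeSelector:
      app: sep                 # schedule only on nodes labeled app=sep
    tolerations:
      - key: "app"             # tolerate the taint that keeps other workloads off
        operator: "Equal"
        value: "sep"
        effect: "NoSchedule"
  worker:
    nodeSelector:
      app: sep
    tolerations:
      - key: "app"
        operator: "Equal"
        value: "sep"
        effect: "NoSchedule"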

You must review the SEP Kubernetes requirements before you begin installing the Helm charts to ensure that you have the correct credentials in place and understand sizing requirements.

Configuring SEP with Helm charts #

SEP uses a number of configuration files that determine how it behaves:

  • etc/catalog/<catalog name>.properties
  • etc/config.properties
  • etc/jvm.config
  • etc/log.properties
  • etc/node.properties
  • etc/access-control.properties

With our Kubernetes deployment, these files are built using Helm charts.

The necessary catalog properties files depend on which data sources you connect to with SEP. You can configure as many as you need. In our deployments, these configuration files are built from the values specified in the YAML configuration files.
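
For example, a catalog entry defined under the catalogs: top-level node is rendered into a corresponding properties file; a minimal sketch:

  catalogs:
    tpch: |
      connector.name=tpch

This creates the file etc/catalog/tpch.properties, containing the line connector.name=tpch, on the coordinator and every worker.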

Customization best practices #

As with any Helm-based deployment, SEP includes a values.yaml file that contains the default values, any and all of which can be overridden.

Along with basic instance configuration for the various cloud platforms, values.yaml also includes the key-value pairs necessary to build the required *.properties files that configure SEP.

Read our SEP configuration guide for Kubernetes to learn how the configuration options are structured.

The installation guide provides concrete recommendations for creating specific override files, with examples.

More information on YAML file handling is found in the following section of this page.

Default YAML files #

Default values are provided for a minimal configuration only, not including security or any catalog connector properties, as these vary by customer needs. The configuration in the Helm chart also contains deployment information such as registry credentials and instance size.

Each new release of SEP includes new charts, and the default values may change. For this reason, we highly recommend that you follow best practices and leave the values.yaml file in the chart untouched, overriding only very specific values as needed in one or more separate YAML files.

Using version control #

We very strongly recommend that you manage your customizations in a version control system. Manage each cluster and deployment in separate files.

Creating and using YAML files #

The following is a snippet of default values from the SEP values.yaml file embedded in the Helm chart:

  coordinator:
    resources:
      memory: "60Gi"
      requests:
        cpu: 16
      limits:
        cpu: 16

The default of 60Gi of required memory is potentially larger than the allocatable memory on any node in your cluster. In that case, the default prevents a successful deployment, because no node can accommodate the pod.
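
To check the allocatable resources of your nodes before installing, you can use kubectl, for example:

  kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory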

To create a customization that overrides the default size for a test cluster, copy only that section into a new file named sep-test-setup.yaml and make your changes, preserving the enclosing YAML structure above the overridden values. The memory settings for workers have the same default values and need to be overridden as well:

  coordinator:
    resources:
      memory: "10Gi"
      requests:
        cpu: 2
      limits:
        cpu: 2
  worker:
    resources:
      memory: "10Gi"
      requests:
        cpu: 2
      limits:
        cpu: 2

Store the new file in a path accessible from the helm upgrade --install command.

When you are ready to install, specify the new file using the --values argument as in the following example:

  helm upgrade my-sep-test-cluster starburstdata/starburst-enterprise \
    --install \
    --version 367.0.0 \
    --values ./registry-access.yaml \
    --values ./sep-test-setup.yaml

Make sure to use the Helm chart version that corresponds to the desired SEP release, as documented on the versions page.

You can chain as many override files as you need. If a value appears in multiple files, the value from the last specified file takes precedence. Typically it is useful to limit the number of files as well as the size of the individual files. For example, it can be useful to create a separate file that contains all catalog definitions.

The following describes a recommended set of focused configuration files. If you have more than one cluster, such as a test cluster and a production cluster, name the files accordingly before you begin. Examples are provided in the sections that follow.

Recommended customization file set:

  • registry-access.yaml: Docker registry access credentials file, typically to access the Docker registry on the Starburst Harbor instance. Include the registryCredentials: or imagePullSecrets: top-level node in this file to configure access to the Docker registry. This file can be used for all SEP, HMS, and Ranger configuration for all clusters you operate.
  • sep-prod-catalogs.yaml: Catalog configuration for all catalogs configured for SEP on the prod cluster. It is typically useful to separate catalog configuration into its own file to allow reuse across clusters, as well as to keep the large amount of catalog configuration apart from the rest of the cluster configuration.
  • sep-prod-setup.yaml: Main configuration file for the prod cluster. Include any configuration for all other top-level nodes that configure the coordinator, workers, and all other aspects of the cluster.

If you are operating multiple clusters, create and manage additional configuration files while reusing the credentials file. For example, if you run a dev and a stage cluster, use the following additional files:

  • sep-dev-catalogs.yaml
  • sep-dev-setup.yaml
  • sep-stage-catalogs.yaml
  • sep-stage-setup.yaml

If you are optionally implementing one or both of the Hive Metastore Service and Apache Ranger, you need to create a configuration file for each of these per cluster as well:

Production (prod) cluster:

  • hms-prod.yaml
  • ranger-prod.yaml

Development (dev) cluster:

  • hms-dev.yaml
  • ranger-dev.yaml

Staging (stage) cluster:

  • hms-stage.yaml
  • ranger-stage.yaml

registry-access.yaml #

You can get started with a minimal file that only adds your credentials to the Starburst Harbor instance:

  registryCredentials:
    enabled: true
    registry: harbor.starburstdata.net/starburstdata
    username: <yourusername>
    password: <yourpassword>
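
Alternatively, if you prefer to manage the registry credentials as a Kubernetes secret yourself, you can reference it with the imagePullSecrets: top-level node instead. A minimal sketch, assuming a hypothetical secret named regcred created ahead of time:

  kubectl create secret docker-registry regcred \
    --docker-server=harbor.starburstdata.net/starburstdata \
    --docker-username=<yourusername> \
    --docker-password=<yourpassword>

Then reference the secret in the YAML file:

  imagePullSecrets:
    - name: regcred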

sep-prod-catalogs.yaml #

A catalog YAML file adds all the configurations for defining the catalogs and their connection details to the underlying data sources. The following snippet contains a few completely configured catalogs that are ready to use:

  • tpch-testdata exposes the TPC-H benchmark data useful for learning SQL and testing.
  • tmpmemory uses the Memory connector to provide a small temporary test ground for users.
  • metrics uses the JMX connector and exposes the internal metrics of SEP for monitoring and troubleshooting.
  • clientdb uses the Starburst PostgreSQL connector to access the clientdb database.
  • datalake and s3 are stubs of catalogs using the Starburst Hive connector, with an HMS and a Glue catalog as their respective metastores.

  catalogs:
    tpch-testdata: |
      connector.name=tpch
    tmpmemory: |
      connector.name=memory
    metrics: |
      connector.name=jmx
    clientdb: |
      connector.name=postgresql
      connection-url=jdbc:postgresql://postgresql:5432/clientdb
      connection-password=${ENV:PSQL_PASSWORD}
      connection-user=${ENV:PSQL_USERNAME}
    datalake: |
      connector.name=hive
      hive.metastore.uri=thrift://hive:9083
    s3: |
      connector.name=hive
      hive.metastore=glue
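
The clientdb catalog reads its credentials from the PSQL_USERNAME and PSQL_PASSWORD environment variables, which must be available on the coordinator and all workers. One way to supply them is from a Kubernetes secret; a sketch assuming a hypothetical secret named psql-credentials with those two keys, and a chart version that supports the envFrom node:

  coordinator:
    envFrom:
      - secretRef:
          name: psql-credentials   # provides PSQL_USERNAME and PSQL_PASSWORD
  worker:
    envFrom:
      - secretRef:
          name: psql-credentials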

sep-prod-setup.yaml #

This example provides a minimal starting point as a best practice. It achieves the following:

  • environment: provides the name production for the environment, which becomes visible in the Web UI.
  • sharedSecret: sets a shared random secret string for communications between the coordinator and all workers. NOTE: This is different from the shared secret set for the license file with the kubectl create secret command.
  • replicas: configures the cluster to use four workers.
  • resources: adjusts the memory and CPU requirements for the workers and the coordinator. In this example, it increases the values for use with more powerful servers than the default.

  environment: production
  sharedSecret: AN0Qhhw9PsZmEgEXAMPLEkIj3AJZ5/Mnyy5iRANDOMceM+SSV+APSTiSTRING

  coordinator:
    resources:
      memory: "256Gi"
      requests:
        cpu: 32
      limits:
        cpu: 32

  worker:
    replicas: 4
    resources:
      memory: "256Gi"
      requests:
        cpu: 32
      limits:
        cpu: 32
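
When your files are ready, chain them in one helm upgrade --install invocation for the prod cluster, analogous to the test cluster example earlier on this page. The release name and chart version below are illustrative:

  helm upgrade my-sep-prod-cluster starburstdata/starburst-enterprise \
    --install \
    --version 367.0.0 \
    --values ./registry-access.yaml \
    --values ./sep-prod-catalogs.yaml \
    --values ./sep-prod-setup.yaml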