Configuring the Hive Metastore Service in Kubernetes#
The starburst-hive Helm chart configures the Hive Metastore Service (HMS) and, optionally, the backend database in the cluster, as detailed in the following sections. A number of Starburst Enterprise platform (SEP) connectors and features require an HMS.
Note
SEP supports the Hive Metastore Service (HMS) version 3.1.3. HMS version 4.x is not supported.
Use your registry credentials, and follow best practices by creating an override file for any changes to default values.
In addition to the configuration properties described in this document, you can also use the base Hive connector’s metastore configuration properties, the Thrift metastore configuration properties, and the AWS Glue metastore configuration properties as needed, depending on your environment.
Before you begin#
Get the latest starburst-hive Helm chart as described in our installation guide, with registry access configured.
Configure the Hive Metastore#
There are several top-level nodes in the HMS Helm chart that you must modify for a minimum HMS configuration:
- serviceAccountName
- resources
- database
- expose
- hiveMetastoreWarehouseDir
- hdfs
- objectStorage
As with SEP, we strongly suggest that you initially deploy HMS with the minimum configuration described in this topic and ensure that it deploys and is accessible before making any additional customizations described in our reference documentation.
Note
Store customizations in a separate file containing only changed values as
recommended in our best practices guide. In this topic for example,
customizations are stored in a file named hms-values.yaml
that is used in
the Helm upgrade
command.
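For illustration only, a minimal hms-values.yaml override might look like the following sketch. Every value shown is a placeholder to be replaced for your environment, and each node is described in the sections that follow:

serviceAccountName: "hms-service-account"
database:
  type: internal
expose:
  type: "clusterIp"
hiveMetastoreWarehouseDir: s3://example-bucket/warehouse
objectStorage:
  awsS3:
    region: us-east-1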
Configure resources and service account#
Ensure that the following top-level nodes of the Helm chart have the correct values to reflect your environment:
- serviceAccountName: we strongly recommend using a service account for the pod.
- resources: ensure that the CPU and memory sizes are appropriate for your instance type.
The default values for the resources node are as follows:

heapSizePercentage: 85
resources:
  requests:
    memory: "1Gi"
    cpu: 1
  limits:
    memory: "1Gi"
    cpu: 1
Caution
We strongly suggest leaving the heapSizePercentage
at the default value.
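For example, the following sketch doubles the default CPU and memory allocation while leaving heapSizePercentage untouched; size the values to match your instance type:

resources:
  requests:
    memory: "2Gi"
    cpu: 2
  limits:
    memory: "2Gi"
    cpu: 2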
Configure the PostgreSQL backend database#
The configuration properties for the internal PostgreSQL backend database are found in the database top-level node. As a minimum configuration, you must ensure that the following are set correctly for your environment:

database:
  internal:
    databaseName: hive
    databasePassword: HivePass1234
    databaseUser: hive
    driver: org.postgresql.Driver
    port: 5432
  type: internal
Note
For production usage, you can alternatively use an external backend database that you manage yourself.
The following table lists the available backend database configuration properties:

| Node name | Description |
| --- | --- |
| `database.type` | Set to `internal` to use a database in the cluster managed by the chart. |
| `database.internal.image` | Docker container images used for the PostgreSQL server |
| `database.internal.volume` | Storage volume to persist the database. The default configuration requests a new persistent volume (PV). |
| `database.internal.volume.persistentVolumeClaim` | The default configuration, which requests a new persistent volume (PV). |
| `database.internal.volume.existingVolumeClaim` | Alternative volume configuration, which uses an existing volume claim by referencing its name as the value in quotes. |
| `database.internal.volume.emptyDir` | Alternative volume configuration, which configures an empty directory on the pod. Keep in mind that a pod replacement loses the database content. |
| `database.internal.databaseName` | Name of the internal database |
| `database.internal.databaseUser` | User to connect to the internal database |
| `database.internal.databasePassword` | Password to connect to the internal database |
| `database.internal.envFrom` | YAML sequence of mappings to define a Secret or ConfigMap as a source of environment variables for the PostgreSQL container. |
| `database.internal.env` | YAML sequence of mappings, each with `name` and `value` keys, to define environment variables for the PostgreSQL container. |
Note
The database.resources node is separate from the top-level resources node. It defines the resources available to the backing database itself, not the HMS server.
Examples#
OpenShift deployments often do not have access to pull from the default Docker registry library/postgres. You can replace it with an image from the Red Hat registry, which requires additional environment variables set with the database.internal.env parameter:

database:
  type: internal
  internal:
    image:
      repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
      tag: "latest"
    env:
      - name: POSTGRESQL_DATABASE
        value: "hive"
      - name: POSTGRESQL_USER
        value: "hive"
      - name: POSTGRESQL_PASSWORD
        value: "HivePass1234"
Another option is to create a Secret (for example, postgresql-secret) containing the variables needed by PostgreSQL shown in the previous code block, and pass it to the container with the envFrom parameter:

database:
  type: internal
  internal:
    image:
      repository: "registry.redhat.io/rhscl/postgresql-96-rhel7"
      tag: "latest"
    envFrom:
      - secretRef:
          name: postgresql-secret
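A minimal sketch of such a Secret manifest follows, reusing the example values from the earlier code block; adjust the names and credentials for your environment:

apiVersion: v1
kind: Secret
metadata:
  name: postgresql-secret
type: Opaque
stringData:
  POSTGRESQL_DATABASE: "hive"
  POSTGRESQL_USER: "hive"
  POSTGRESQL_PASSWORD: "HivePass1234"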
External backend database for HMS#
This section shows the setup for using an external PostgreSQL, MySQL, Oracle, or Microsoft SQL Server database. You must provide the necessary details for the external server, and ensure that it can be reached from the k8s cluster pod. Set database.type to external and configure the connection properties:

database:
  type: external
  external:
    driver:
    jdbcUrl:
    password:
    setPasswordsViaEnvFrom: false
    user:
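For example, the following sketch configures an external PostgreSQL database; the hostname, database name, and credentials are placeholders:

database:
  type: external
  external:
    driver: org.postgresql.Driver
    jdbcUrl: jdbc:postgresql://postgres.example.com:5432/hive
    user: hive
    password: HivePass1234
    setPasswordsViaEnvFrom: false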
The following table lists the available external database configuration properties:

| Node name | Description |
| --- | --- |
| `database.type` | Set to `external` |
| `database.external.jdbcUrl` | JDBC URL to connect to the external database as required by the database and the driver in use, including hostname and port. Ensure you use a valid JDBC URL as required by the PostgreSQL, MySQL, Oracle, or SQL Server driver. Typically, the syntax requires the host, port, and database name, for example `jdbc:postgresql://<host>:<port>/<database>` for PostgreSQL. |
| `database.external.driver` | JDBC driver class name for the external database, for example `org.postgresql.Driver` for PostgreSQL. |
| `database.external.user` | Database user name to access the external database using JDBC. |
| `database.external.password` | Password for the user configured to access the external database using JDBC. |
Expose the pod to the outside network#
The expose section of the starburst-hive Helm chart works similarly to the SEP server expose section. Differences are isolated to the configured default values. Additionally, ingress is not supported, because the HMS service uses the TCP protocol and not HTTP.
By default, the HMS is available at the hostname hive and port 9083. As a result, the default Thrift URL for the cluster is thrift://hive:9083.
Adapt your configured catalogs to use the correct Thrift URL. You can use the URL in any catalog:
catalog:
  datalake: |
    connector.name=hive
    hive.metastore.uri=thrift://hive:9083
The default type is clusterIp:

expose:
  type: "clusterIp"
  clusterIp:
    name: "hive"
    ports:
      http:
        port: 9083
You can also use nodePort:

expose:
  type: "nodePort"
  nodePort:
    name: "hive"
    ports:
      http:
        port: 9083
        nodePort: 30083
Additionally, you can use loadBalancer:

expose:
  type: "loadBalancer"
  loadBalancer:
    name: "hive"
    IP: ""
    ports:
      http:
        port: 9083
    annotations: {}
    sourceRanges: []
Configure storage#
Configure the connection from the Hive Metastore Service (HMS) to the Hadoop Distributed File System (HDFS) or to an object store using the following three top-level nodes:
- hiveMetastoreWarehouseDir
- hdfs
- objectStorage
Note
You must also configure the catalog with the appropriate credentials.
The default configuration for each of these properties is empty.
Add the location of your Hive metastore’s warehouse directory to the hiveMetastoreWarehouseDir node to enable the HMS to store metadata and gather statistics. In the hdfs node, add the hadoopUserName you use to connect to the warehouse directory.
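A sketch combining both nodes; the warehouse location and the user name are placeholders for your environment:

hiveMetastoreWarehouseDir: s3://example-bucket/warehouse
hdfs:
  hadoopUserName: hdfs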
There are several templates for configuring object storage in the objectStorage node. For example, you can define how to connect to S3:

objectStorage:
  awsS3:
    accessKey:
    endpoint:
    pathStyleAccess: false
    region:
    secretKey:
The Helm chart also includes templates for Azure, Azure Data Lake, and Google object storage.
Note
For AWS, you can provide access to S3 using secrets, or by using IAM credentials attached to the metastore pod.
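As a sketch of the IAM approach, one common pattern on EKS is IAM Roles for Service Accounts: annotate the service account used by the metastore pod with a role that grants S3 access. The service account name and role ARN below are placeholders:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: hms-service-account
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/hms-s3-access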
The following table lists the available storage configuration properties:

| Node name | Description |
| --- | --- |
| `hiveMetastoreWarehouseDir` | The location of your Hive metastore’s warehouse directory. |
| `hdfs.hadoopUserName` | User name for Hadoop HDFS access |
| `objectStorage.awsS3` | Configuration for AWS S3 access |
| `objectStorage.awsS3.region` | AWS region name |
| `objectStorage.awsS3.endpoint` | AWS S3 endpoint |
| `objectStorage.awsS3.accessKey` | Name of the access key for AWS S3 |
| `objectStorage.awsS3.secretKey` | Name of the secret key for AWS S3 |
| `objectStorage.awsS3.pathStyleAccess` | Enable path-style access for the S3 endpoint |
| `objectStorage.gs` | Configuration for Google Storage access |
| `objectStorage.gs.cloudKeyFileSecret` | Name of the secret with the file containing the access key to the cloud storage. |
| `objectStorage.azure` | Configuration for Microsoft Azure storage systems |
| `objectStorage.azure.abfs` | Configuration for Azure Blob Filesystem (ABFS) |
| `objectStorage.azure.abfs.authType` | Authentication type to access ABFS. Valid values are `accessKey` and `oauth`. |
| `objectStorage.azure.abfs.accessKey` | Configuration for access key authentication to ABFS |
| `objectStorage.azure.abfs.accessKey.accountName` | Name of the ABFS account to access |
| `objectStorage.azure.abfs.accessKey.accessKey` | Actual access key to use for ABFS access |
| `objectStorage.azure.abfs.oauth` | Configuration for OAuth authentication to ABFS |
| `objectStorage.azure.abfs.oauth.clientId` | Client identifier for OAuth authentication |
| `objectStorage.azure.abfs.oauth.secret` | Secret for OAuth |
| `objectStorage.azure.abfs.oauth.endpoint` | Endpoint URL for OAuth |
| `objectStorage.azure.wasb` | Configuration for Windows Azure Storage Blob (WASB) |
| `objectStorage.azure.wasb.storageAccountName` | Name of the storage account to use for WASB |
| `objectStorage.azure.wasb.accessKey` | Key to access WASB |
| `objectStorage.azure.adl` | Configuration for Azure Data Lake (ADL) |
| `objectStorage.azure.adl.oauth2` | Configuration for OAuth authentication to ADL |
| `objectStorage.azure.adl.oauth2.clientId` | Client identifier for OAuth access to ADL |
| `objectStorage.azure.adl.oauth2.credential` | Credential for OAuth access to ADL |
| `objectStorage.azure.adl.oauth2.refreshUrl` | Refresh URL for the OAuth access to ADL |

More information about these configuration options is available in the documentation for your storage system.
Metastore configuration for Avro#
To enable Avro tables when using Hive 3.x, you must add the following property definition to the Hive metastore configuration file hive-site.xml:

<property>
  <name>metastore.storage.schema.reader.impl</name>
  <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>
For more information about additional files, see Adding files.
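As a sketch, assuming the property is stored in a ConfigMap named hive-site-configmap, you could mount the file with the additionalVolumes node described later in this topic:

additionalVolumes:
  - path: /etc/hive/conf/hive-site.xml
    subPath: hive-site.xml
    volume:
      configMap:
        name: "hive-site-configmap"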
Configure TLS (optional)#
Note
This is separate from configuring TLS on SEP itself.
If your organization uses TLS, you can enable and configure your HMS to work with it. The most straightforward way to handle TLS is to terminate TLS at the load balancer or ingress, using a signed certificate. We strongly suggest this method, which requires no additional configuration in the HMS.
If you choose not to handle TLS using that method, you can instead configure it in the expose top-level node of the HMS Helm chart:

expose:
  type: "[clusterIp|nodePort|loadBalancer]"

The default type is clusterIp. For details on configuring each of these types, see exposing the pod to the outside network.
Additional settings#
Server start up configuration#
You can configure a startup shell script for the HMS using the following variables:
- initFile
- extraArguments
initFile#
Use initFile to pass a shell script to run before the HMS is launched. The content of the script must be an inline string. The original startup command is passed as the first argument; end the script with exec "$@" to run that command together with any additional arguments, or use exec "$1" to run it without them.
extraArguments#
Use extraArguments to pass a list of additional arguments to the initFile script.
The following example shows how you can use initFile and extraArguments to run a custom startup script. The initFile script must end with exec "$@":

initFile: |
  #!/bin/bash
  echo "Custom init for $2"
  exec "$@"
extraArguments:
  - TEST_ARG
Docker image and registry#
The Helm chart for the HMS uses a similar configuration for its Docker image and registry section as the Helm chart for SEP.
image:
  pullPolicy: "IfNotPresent"
  repository: "harbor.starburstdata.net/starburstdata/hive"
  tag: "3.1.3-e.3"
registryCredentials:
  enabled: false
  password:
  registry:
  username:
imagePullSecrets:
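For example, a sketch that enables registry credentials; the user name and password are placeholders:

registryCredentials:
  enabled: true
  registry: harbor.starburstdata.net
  username: example-user
  password: example-password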
Additional volumes#
Additional volumes may be necessary for persisting files. Define them in additionalVolumes; none are defined by default:

additionalVolumes: []

You can add one or more volumes supported by k8s to all nodes in the cluster. Specify a path to create a directory to store the keys for your ConfigMap or Secret.
You may also specify an optional subPath parameter, which takes a key in the ConfigMap or Secret volume you create. If you specify subPath, the key with that name from your ConfigMap or Secret is mounted as a file within the directory specified in path.
additionalVolumes:
  - path: /mnt/InContainer
    volume:
      emptyDir: {}
  - path: /etc/hive/conf/test_config.txt
    subPath: test_config.txt
    volume:
      configMap:
        name: "configmap-in-volume"
Node assignment#
You can configure your cluster to use a specific node and pod for the HMS:
nodeSelector: {}
tolerations: []
affinity: {}
Our SEP configuration documentation contains examples and resources to help you configure these YAML nodes.
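For example, a sketch that pins the HMS to dedicated, tainted nodes; the label and taint names are placeholders for your cluster:

nodeSelector:
  starburstpool: hms
tolerations:
  - key: "starburstpool"
    operator: "Equal"
    value: "hms"
    effect: "NoSchedule"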
Annotations#
You can annotate your deployment or pods using the following variables:
- deploymentAnnotations
- podAnnotations
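A sketch with hypothetical annotation keys:

deploymentAnnotations:
  example.com/owner: data-platform
podAnnotations:
  example.com/log-format: json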
Security context#
You can optionally configure security contexts to specify privileges and access control settings for your HMS pods.
securityContext:
If you do not want to set the securityContext for the default service account, you can restrict it by configuring a separate service account for the HMS pod.
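A minimal sketch, assuming you want the HMS process to run as a specific non-root user and group:

securityContext:
  runAsUser: 1000
  runAsGroup: 1000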
Environment variables#
You can pass environment variables to your HMS container using the same variables as the internal database:
envFrom: []
env: []
Both variables are specified as mapping sequences. For example:
envFrom:
  - secretRef:
      name: my-secret-with-vars
env:
  - name: MY_VARIABLE
    value: some-value