Coordinator high availability#

Note

Starburst Control Plane was previously named Starburst Portal.

Starburst Control Plane provides coordinator high availability (HA) for SEP clusters. When a coordinator fails, Starburst Control Plane automatically promotes a healthy standby coordinator to handle the traffic.

Note

Coordinator high availability is a private preview feature. Contact Starburst Support with questions or feedback.

Overview#

In a standard SEP deployment, each cluster has a single coordinator that acts as the entry point for all queries. If this coordinator becomes unavailable, queries fail until you restore the coordinator.

High availability maintains multiple coordinators for each cluster:

  • One coordinator is active and handles queries.

  • The other coordinators are on standby.

  • If the active coordinator fails, Starburst Control Plane promotes a standby coordinator to handle traffic.

Requirements#

Before you enable coordinator high availability, ensure your deployment meets the following requirements:

  • Multiple coordinators: Deploy at least two coordinator instances per environment with the same size and location.

  • Environment property: All nodes in a cluster must share the same node.environment value.

  • Network connectivity: Bidirectional network access between Starburst Control Plane and all coordinators is required for announcements and health monitoring.

  • Consistent configuration: All coordinators in an environment must have identical configurations, including catalogs, connectors, and security settings.

Coordinator roles#

Starburst Control Plane assigns each coordinator one of the following roles:

  • ACTIVE: The coordinator currently processing queries.

  • STANDBY: A healthy backup coordinator ready for promotion if the active coordinator fails.

  • DECOMMISSIONED: A coordinator removed from the HA pool. Restart the coordinator to return it to the pool.

  • GONE: A stale coordinator record whose URI now serves a different coordinator instance. This occurs when a coordinator restarts and receives a new instance identifier. The stale record is excluded from HA management. The new coordinator instance registers itself automatically as STANDBY.

Health status#

Starburst Control Plane assigns one of the following health statuses to each coordinator:

  • HEALTHY: The last health check succeeded.

  • UNKNOWN: Health check data is older than 30 seconds, or the coordinator has failed at least one but fewer than three consecutive health checks over 10 seconds.

  • NOT_HEALTHY: The coordinator has failed three or more consecutive health checks over 10 seconds.

When a network issue occurs, health checks fail immediately and the coordinator status changes to UNKNOWN. After three or more consecutive failures over 10 seconds, the status changes to NOT_HEALTHY and Starburst Control Plane triggers a failover.
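The status rules above can be sketched as a small function. This is an illustration only, not the Starburst Control Plane implementation: the names `classify`, `Health`, `STALE_AFTER_SECONDS`, and `FAILURE_THRESHOLD` are hypothetical, and the sketch assumes that stale data takes precedence over the failure counter, which the rules above do not state.

```python
from enum import Enum

STALE_AFTER_SECONDS = 30  # health data older than this is treated as UNKNOWN
FAILURE_THRESHOLD = 3     # consecutive failures that produce NOT_HEALTHY


class Health(Enum):
    HEALTHY = "HEALTHY"
    UNKNOWN = "UNKNOWN"
    NOT_HEALTHY = "NOT_HEALTHY"


def classify(consecutive_failures: int, seconds_since_last_check: float) -> Health:
    """Map a failure counter and the age of the last check to a health status."""
    # Assumption: stale data wins over whatever the counter says.
    if seconds_since_last_check > STALE_AFTER_SECONDS:
        return Health.UNKNOWN
    if consecutive_failures >= FAILURE_THRESHOLD:
        return Health.NOT_HEALTHY
    if consecutive_failures > 0:
        # Failing, but still below the failover threshold.
        return Health.UNKNOWN
    return Health.HEALTHY


for failures in range(4):
    print(failures, classify(failures, seconds_since_last_check=5).value)
```

With a recent check, zero failures map to HEALTHY, one or two to UNKNOWN, and three to NOT_HEALTHY.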

Configuration#

Use the following sections to configure high availability for Starburst Control Plane and your SEP clusters.

Configure SEP coordinators#

When you enable coordinator high availability, all SEP coordinators and workers send announcements to Starburst Control Plane. Starburst Control Plane uses these announcements to:

  • Register coordinators

  • Track coordinator availability and URI

  • Proxy worker announcements to the active coordinator

Note

The minimum required high availability configuration consists of one active coordinator and one standby coordinator.

To configure coordinators to send announcements to Starburst Control Plane, set the following property in each coordinator’s config.properties file:

discovery.uri=https://portal.example.com

Use an internal hostname or private IP address for discovery.uri to keep announcements and health checks on your local network.

Note

Coordinators and workers must use the same discovery.uri value.

Configure SEP workers#

To configure workers to send announcements to Starburst Control Plane, set the following property in each worker’s config.properties file:

discovery.uri=https://portal.example.com

Use an internal hostname or private IP address for discovery.uri to keep announcements and health checks on your local network.

Note

Coordinators and workers must use the same discovery.uri value.

Starburst Control Plane forwards worker announcements to the active coordinator. During failover, Starburst Control Plane automatically redirects workers to the new active coordinator.
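Because a mismatched discovery.uri value is easy to miss across many nodes, a pre-deployment check can help. The following sketch groups nodes by the value they set for a property. The function names, the node names, and the hostname portal.internal.example.com are hypothetical. The parser handles only simple key=value lines, not the full Java properties syntax.

```python
from collections import defaultdict


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def values_by_node(configs, key="discovery.uri"):
    """Group node names by the value each node sets for `key`."""
    groups = defaultdict(list)
    for node, text in configs.items():
        groups[parse_properties(text).get(key)].append(node)
    return dict(groups)


# Hypothetical contents of each node's config.properties file.
configs = {
    "coordinator-1": "discovery.uri=https://portal.example.com\n",
    "coordinator-2": "discovery.uri=https://portal.example.com\n",
    "worker-1": "discovery.uri=https://portal.internal.example.com\n",
}

groups = values_by_node(configs)
if len(groups) > 1:
    print("discovery.uri differs between nodes:", groups)
```

The same helper can check other properties that must match across a cluster by passing a different `key`, for example `node.environment` when the input is each node's node.properties content.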

Configure environments#

Each cluster you connect to Starburst Control Plane must have a unique node.environment value. This property identifies which cluster each node belongs to. All nodes in the same cluster (coordinators and workers) must use the same value.

Starburst Control Plane uses the node.environment value to:

  • Group nodes: When a node announces itself, Starburst Control Plane reads the X-Trino-Environment HTTP header (derived from node.environment) to determine which cluster it belongs to.

  • Route queries: Environments map to routing groups in Starburst Control Plane. When you configure a routing group with a specific environment, queries routed to that group are sent to the active coordinator for that environment.

  • Isolate HA state: Each environment maintains its own active coordinator independently. A failover in one environment does not affect other environments.
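The three uses of node.environment listed above can be pictured with a small in-memory model. This is a hypothetical sketch for intuition, not the product's data model: the class `ControlPlaneState`, its methods, and all URIs and group names are invented for the example. Only the X-Trino-Environment header name comes from the text above.

```python
class ControlPlaneState:
    """Hypothetical model of per-environment grouping, routing, and HA state."""

    def __init__(self, routing_groups):
        self.routing_groups = routing_groups  # routing group name -> environment
        self.nodes = {}                       # environment -> set of node URIs
        self.active = {}                      # environment -> active coordinator URI

    def announce(self, headers, node_uri):
        """Group nodes: file the announcing node under its environment."""
        env = headers["X-Trino-Environment"]
        self.nodes.setdefault(env, set()).add(node_uri)
        return env

    def promote(self, env, coordinator_uri):
        """Isolate HA state: changing one environment leaves the others alone."""
        self.active[env] = coordinator_uri

    def route(self, routing_group):
        """Route queries: resolve a routing group to its active coordinator."""
        return self.active.get(self.routing_groups[routing_group])


state = ControlPlaneState({"etl": "prod", "adhoc": "dev"})
state.announce({"X-Trino-Environment": "prod"}, "https://prod-coord-a:8443")
state.announce({"X-Trino-Environment": "prod"}, "https://prod-coord-b:8443")
state.announce({"X-Trino-Environment": "dev"}, "https://dev-coord-a:8443")
state.promote("prod", "https://prod-coord-a:8443")
state.promote("dev", "https://dev-coord-a:8443")

# A failover in "prod" does not change where "dev" queries are routed.
state.promote("prod", "https://prod-coord-b:8443")
print(state.route("etl"), state.route("adhoc"))
```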

Operations#

Use the following steps to manage coordinators in a high availability deployment.

Add a standby coordinator#

To add a standby coordinator:

  1. Deploy a SEP coordinator with the same environment configuration.

  2. Configure the coordinator to send announcements to Starburst Control Plane.

  3. The coordinator automatically registers as STANDBY.

  4. Health monitoring starts automatically.

Remove a coordinator#

To remove a coordinator from the HA pool:

  1. Ensure at least one other healthy coordinator is running in the environment.

  2. Stop the SEP coordinator (for example, by scaling down a Kubernetes pod or stopping the service).

The coordinator record remains in the database with UNKNOWN status. To return the coordinator to the HA pool, restart the coordinator. Once health checks succeed, the status changes to HEALTHY and the coordinator resumes in the STANDBY role.

Trigger manual failover#

To manually trigger a failover, stop the active SEP coordinator by scaling down a Kubernetes pod or stopping the service. Starburst Control Plane detects the failure and promotes a healthy standby.

Coordinator lifecycle#

The following sections describe how Starburst Control Plane registers coordinators, monitors coordinator health, and handles failover.

Coordinator registration#

When you start a SEP coordinator, it periodically sends announcements to Starburst Control Plane. Starburst Control Plane processes these announcements as follows:

  1. Retrieves the coordinator’s server info

  2. Registers the coordinator with a STANDBY role

  3. Records the coordinator’s URI and environment

  4. Begins health monitoring for the coordinator

Worker nodes also send announcements to Starburst Control Plane. Starburst Control Plane proxies these announcements to the active coordinator. This ensures worker nodes connect to the active coordinator during failover.

Health monitoring#

Starburst Control Plane monitors all your registered coordinators:

  • Runs health checks every five seconds

  • Verifies coordinators are reachable and ready to handle queries

  • Increments a failure counter after each failed check and resets the counter after each successful check

A health check fails when a coordinator:

  • Is unreachable (connection refused, timeout, network error)

  • Returns an HTTP status code other than 200

  • Reports it is not ready to handle queries

During each health check, Starburst Control Plane also verifies that the instance identifier returned by the coordinator matches the stored identifier. If the identifiers differ, the coordinator at that URI has been replaced by a new instance (for example, after a pod restart). Starburst Control Plane immediately marks the stale record as GONE. The new coordinator instance at the same URI announces itself and registers as a fresh STANDBY.
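The bookkeeping described in this section can be sketched as one function that applies a single health check result to a coordinator record. The types `ProbeResult` and `CoordinatorRecord` and the function `apply_probe` are hypothetical names chosen for illustration. The sketch does not model the follow-up announcement that registers the new instance as a STANDBY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Hypothetical outcome of one health check request."""
    reachable: bool
    http_status: Optional[int] = None
    ready: bool = False
    instance_id: Optional[str] = None


@dataclass
class CoordinatorRecord:
    uri: str
    instance_id: str
    role: str = "STANDBY"
    consecutive_failures: int = 0


def apply_probe(record: CoordinatorRecord, probe: ProbeResult) -> CoordinatorRecord:
    """Update the failure counter, or mark the record GONE on an identifier mismatch."""
    if probe.instance_id is not None and probe.instance_id != record.instance_id:
        # A different instance now answers at this URI, so the record is stale.
        record.role = "GONE"
        return record
    succeeded = probe.reachable and probe.http_status == 200 and probe.ready
    if succeeded:
        record.consecutive_failures = 0
    else:
        record.consecutive_failures += 1
    return record


record = CoordinatorRecord(uri="https://coord-a:8443", instance_id="id-1")
apply_probe(record, ProbeResult(reachable=False))  # unreachable
apply_probe(record, ProbeResult(True, 503, False, "id-1"))  # non-200 response
print(record.consecutive_failures)
apply_probe(record, ProbeResult(True, 200, True, "id-1"))  # success resets counter
print(record.consecutive_failures)
```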

Automatic failover#

When the monitor detects three or more consecutive health check failures over 10 seconds, the coordinator status changes to NOT_HEALTHY and Starburst Control Plane initiates failover:

  1. Starburst Control Plane selects a healthy coordinator with a STANDBY role

  2. The failed coordinator is marked as DECOMMISSIONED

  3. The selected coordinator is promoted from STANDBY to ACTIVE

  4. Query traffic automatically routes to the new active coordinator

Note

If no active coordinator exists (for example, during initial startup), Starburst Control Plane selects a healthy standby and promotes it to ACTIVE.
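The failover steps and the startup case in the note can be sketched as follows. This is a simplified illustration under several assumptions that the text above does not confirm: coordinators are plain dictionaries, the first healthy standby in the list is selected, nothing changes when no healthy standby exists, and at most one coordinator is active. The function name `reconcile` is hypothetical.

```python
def reconcile(coordinators):
    """Promote a healthy standby if the active coordinator failed or none exists.

    Returns the URI of the promoted coordinator, or None if nothing changed.
    """
    active = [c for c in coordinators if c["role"] == "ACTIVE"]
    failed = [c for c in active if c["health"] == "NOT_HEALTHY"]
    if active and not failed:
        return None  # the active coordinator is fine

    # Step 1: select a healthy standby (assumption: first match wins).
    candidate = next(
        (c for c in coordinators
         if c["role"] == "STANDBY" and c["health"] == "HEALTHY"),
        None,
    )
    if candidate is None:
        return None  # assumption: no change without a healthy standby

    # Step 2: mark the failed coordinator as DECOMMISSIONED.
    for c in failed:
        c["role"] = "DECOMMISSIONED"
    # Step 3: promote the selected standby.
    candidate["role"] = "ACTIVE"
    return candidate["uri"]


cluster = [
    {"uri": "https://coord-a:8443", "role": "ACTIVE", "health": "NOT_HEALTHY"},
    {"uri": "https://coord-b:8443", "role": "STANDBY", "health": "HEALTHY"},
]
print(reconcile(cluster))
print([c["role"] for c in cluster])
```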

Multiple active coordinators#

If multiple coordinators become active simultaneously:

  1. The monitor detects multiple active coordinators

  2. Starburst Control Plane selects one healthy coordinator to remain active

  3. All other active coordinators are demoted to DECOMMISSIONED
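The resolution steps above can be sketched in a few lines. The function name `resolve_multiple_active` is hypothetical, and so is the tie-breaking rule: the sketch keeps the first healthy active coordinator in the list, while the criterion that Starburst Control Plane actually uses is not described here. If no active coordinator is healthy, the sketch changes nothing.

```python
def resolve_multiple_active(coordinators):
    """Keep one healthy active coordinator and decommission the other active ones.

    Returns the URIs of the coordinators that were demoted.
    """
    active = [c for c in coordinators if c["role"] == "ACTIVE"]
    if len(active) <= 1:
        return []  # nothing to resolve

    # Assumption: the first healthy active coordinator stays active.
    keeper = next((c for c in active if c["health"] == "HEALTHY"), None)
    if keeper is None:
        return []  # assumption: leave this case to the regular failover logic

    demoted = []
    for c in active:
        if c is not keeper:
            c["role"] = "DECOMMISSIONED"
            demoted.append(c["uri"])
    return demoted


cluster = [
    {"uri": "https://coord-a:8443", "role": "ACTIVE", "health": "HEALTHY"},
    {"uri": "https://coord-b:8443", "role": "ACTIVE", "health": "HEALTHY"},
    {"uri": "https://coord-c:8443", "role": "STANDBY", "health": "HEALTHY"},
]
print(resolve_multiple_active(cluster))
```

As the roles table states, a coordinator demoted this way returns to the HA pool only after a restart.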

Troubleshooting#

If you experience issues with coordinator high availability, check the following common issues.

No failover after coordinator failure#

If the active coordinator is down but no standby has been promoted:

  • Verify standby coordinators are running, reachable, and in the STANDBY state.

  • Check Starburst Control Plane logs for health check failure messages.

  • Ensure standby coordinators have successful recent health checks.

  • Confirm health check failures have reached the threshold (three consecutive failures over 10 seconds).

Queries fail during failover#

There is a brief window during failover where no active coordinator exists. Queries you submit during this time may fail. Configure your client application to retry failed queries.
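A client-side retry can be as simple as the following sketch, which retries with exponential backoff. The function `run_with_retry` and its parameters are hypothetical, and `ConnectionError` stands in for whatever exception your client library raises when a query fails during failover. Retry only queries that are safe to run more than once.

```python
import time


def run_with_retry(submit, attempts=5, initial_delay=1.0, max_delay=10.0,
                   sleep=time.sleep):
    """Call `submit` until it succeeds or `attempts` is exhausted.

    `submit` is any zero-argument callable that runs the query and returns
    its result. The delay doubles after each failure, up to `max_delay`.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Hypothetical query function that fails twice, as it might during failover.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("no active coordinator")
    return "query result"


print(run_with_retry(flaky_query, sleep=lambda seconds: None))
```

With the default values, the waits before the second through fifth attempts are 1, 2, 4, and 8 seconds, which adds up to 15 seconds in total.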

Multiple coordinators marked as DECOMMISSIONED#

If all coordinators except one are in the DECOMMISSIONED state:

  • Review Starburst Control Plane logs for events that report multiple active coordinators.

  • Restart coordinators in the DECOMMISSIONED state to return them to the HA pool as standbys.

  • Check for network issues between Starburst Control Plane and the coordinators.

Coordinator is marked as GONE#

A coordinator is marked as GONE when the health check detects that the URI for a registered coordinator now responds with a different instance identifier. This typically happens after a coordinator restarts and is assigned a new instance identifier while reusing the same URI (for example, after a Kubernetes pod restart without a persistent identity).

The stale GONE record is kept in the database for audit purposes but is excluded from HA management. No action is required. The new coordinator instance at the same URI registers itself as a STANDBY automatically once it sends an announcement to Starburst Control Plane.

Health checks show UNKNOWN status#

If a coordinator’s health status is UNKNOWN:

  • Check for network issues between Starburst Control Plane and the coordinators.

  • Confirm the coordinator’s /v1/info/state endpoint is accessible.

  • Check Starburst Control Plane logs for health check failure messages.

Limitations#

  • Coordinator high availability operates at the environment level. Each cluster independently maintains its own active coordinator.

  • Starburst Control Plane does not retry running queries when a coordinator fails. Configure your client application to retry failed queries.

  • Failover takes approximately 10 to 15 seconds because Starburst Control Plane waits for three consecutive health check failures before promoting a standby. If a coordinator becomes unresponsive without disconnecting, failover can take up to two minutes by default. To shorten this time, lower the value of clusters.http-client.request-timeout.

  • To return a decommissioned coordinator to the HA pool, restart the coordinator.

  • When the SEP Helm chart is deployed with internalTls: true, the SEP init container automatically overrides any discovery.uri value you set in config.properties with https://{node.environment}:{httpsPort} (for example, https://mycluster:8443). Setting discovery.uri in your Helm values has no effect in this configuration. To use coordinator high availability with internalTls: true, you must make Starburst Control Plane accessible at the address that the init container generates. The Helm chart does not currently support this, so it requires custom setup.