Configure fault tolerance for your cluster#

You can configure SEP to be more resilient against query failure by enabling fault-tolerant execution. This allows SEP to handle larger queries such as batch operations without worker node interruptions causing the query to fail.

When configured, the SEP cluster buffers data used by workers during query processing. If a worker node fails for any reason, such as a network outage or running out of available resources, the coordinator issues the data to an available worker that can pick up query processing, allowing the query to continue.

Architecture#

The coordinator node uses a configured exchange manager service that buffers data for query processing in an external location such as an S3 bucket. Worker nodes send data to the buffer as they execute their query tasks.

FTE s3 exchange architecture

Best practices and considerations#

A fault-tolerant cluster is best suited for large batch queries. Fault-tolerant execution of a high volume of short-running may experience increased latency. As such, it is recommended to run a dedicated fault-tolerant SEP cluster for handling batch operations, separate from a cluster that is designated for a higher query volume.

Support for fault-tolerant execution of SQL statements varies on a per-connector basis, with more details in the documentation for each connector. See Fault-tolerant execution for a table detailing per-connector support.

Catalogs for other data sources only support fault-tolerant execution of read operations. The Starburst Teradata Direct connector only supports fault-tolerant execution of read operations with a QUERY retry policy.

When fault-tolerant execution is enabled on a cluster, write operations fail for any catalogs that do not support fault-tolerant execution of those operations.

The exchange manager may send a large amount of data to the exchange storage, resulting in high I/O load on that storage. You can configure multiple storage locations for use by the exchange manager to help balance the I/O load between them.

Configuration#

The following steps describe how to configure a SEP cluster for fault-tolerant execution with a S3 bucket-based exchange:

  1. Set up a S3 bucket to use as the exchange storage. For this example we are using an AWS S3 bucket, but other storage options are described in the reference documentation as well. You can use multiple S3 buckets for exchange storage.

    For each bucket in AWS, collect the following information:

    • S3 URI location for the bucket, such as s3://exchange-spooling-bucket

    • Region that the bucket is located in, such as us-west-1

    • AWS access and secret keys for the bucket

  2. Add the following exchange manager configuration in both the coordinator.etcFiles.properties and worker.etcFiles.properties sections of the Helm chart in a new exchange-manager.properties: entry, using the gathered S3 bucket information:

    coordinator:
      etcFiles:
        properties:
        ...
          exchange-manager.properties: |
            exchange-manager.name=filesystem
            exchange.base-directories=s3://exchange-spooling-bucket-1,s3://exchange-spooling-bucket-2
            exchange.s3.region=us-west-1
            exchange.s3.aws-access-key=example-access-key
            exchange.s3.aws-secret-key=example-secret-key
    ...
    worker:
      etcFiles:
        properties:
        ...
          exchange-manager.properties: |
            exchange-manager.name=filesystem
            exchange.base-directories=s3://exchange-spooling-bucket-1,s3://exchange-spooling-bucket-2
            exchange.s3.region=us-west-1
            exchange.s3.aws-access-key=example-access-key
            exchange.s3.aws-secret-key=example-secret-key
    

    In non-Kubernetes installations, the same properties must be defined in the etc/exchange-manager.properties configuration file on the coordinator and all worker nodes.

  3. Add the following configuration for fault-tolerant execution in both the coordinator.additionalProperties: and worker.additionalProperties: sections of the Helm chart:

    coordinator:
      additionalProperties:
        ...
        retry-policy=TASK
        exchange.compression-enabled=true
        query.low-memory-killer.delay=0s
    ...
    worker:
      additionalProperties:
        ...
        retry-policy=TASK
        exchange.compression-enabled=true
        query.low-memory-killer.delay=0s
    

    In non-Kubernetes installations, the same properties must be defined in the config.properties file on the coordinator and all worker nodes.

  4. Re-deploy your instance of SEP or, for non-Kubernetes installations, restart the cluster.

Your SEP cluster is now configured with fault-tolerant query execution. If a query run on the cluster would normally fail due to an interruption of query processing, fault-tolerant execution now resumes the query processing to ensure successful execution of the query.

Next steps#

For more information about fault-tolerant execution, including simple query retries that do not require an exchange manager and advanced configuration operations, see the reference documentation.