
How RHOCP 4.17 enhances control plane resilience

Deploying more control nodes in Red Hat OpenShift Container Platform 4.17

February 12, 2025
Mark Schmitt
Related topics: Containers, Observability, System Design, Virtualization
Related products: Red Hat OpenShift Container Platform, Red Hat OpenShift Virtualization

    Red Hat OpenShift Container Platform 4.17 introduces a new non-standard control plane high-availability option, enabled by new etcd capabilities and API server optimizations. The standard configuration (prior to 4.17), which requires two of three control plane nodes to maintain etcd quorum, is still an option in 4.17. What's new is the ability to deploy four or five control plane nodes to enhance resiliency. This option is available only for bare metal environments.

    This article explains the new feature in OpenShift Container Platform 4.17 that allows a cluster to utilize four or five control nodes to enhance control plane resilience.

    Enhanced control plane resilience in OpenShift Container Platform 4.17

    Some customers have hard requirements for active-active deployments across two locations, with support for stateful traditional applications (e.g., Red Hat OpenShift Virtualization virtual machines (VMs) that can only run as a single instance). These workloads depend on the underlying infrastructure to provide availability, and the use cases are common when deploying VMs on traditional virtualization stacks. Prior to 4.17, an OpenShift cluster supporting these scenarios was deployed as a stretched (spanned) cluster with a control plane distribution of 2+1 or 1+1+1. If the data center hosting the majority of control plane nodes fails, the surviving control plane node becomes the only node holding the latest configuration and state of all objects and resources on the cluster.

    The recovery procedure for this configuration in a disaster scenario requires the single surviving node to become read-write while holding the only copy of the etcd database. Should that node also fail, the failure is catastrophic. The risk is even more acute when OpenShift Virtualization is hosting stateful VMs. To increase resiliency and reduce risk during this type of failure, RHOCP 4.17 extends the supported number of control plane nodes to allow 2+2 and 3+2 deployments. With these layouts, a failure of the site hosting the majority of nodes still leaves two read-only copies of etcd in the surviving location, providing higher assurance that the cluster can be recovered. Recovering a control plane node is currently a manual process, but there are plans to automate this procedure in a future release.

    To use this feature today, deploy a cluster with three control plane nodes and scale up on day two with the required configuration. There are plans for the agent-based installer to enable this configuration on day one in a future release.
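    After a day-two scale-up, you can verify the control plane size and etcd membership from the CLI. This is a minimal sketch using standard oc commands; the etcd pod name below is a hypothetical example, so substitute one from your own cluster:

    ```shell
    # List control plane nodes; after a day-two scale-up you should see 4 or 5.
    oc get nodes -l node-role.kubernetes.io/master

    # Confirm all members have joined the etcd cluster
    # (the pod name etcd-master-0 is an example; list pods in
    # the openshift-etcd namespace to find a real one).
    oc rsh -n openshift-etcd etcd-master-0 \
      etcdctl member list -w table
    ```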

    FAQs

    You may have questions about stretched clusters in RHOCP 4.17, such as:

    • Question: Were stretched clusters supported prior to RHOCP 4.17?  

      Answer: Yes, but it was not recommended (and don't tell Redbeard, who illustrates the complications of a stretched three-node cluster). No support exception is required. For smaller discrete clusters, stretching the control plane is neither recommended nor supported.

    • Question: What about latency between the stretched control plane data centers? While there is a recommendation of less than 5ms between data center control plane nodes, there's no Red Hat OpenShift (or Kubernetes) limitation that dictates this latency number. Less than 5ms is a lofty goal, so what happens when/if the latency creeps up beyond that?  

      Answer: The most common issues with etcd are caused by slow storage, CPU overload, etcd database size growth, and latency between control plane nodes. Applying an etcd request should normally take less than 50 milliseconds. If the average apply duration exceeds 100 milliseconds, etcd will warn that entries are taking too long to apply ("took too long" messages appear in the logs). The 100ms figure comes from the default value of the profile parameter ETCD_HEARTBEAT_INTERVAL; it can be raised to 500ms by switching to the slower profile (primarily intended for AWS and Azure deployments).

      Modifying the heartbeat profile only prevents things like leader (re)elections from happening; it does not improve performance. This means that if API performance is a limiting factor in the cluster, such as a cluster with a busy GitOps deployment, the latency profile will not help. A final caveat: third-party workloads and/or specialized layered products installed on control plane nodes may impose their own latency limitations (e.g., Red Hat OpenShift Data Foundation (ODF) or other software-defined storage CSI drivers).
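    As a sketch of what switching profiles looks like in practice: recent OpenShift releases expose a controlPlaneHardwareSpeed field on the cluster's Etcd resource for this purpose. Verify the field and accepted values against your version's documentation before applying; the commands below are a hedged example, not a prescriptive procedure:

    ```shell
    # Inspect the current hardware speed setting on the etcd operator config.
    oc describe etcd cluster | grep -i "hardware"

    # Switch to the slower profile (raises the heartbeat interval and
    # election timeout; it does not improve etcd performance).
    oc patch etcd cluster --type=merge \
      -p '{"spec": {"controlPlaneHardwareSpeed": "Slower"}}'

    # Watch the etcd logs for "took too long" apply warnings.
    oc logs -n openshift-etcd -l app=etcd -c etcd --tail=200 | grep "took too long"
    ```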

    What do I need to monitor and how?

    Again, the main causes of etcd issues are: slow storage/disk latency, CPU overload, etcd database size growth, and network latency between nodes. The following steps show how to monitor these issues.

    1. From the console, navigate to Administrator -> Observe -> Dashboards.

    2. Then select etcd from the dashboard dropdown. This brings up a number of etcd-specific graphs, as shown in Figures 1-3.

    etcd dashboard 1
    Figure 1: etcd performance dashboard.
    etcd dashboard 2
    Figure 2: etcd performance dashboard (continued).
    etcd dashboard 3
    Figure 3: etcd performance dashboard (continued).
    3. To rule out a slow disk, make sure the Disk Sync Duration is less than 25ms by inspecting the Disk Sync Duration graph (Figure 4).

    Disk Sync Duration dashboard
    Figure 4: The Disk Sync Duration graph.
    4. You can also use the fio tool, but it only provides a single point in time. Data fsync distributions should be less than 10ms, as shown in Figure 5.
    Data fsync distribution dashboard
    Figure 5: Data fsync distribution.
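    The single-point-in-time fio check can be run directly on a control plane node. This is a hedged sketch based on the upstream etcd hardware guidance: the parameters approximate etcd's WAL write pattern, and the scratch directory path is an example. Point --directory at a scratch path on the same disk as /var/lib/etcd, never at the live data files:

    ```shell
    # Benchmark fdatasync latency with etcd-like writes
    # (22 MiB written in 2300-byte blocks, syncing after every write).
    mkdir -p /var/lib/etcd/fio-test
    fio --rw=write --ioengine=sync --fdatasync=1 \
        --directory=/var/lib/etcd/fio-test \
        --size=22m --bs=2300 --name=etcd-fsync-check

    # In the output, check the fsync/fdatasync percentiles:
    # the 99th percentile should stay under 10ms.
    rm -rf /var/lib/etcd/fio-test
    ```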
    5. Monitor CPU overload on the CPU IOwait graph. This is the amount of time a CPU spends waiting for input/output (I/O) operations to complete, such as disk or network access. High I/O wait times can indicate that the CPU is idle while outstanding I/O requests remain, which can limit the CPU's performance. This value should be less than 4.0 (Figure 6).

    CPU overload and CPU IOwait dashboards
    Figure 6: Monitor CPU overload on the CPU IOwait graph.
    6. View the etcd database size growth on the DB size graph by clicking the inspect link in Figure 7.

    etcd database size dashboard
    Figure 7: View the etcd database size growth on the DB size graph.
    7. You can view network latency in the peer round trip time by clicking the inspect link in Figure 8. The latency between nodes should be less than 50ms.

    Network latency round trip time dashboard
    Figure 8: The view of network latency.
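    The dashboard panels above are backed by Prometheus metrics, so you can also query (or alert on) the same signals directly. A sketch of the relevant queries, assuming the standard etcd metric names:

    ```
    # 99th percentile WAL fsync duration (disk health; should stay well under 25ms)
    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

    # 99th percentile peer round trip time (network; should stay under 50ms)
    histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
    ```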

    Measuring jitter between nodes

    A final factor that may come into play is network jitter between nodes. Jitter is latency variation typically caused by path congestion, resource contention, or hardware performance. Network latency plus jitter is the number that should be less than 50ms.

    You can measure network jitter among all control plane nodes using the iPerf3 tool in UDP mode. The following KCS articles document a way to build and run custom iperf container images:

    • KCS 5233541—Testing Network Bandwidth in Red Hat OpenShift using iPerf Container.
    • KCS 6129701—How to run iPerf network performance test in Red Hat OpenShift Container Platform 4.

    Follow these steps to measure jitter between two nodes, using the container image from KCS 6129701:

    1. Connect to one of the control plane nodes and run the iPerf container as iPerf server in host network mode. When running in server mode, the tool accepts transmission control protocol (TCP) and user datagram protocol (UDP) tests:

    podman run -ti --rm --net host quay.io/kinvolk/iperf3 iperf3 -s
    2. Then connect to another control plane node and run iPerf in UDP client mode:

    podman run -ti --rm --net host quay.io/kinvolk/iperf3 iperf3 -u -c <node_iperf_server> -t 300
    3. The default test runs for 10 seconds; it is recommended to run it for 5 minutes/300 seconds (-t 300), as in the command above. At the end, the client output shows the average jitter (from the client's perspective):

    # oc debug node/m1
    Starting pod/m1-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 198.18.111.13
    If you don't see a command prompt, try pressing enter.
    sh-4.4# chroot /host
    sh-4.4# podman run -ti --rm --net host quay.io/kinvolk/iperf3 iperf3 -u -c m0
    Connecting to host m0, port 5201
    [  5] local 198.18.111.13 port 60878 connected to 198.18.111.12 port 5201
    [ ID] Interval           Transfer     Bitrate         Total Datagrams
    [  5]   0.00-1.00   sec   129 KBytes  1.05 Mbits/sec  91
    [  5]   1.00-2.00   sec   127 KBytes  1.04 Mbits/sec  90
    [  5]   2.00-3.00   sec   129 KBytes  1.05 Mbits/sec  91
    [  5]   3.00-4.00   sec   129 KBytes  1.05 Mbits/sec  91
    [  5]   4.00-5.00   sec   127 KBytes  1.04 Mbits/sec  90
    [  5]   5.00-6.00   sec   129 KBytes  1.05 Mbits/sec  91
    [  5]   6.00-7.00   sec   127 KBytes  1.04 Mbits/sec  90
    [  5]   7.00-8.00   sec   129 KBytes  1.05 Mbits/sec  91
    [  5]   8.00-9.00   sec   127 KBytes  1.04 Mbits/sec  90
    [  5]   9.00-10.00  sec   129 KBytes  1.05 Mbits/sec  91
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
    [  5]   0.00-10.00  sec  1.25 MBytes  1.05 Mbits/sec  0.000 ms  0/906 (0%)  sender
    [  5]   0.00-10.04  sec  1.25 MBytes  1.05 Mbits/sec  1.074 ms  0/906 (0%)  receiver
    iperf Done.
    sh-4.4#
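    If you want to capture the jitter value programmatically rather than reading it from the summary table, iPerf3 can emit JSON. This is a hedged sketch that assumes jq is available where you run it; for a UDP test, the jitter in milliseconds appears under end.sum.jitter_ms in the JSON output:

    ```shell
    # Run the UDP test with JSON output and extract the jitter in milliseconds.
    podman run --rm --net host quay.io/kinvolk/iperf3 \
      iperf3 -u -c m0 -t 300 -J | jq '.end.sum.jitter_ms'
    ```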

    Recap

    In this article, you learned about the new option in Red Hat OpenShift Container Platform 4.17 that allows a cluster to run four or five control plane nodes to enhance control plane resilience. You also learned how to monitor the main causes of etcd issues: slow storage/disk latency, CPU overload, etcd database size growth, and network latency between nodes. Finally, you saw how to measure network jitter among the control plane nodes using the iPerf3 tool in UDP mode. Learn more about the new OpenShift Container Platform 4.17 non-standard control plane high-availability option enabled by new etcd capabilities and API server optimizations.
