Understanding Kubernetes Leader Election: How Control Plane Components Achieve High Availability

Kubernetes is a distributed system, and keeping its control plane available and consistent requires a robust leader election mechanism. This article dives into how Kubernetes elects leaders, the role of etcd, and what happens during failures.

What is Leader Election in Kubernetes?

Leader election is the process of selecting a single authoritative instance of a control plane component to manage tasks while others remain in standby mode. This prevents conflicts and ensures high availability.

Kubernetes Control Plane Components and Leadership

Certain control plane components require a leader to coordinate tasks effectively:

1. etcd (Key-Value Store)

  • Purpose: Stores all cluster data, including configurations, secrets, and node states.

  • Leader Election:

    • Runs as a cluster with an odd number of members (e.g., 3, 5, 7) to maintain quorum.

    • Uses the Raft consensus algorithm for leader election.

    • If the leader node fails, other etcd nodes vote to elect a new leader.

    • Quorum requirement: A majority of etcd members must be available; otherwise etcd loses quorum and effectively becomes read-only (no new writes, updates, or pod rescheduling). The check below shows how to see the current leader and member status.
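
To see which etcd member currently holds Raft leadership, etcdctl can report it directly. This is a minimal check, assuming etcdctl v3 run on a control plane node with access to the etcd client certificates (see the monitoring section below for the TLS flags); the IS LEADER column in the table output identifies the current leader:

ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table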

2. kube-controller-manager (Controllers Manager)

  • Purpose: Runs multiple controllers that manage replicas, nodes, jobs, etc.

  • Leader Election:

    • In an HA setup, multiple kube-controller-managers exist, but only one acts as the leader.

    • The leader is elected by acquiring a Lease object via the API server, which persists it in etcd.

    • If the leader stops renewing its lease, a standby instance takes over once the lease expires (15 seconds by default), as the inspection example below shows.
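
You can inspect which instance currently holds controller-manager leadership. This is a minimal check, assuming a recent Kubernetes version where leader election uses a Lease object in the kube-system namespace; spec.holderIdentity names the current leader and spec.renewTime shows when it last renewed:

kubectl -n kube-system get lease kube-controller-manager -o yaml

If renewTime stops advancing, standby instances will try to acquire the lease once the lease duration has elapsed.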

3. kube-scheduler (Pod Scheduling)

  • Purpose: Assigns new pods to worker nodes based on resource availability and constraints.

  • Leader Election:

    • Works similarly to kube-controller-manager.

    • Only one kube-scheduler is the leader at any time; the others remain on standby (see the check below).
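
As with the controller manager, the scheduler's current leader can be read from its Lease object. A minimal check, again assuming Lease-based leader election in the kube-system namespace:

kubectl -n kube-system get lease kube-scheduler -o jsonpath='{.spec.holderIdentity}'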

How Does Kubernetes Leader Election Work?

  1. Initial Start-Up:

    • When a Kubernetes cluster starts for the first time, etcd forms a quorum, and a leader is elected using the Raft algorithm.

    • Control plane components (controller-manager, scheduler) use Lease objects served by the API server, and persisted in etcd, to elect their own leaders.

  2. Failure Scenario:

    • If a leader node fails, the remaining nodes detect the failure.

    • A new leader is elected from the remaining instances once the old lease expires, typically within 15 seconds (the default lease duration; see the flags below).

    • If fewer than a majority of etcd members are available, etcd loses quorum and effectively becomes read-only.
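
The failover timing above is controlled by the leader-election flags on kube-controller-manager and kube-scheduler. The values below are the documented defaults, shown purely for illustration; you would normally only change them if you need faster or slower failover:

kube-controller-manager --leader-elect=true \
  --leader-elect-lease-duration=15s \
  --leader-elect-renew-deadline=10s \
  --leader-elect-retry-period=2s

The lease duration is how long a lease remains valid before standbys may claim it, the renew deadline is how long the acting leader keeps trying to renew before giving up, and the retry period is how often candidates retry acquisition.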

Quorum in Kubernetes

Quorum refers to the minimum number of etcd members that must be healthy for the cluster to keep accepting writes. The rule is:

Quorum = (Total etcd nodes ÷ 2, rounded down) + 1

Example:

  • 3 etcd nodes → Quorum = 2 (If 2 nodes fail, etcd becomes read-only)

  • 5 etcd nodes → Quorum = 3 (If 3 nodes fail, etcd becomes read-only)

If quorum is lost:

  • etcd stops accepting writes (no new deployments, updates, or rescheduling of crashed pods).

  • Pods already running continue functioning, but no new workloads are scheduled.

  • Existing pod-to-pod communication keeps working, but Service and endpoint updates stop propagating, so traffic to pods that are replaced or change IP may start failing.

What is Split-Brain in Kubernetes?

Split-brain occurs when network partitions cause multiple nodes to think they are the leader. This leads to:

  • Conflicting writes to etcd.

  • Inconsistent cluster state.

  • Potential downtime if reconciliation fails.

Raft itself prevents a true split-brain in etcd: only the side of a partition that holds a majority can elect a leader. This is also why Kubernetes recommends an odd number of control plane nodes (3, 5, etc.): with an odd count, any partition leaves exactly one side with a clear majority.

Do Kubelet and Kube-Proxy Have Leaders?

Unlike control plane components:

  • Kubelet (manages the pods on its own node) runs independently on every node. No leader is needed.

  • Kube-Proxy (programs the networking rules for Services) also runs independently on each node, with no leader election; the check below shows it running as one pod per node.
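
A quick way to confirm this is to list the kube-proxy pods. This assumes a typical kubeadm-style cluster where kube-proxy runs as a DaemonSet labelled k8s-app=kube-proxy; you should see one pod per node, none of them marked as a leader:

kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide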

How to Monitor etcd and Quorum Status?

To check etcd endpoint status, including which member is currently the leader:

ETCDCTL_API=3 etcdctl endpoint status --write-out=table
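
In clusters where etcd is secured with TLS (the usual case), etcdctl also needs the client certificates and the endpoint address. The paths below are the typical kubeadm locations and are only an assumption; adjust them to your installation:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table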

To monitor leader election logs:

kubectl logs -n kube-system etcd-<leader-node-name>

Leader election ensures Kubernetes remains highly available. Understanding etcd, quorum, and election mechanisms helps in troubleshooting and maintaining cluster stability. Keeping an odd number of control plane nodes, monitoring quorum, and preventing split-brain are key aspects of Kubernetes resilience.

Have questions? Let us know in the comments!