Automatic Failover in InnoDB Cluster Explained Simply

Automatic failover is one of the main reasons engineers choose MySQL InnoDB Cluster. It can detect a primary failure and promote a new one with minimal manual work. This article explains how it works, how to configure it safely, and what to watch out for.

Core components of InnoDB Cluster failover

InnoDB Cluster is built from three main parts:

  • MySQL Server with InnoDB and Group Replication for data replication
  • MySQL Shell and the AdminAPI to create and manage the cluster
  • MySQL Router to route application traffic to the right instance

A minimal production cluster usually has:

  • 3 or more MySQL instances in a Group Replication cluster
  • At least 2 MySQL Router instances (for redundancy)
  • Applications connecting to Router, not directly to MySQL

+-----------+       +-------------------+
|  App(s)   | ----> | MySQL Router (RW) |
+-----------+       +-------------------+
                              |
                     +--------+--------+
                     |  InnoDB Cluster |
                     +--------+--------+
                      |       |       |
                   Primary Replica Replica

How automatic failover works

Automatic failover is driven by Group Replication. The InnoDB Cluster metadata describes which instances belong to the cluster, and MySQL Router and the AdminAPI rely on it. At a high level:

  1. Group Replication members monitor each other through a consensus-based group communication layer.
  2. If the current primary disappears, the remaining members agree on a new view of the group that excludes it.
  3. A new primary is elected from the remaining online members.
  4. MySQL Router refreshes metadata and starts routing writes to the new primary.

Single-primary vs multi-primary modes

InnoDB Cluster supports:

  • Single-primary mode: one writable primary, others are read-only replicas.
  • Multi-primary mode: multiple writable instances (more complex conflict rules).

Most production deployments use single-primary because failover logic and application behaviour are simpler. This article assumes single-primary mode.

Election and quorum basics

Group Replication uses a majority-based quorum. With 3 nodes, if 1 fails and 2 can still talk to each other, they keep running. If the group splits 2–1, the side with 2 members continues; the single node gets blocked to avoid split-brain.

3-node cluster (single-primary)

[Node1] [Node2] [Node3]
   ^       ^       ^
   |       |       |
   +--- Group Replication ---+

If Node1 (primary) fails:
- Node2 and Node3 still have quorum
- Group elects a new primary (e.g. Node2)
- Router switches writes to Node2
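
The majority rule is simple enough to express directly. The following is a minimal sketch; has_quorum is a hypothetical helper for illustration, not part of any MySQL tool.

```python
def has_quorum(group_size: int, reachable: int) -> bool:
    """Return True if `reachable` members form a strict majority of the group."""
    if not 0 <= reachable <= group_size:
        raise ValueError("reachable must be between 0 and group_size")
    return reachable > group_size // 2


# A 3-node group splits 2-1: only the larger side keeps running.
for side in (2, 1):
    state = "continues" if has_quorum(3, side) else "is blocked"
    print(f"side with {side} of 3 members {state}")
```

Note that an even split (for example 2 of 4) is not a majority, so neither side of such a partition can continue.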

Step-by-step: what happens during failover

1. Failure detection

Each member periodically exchanges messages with the others. If the primary stops responding for longer than configured timeouts, Group Replication marks it as unreachable.

Key timeouts involved include:

  • Network timeouts (TCP, OS level)
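  • Group Replication timeouts (e.g. group_replication_member_expel_timeout, which controls how long a suspected member stays in the group before it is expelled)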

Detection is not instantaneous. Expect a short window where the primary is down but not yet declared failed.

2. View change and primary election

When a node is declared failed:

  1. Remaining members agree on a new group view.
  2. They verify quorum (a majority of the members in the previous view).
  3. They decide which node becomes the new primary.

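In single-primary mode, the choice of new primary is decided by the following factors, in order:

  • MySQL Server version (members running the lowest version in the group are considered first)
  • Member weight (group_replication_member_weight, set as memberWeight through the AdminAPI; the highest weight wins)
  • Server UUID (the lowest in lexical order, used as a tie-breaker)

Replication position is not part of this ordering: the elected member may still have a backlog of transactions to apply before it is fully up to date. If quorum is lost, automatic failover does not happen. This is by design to protect data consistency.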
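The election ordering can be sketched in a few lines of Python. This is a simplified model of the order documented for Group Replication in single-primary mode (lowest server version, then highest member weight, then lowest server UUID), not the server's actual implementation. The Member class and elect_primary function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    host: str
    version: tuple[int, int, int]  # MySQL Server version, e.g. (8, 0, 36)
    weight: int                    # models group_replication_member_weight
    server_uuid: str


def elect_primary(online_members: list[Member]) -> Member:
    """Pick a primary: lowest version, then highest weight, then lowest UUID."""
    if not online_members:
        raise ValueError("no online members to elect from")
    return min(
        online_members,
        key=lambda m: (m.version, -m.weight, m.server_uuid),
    )


# Node1 (the old primary) has failed; Node2 and Node3 remain.
survivors = [
    Member("node2", (8, 0, 36), 50, "bbbbbbbb-0000-0000-0000-000000000002"),
    Member("node3", (8, 0, 36), 80, "cccccccc-0000-0000-0000-000000000003"),
]
print(elect_primary(survivors).host)
```

Both survivors run the same version, so the higher weight on node3 decides the election. With equal weights, the lower UUID (node2) would win.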

3. Cluster metadata and role update

The InnoDB Cluster metadata schema (on the metadata server) is updated to reflect the new primary. This metadata is what MySQL Router and AdminAPI use to understand cluster topology.

From the DBA perspective, you see the new primary when you run:

mysqlsh root@replica1-host:3306 -- cluster status

or similar AdminAPI calls. You should confirm that the new primary is ONLINE and that other members are ONLINE or RECOVERING.
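
That check can be automated. The sketch below inspects a status dictionary and reports anything that violates the rule above. The field names (defaultReplicaSet, topology, memberRole, status) are assumed from typical MySQL 8.0 status output and may differ between versions; check_cluster_status and the sample data are hypothetical and hand-written for illustration.

```python
def check_cluster_status(status: dict) -> list[str]:
    """Return a list of human-readable problems found in a cluster status dict."""
    topology = status["defaultReplicaSet"]["topology"]
    problems = []

    primaries = [name for name, member in topology.items()
                 if member.get("memberRole") == "PRIMARY"]
    if len(primaries) != 1:
        problems.append(f"expected exactly 1 primary, found {len(primaries)}")

    for name, member in topology.items():
        state = member.get("status")
        if member.get("memberRole") == "PRIMARY":
            if state != "ONLINE":
                problems.append(f"primary {name} is {state}")
        elif state not in ("ONLINE", "RECOVERING"):
            problems.append(f"member {name} is {state}")
    return problems


# Hand-written sample shaped like status output after Node1 has failed.
sample = {
    "defaultReplicaSet": {
        "topology": {
            "node1:3306": {"memberRole": "SECONDARY", "status": "(MISSING)"},
            "node2:3306": {"memberRole": "PRIMARY", "status": "ONLINE"},
            "node3:3306": {"memberRole": "SECONDARY", "status": "RECOVERING"},
        }
    }
}
print(check_cluster_status(sample))
```

In MySQL Shell's Python mode, the dictionary would come from dba.get_cluster().status().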

4. Router refresh and application impact

MySQL Router periodically refreshes metadata from the cluster. When it sees the primary role has moved:

  • Existing write connections to the old primary will eventually fail.
  • New write connections are directed to the new primary.
  • Read-only connections may keep working if replicas are still available.

The application impact depends on:

  • How the client handles connection errors and retries
  • Transaction length (long transactions are more likely to be interrupted)
  • Router refresh interval

Automatic failover does not mean zero downtime; it means a short, controlled interruption that needs no manual intervention.

Configuring a basic InnoDB Cluster with automatic failover

The details vary by version, but the high-level steps on RHEL/Rocky Linux are similar.

1. Prepare three MySQL instances

On each node:

  1. Install MySQL Server and MySQL Shell from the vendor repository.
  2. Configure basic settings in my.cnf (data directory, ports, innodb settings).
  3. Ensure network connectivity between all nodes (firewall, SELinux policy as needed).

2. Configure Group Replication prerequisites

For each instance, ensure settings such as:

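[mysqld]
server_id=1                      # unique per node
gtid_mode=ON                     # GTIDs are required by Group Replication
enforce_gtid_consistency=ON
log_bin=binlog
binlog_format=ROW
# Default in 8.0 and deprecated; omit on releases where the variable no longer exists
transaction_write_set_extraction=XXHASH64
loose-group_replication_group_name="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
loose-group_replication_start_on_boot=OFF
loose-group_replication_bootstrap_group=OFF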

Adjust server_id and ports per node. Do not enable group_replication_bootstrap_group permanently.

3. Create the cluster with MySQL Shell

From one node using MySQL Shell in JavaScript or Python mode:

mysqlsh root@primary-host:3306

// Create an InnoDB Cluster in single-primary mode
var cluster = dba.createCluster('myCluster');
cluster.addInstance('root@replica1-host:3306');
cluster.addInstance('root@replica2-host:3306');

This sets up Group Replication, configures metadata, and establishes a single-primary cluster. Automatic failover is now available as long as quorum is maintained.

4. Deploy MySQL Router

Install MySQL Router on at least two separate hosts. Then bootstrap it against the cluster:

mysqlrouter --bootstrap root@primary-host:3306 \
  --directory /var/lib/mysqlrouter \
  --user mysqlrouter

This generates configuration with ports for read-write and read-only endpoints, plus metadata refresh behaviour.

Best practices for safe automatic failover

1. Always use an odd number of voting members

Use 3, 5, or 7 nodes to simplify quorum decisions and reduce split-brain risk. For many workloads, 3 nodes is sufficient.
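
A short calculation shows why even-sized groups add cost without adding resilience. This is a minimal sketch; tolerated_failures is a hypothetical helper name.

```python
def tolerated_failures(group_size: int) -> int:
    """Largest number of members that can fail while a majority remains."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return (group_size - 1) // 2


for n in range(3, 8):
    print(f"{n} members: survives {tolerated_failures(n)} failure(s)")
```

A 4-node group tolerates one failure, the same as a 3-node group, and a 6-node group tolerates two, the same as a 5-node group.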

2. Keep nodes close together for low latency

Group Replication is virtually synchronous: each transaction must be agreed on by a majority of members before it commits, which makes it sensitive to latency. Place nodes in the same data centre or on a low-latency network. Multi-region clusters are possible but require careful design and may increase failover time.

3. Separate Routers from database nodes

Run MySQL Router on separate hosts or application nodes so a DB node failure does not also remove the Router. Use at least two Routers behind a load balancer.

4. Tune timeouts carefully

If timeouts are too low, transient network blips can trigger unnecessary failover. If too high, failover becomes slow. Start with defaults, then:

  • Measure typical network latency and jitter between nodes.
  • Observe behaviour during maintenance and load spikes.
  • Adjust only when you understand the side effects.

5. Design applications for retry

Applications should:

  • Handle connection drops and retry with backoff.
  • Avoid long transactions when possible.
  • Be idempotent for critical operations, or use unique keys to prevent duplicates.

This makes automatic failover much smoother from the user perspective.
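
A retry wrapper with exponential backoff might look like the sketch below. It is driver-agnostic: run_with_retry is a hypothetical helper, and ConnectionError stands in for whatever exception your MySQL driver raises when a connection drops. Only wrap operations that are safe to repeat.

```python
import random
import time


def run_with_retry(operation, attempts=5, base_delay=0.2, max_delay=5.0,
                   retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    The delay doubles after each failed attempt (capped at max_delay),
    with random jitter so that many clients do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The operation passed in should open a fresh connection through MySQL Router on each attempt, so that a retry after failover reaches the new primary rather than reusing a dead connection.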

6. Test failover regularly

In a non-production environment with similar topology, simulate failures by:

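  • Stopping the primary mysqld process, both cleanly and abruptly (a clean shutdown leaves the group gracefully, while a crash has to be detected).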
  • Simulating network partitions between nodes.
  • Rebooting a node to verify rejoin behaviour.

Observe:

  • How long it takes for a new primary to appear.
  • How applications behave during the event.
  • Whether any data is lost or needs recovery.
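
To time the first observation, a small polling loop is enough. The sketch below is a test harness outline, not a finished tool. measure_failover is a hypothetical helper, and get_primary is an assumed callable that you supply (for example, one that connects through the Router read-write port and returns the server's hostname). It should return None or raise ConnectionError while no primary is reachable.

```python
import time


def measure_failover(get_primary, old_primary, timeout=120.0, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until a primary other than old_primary appears.

    Returns (new_primary, seconds_elapsed); raises TimeoutError otherwise.
    """
    start = clock()
    while True:
        try:
            current = get_primary()
        except ConnectionError:
            current = None  # no primary reachable yet
        if current and current != old_primary:
            return current, clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"no new primary within {timeout} seconds")
        sleep(interval)
```

Start the measurement immediately before triggering the failure. The result is accurate only to within one polling interval.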

Operational considerations and limitations

When automatic failover will not happen

InnoDB Cluster may refuse to fail over automatically when:

  • Quorum is lost (for example, 2 of 3 nodes are down).
  • No remaining member is ONLINE and eligible to become primary.
  • Metadata is inconsistent or cluster configuration is broken.

In these cases, manual intervention is required. This is intentional to protect data integrity.

Planned maintenance and switchover

For patching or hardware maintenance, use an orchestrated switchover instead of killing the primary. With MySQL Shell you can:

var cluster = dba.getCluster('myCluster');
cluster.setPrimaryInstance('root@replica1-host:3306');

This promotes a new primary in a controlled way; you can then safely stop the old primary.

Monitoring and alerting

Automatic failover does not remove the need for monitoring. At minimum, monitor:

  • Cluster status from AdminAPI (primary node, member states)
  • Replication lag and errors
  • Router health and connectivity
  • Application error rates during failover windows

This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.

Automatic failover in InnoDB Cluster is powerful but predictable once you understand the moving parts. Build with three or more nodes, route traffic through MySQL Router, tune timeouts cautiously, and test failover regularly. With these practices, you can achieve robust, self-healing MySQL deployments without adding unnecessary complexity.
