Detecting and Resolving Split-Brain in MySQL/InnoDB HA Setups

Split-brain in MySQL high availability setups is one of the most dangerous failure modes: you end up with two (or more) writable primaries that diverge, and reconciling them safely is painful. This article focuses on how to detect, resolve, and prevent split-brain in MySQL/InnoDB environments.

What split-brain means in MySQL terms

In MySQL, split-brain usually appears in replicated or clustered setups:

Classic async primary/replica with an HA manager (e.g. Pacemaker, keepalived)
Group Replication / InnoDB Cluster with misconfigured quorum
Multi-DC replication topologies (active-active or circular replication)

The core problem is simple: more than one node accepts writes that are not synchronised with each other. Later, when connectivity is restored, their data and binary logs conflict.

Normal primary/replica:

  Clients --> [PRIMARY] --binlog--> [REPLICA]

Split-brain scenario:

  Clients --> [PRIMARY A]   X   [PRIMARY B] <-- Clients
                  ^   network split   ^
                  \-------------------/

Common causes of split-brain

Most split-brain incidents come from infrastructure or orchestration weaknesses rather than MySQL itself.

Network partitions between data centres or racks
Failover tools promoting a replica without fencing the old primary
Shared VIP or load balancer misconfiguration pointing to two primaries
Manual promotion of a replica while the old primary is still reachable somewhere
Cluster quorum misconfiguration (e.g. even number of nodes, no arbitrator)

Early warning: how to detect split-brain quickly

Detecting split-brain early limits damage. You want monitoring that can answer two questions:

How many nodes can accept writes right now?
Are there signs of conflicting replication topology?

Check MySQL replication status

On each node, start with replication status. For asynchronous replication:

mysql> SHOW SLAVE STATUS\G
mysql> SHOW MASTER STATUS\G

Red flags include:

More than one node reporting non-empty File and Position in SHOW MASTER STATUS and actively receiving writes.
Replication stopped with Duplicate entry or Update/delete rows not found errors.
Replica pointing to a different master than expected (check Master_Host in SHOW SLAVE STATUS).

Check application and VIP routing

From an application host, verify where writes are going:

$ mysql -h db-vip -e "SHOW VARIABLES LIKE 'server_id';"

If the same VIP or load balancer sometimes returns different server_id values while both allow writes, you likely have a split-brain or at least a split-primary configuration.

Check cluster or HA manager status

For HA stacks on RHEL/Rocky Linux (Pacemaker/Corosync, keepalived, etc.), check that only one node owns the shared resource (VIP, floating IP, shared storage). For example, with Pacemaker:

$ pcs status

Look for:

VIP running on more than one node
Resource state flapping between nodes

Never trust MySQL alone to tell you about split-brain. Always correlate MySQL state with network routing and HA manager status.

Operational best practice

Confirming a split-brain incident

Once you suspect split-brain, you need to confirm whether data has diverged.

Step 1: Freeze the topology

First priority is to stop making things worse. Choose one node as the candidate primary and reduce writes elsewhere.

Disable application writes to all but one node (change connection strings, disable VIPs, or set read_only=ON).
In an emergency, you may temporarily stop MySQL on suspect nodes:

# WARNING: Stopping MySQL is disruptive. Do this only with change control.
$ sudo systemctl stop mysqld

Make sure only one node is writable:

mysql> SHOW VARIABLES LIKE 'read_only';
mysql> SHOW VARIABLES LIKE 'super_read_only';

On replicas or suspect nodes, set:

mysql> SET GLOBAL super_read_only = ON;

Step 2: Compare binary log positions

On each node, collect:

mysql> SHOW MASTER STATUS\G
mysql> SHOW SLAVE STATUS\G

Record:

Binary log file and position
Executed_Gtid_Set (if using GTID)
Master_UUID and Retrieved_Gtid_Set (if GTID)

If using GTID, compare the sets. Divergence looks like this:

Node A Executed_Gtid_Set: 1111-1111-1111:1-100
Node B Executed_Gtid_Set: 2222-2222-2222:1-80

Here, both nodes have executed different GTID ranges from different sources, indicating conflicting histories.

Step 3: Spot data-level differences

For critical tables, run quick checksums to estimate divergence. You can use pt-table-checksum (Percona Toolkit) or simple checksums on representative tables:

mysql> SELECT COUNT(*), SUM(CRC32(CONCAT_WS('#', col1, col2)))
       FROM important_table;

If counts or checksums differ between nodes, data has diverged.

Strategies to resolve split-brain

Resolution means choosing an authoritative data source and carefully resynchronising others. There is no generic automatic fix; you must decide what data to keep or discard.

Strategy 1: Pick a winner, rebuild others

This is the safest and most common approach.

Choose the node with the most complete and trusted data as the new primary.
Back it up (physical backup with xtrabackup or logical with mysqldump).
Wipe and rebuild other nodes from this primary.

Example (on a replica to be rebuilt):

# WARNING: This destroys all data on the replica. Ensure you have backups.
$ sudo systemctl stop mysqld
$ rm -rf /var/lib/mysql/*
$ # Restore from backup or fresh clone
$ sudo systemctl start mysqld

Then configure replication from the chosen primary:

mysql> CHANGE MASTER TO
    MASTER_HOST='primary-host',
    MASTER_USER='repl',
    MASTER_PASSWORD='secret',
    MASTER_AUTO_POSITION=1;

mysql> START SLAVE;

Strategy 2: Manually merge data

Sometimes you cannot simply discard writes from the losing node. In this case:

Export conflicting tables from the losing node.
Compare with the winning node using checksums or diff tools.
Prepare merge scripts (INSERT IGNORE, REPLACE, or application-level reconciliation).

Example export on losing node:

$ mysqldump --single-transaction db important_table > losing-important.sql

Then on the winning node, load into a staging schema, compare, and merge carefully. This is time-consuming but sometimes necessary.

Strategy 3: GTID-based reconciliation (advanced)

If GTID is enabled and divergence is mild, you can sometimes reattach a node by skipping conflicting transactions. This is dangerous and must be used sparingly.

On the replica:

mysql> STOP SLAVE;
mysql> SET GLOBAL gtid_purged = '...';  -- only if reinitialising
mysql> START SLAVE;

Or skip specific transactions with SET GTID_NEXT and empty transactions. Always document exactly what you skip and why.

Preventing split-brain in MySQL HA designs

Prevention is far cheaper than recovery. Focus on fencing, quorum, and clear write ownership.

Design for single-writer semantics

Prefer designs where only one node is ever writable:

One primary, multiple replicas
Group Replication single-primary mode
InnoDB Cluster with a single writable primary

Even in active-active multi-DC, restrict each application to a local primary and avoid circular writes where possible.

Use strong fencing and STONITH

In cluster managers (Pacemaker/Corosync), configure STONITH (Shoot The Other Node In The Head) to ensure a failed primary is powered off or fenced before a new primary is promoted.

On RHEL/Rocky Linux, this typically involves:

A fencing agent that can power-cycle nodes (IPMI, iDRAC, etc.).
Cluster configuration that refuses promotion without successful fencing.

Configure proper quorum

Avoid even-numbered clusters without a tiebreaker. For 2-node clusters, add:

A third lightweight arbitrator (e.g. Galera arbitrator in Galera-based systems).
A quorum device (e.g. QDevice in Corosync).

Ensure that a minority partition cannot continue serving writes.

Protect the VIP or load balancer

The VIP or load balancer must never point at two writable nodes.

Use health checks that verify read_only = OFF and cluster state, not just TCP port 3306.
Ensure failback is controlled; avoid rapid bouncing between nodes.
Keep configuration under version control and review changes.

Monitor for anomalies

Add alerts for:

More than one node with super_read_only = OFF
Multiple nodes reporting themselves as primary in the HA manager
Replication errors (duplicate key, row not found)
VIP or load balancer serving different backends unexpectedly

Operational runbook example

Having a simple, written runbook reduces panic during incidents. An outline:

Detect: Alert triggers on multiple writable nodes.
Stabilise: Freeze writes to all but one node, or stop MySQL on secondary suspects.
Assess: Collect SHOW MASTER STATUS, SHOW SLAVE STATUS, GTID sets, and basic checksums.
Decide: Choose a winner node based on data completeness and business rules.
Rebuild: Recreate other nodes from the winner or manually merge where required.
Review: Fix root causes in HA configuration, quorum, or networking.

Conclusion

Split-brain in MySQL/InnoDB is primarily an HA design and operations problem, not a database feature issue. Detect it early by correlating MySQL state with network and HA manager status, resolve it by choosing a clear data authority and rebuilding others, and prevent recurrence with strong fencing, proper quorum, and strict single-writer semantics. Investing in these controls saves far more time and risk than any one-off recovery exercise.

This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.