Split-brain in MySQL high availability setups is one of the most dangerous failure modes: you end up with two (or more) writable primaries that diverge, and reconciling them safely is painful. This article focuses on how to detect, resolve, and prevent split-brain in MySQL/InnoDB environments.
What split-brain means in MySQL terms
In MySQL, split-brain usually appears in replicated or clustered setups:
- Classic async primary/replica with an HA manager (e.g. Pacemaker, keepalived)
- Group Replication / InnoDB Cluster with misconfigured quorum
- Multi-DC replication topologies (active-active or circular replication)
The core problem is simple: more than one node accepts writes that are not synchronised with each other. Later, when connectivity is restored, their data and binary logs conflict.
Normal primary/replica:
Clients --> [PRIMARY] --binlog--> [REPLICA]
Split-brain scenario:
Clients --> [PRIMARY A] X [PRIMARY B] <-- Clients
^ network split ^
\-------------------/
Common causes of split-brain
Most split-brain incidents come from infrastructure or orchestration weaknesses rather than MySQL itself.
- Network partitions between data centres or racks
- Failover tools promoting a replica without fencing the old primary
- Shared VIP or load balancer misconfiguration pointing to two primaries
- Manual promotion of a replica while the old primary is still reachable somewhere
- Cluster quorum misconfiguration (e.g. even number of nodes, no arbitrator)
Early warning: how to detect split-brain quickly
Detecting split-brain early limits damage. You want monitoring that can answer two questions:
- How many nodes can accept writes right now?
- Are there signs of conflicting replication topology?
Check MySQL replication status
On each node, start with replication status. For asynchronous replication:
mysql> SHOW SLAVE STATUS\G
mysql> SHOW MASTER STATUS\G
Red flags include:
- More than one node reporting non-empty File and Position in SHOW MASTER STATUS and actively receiving writes.
- Replication stopped with Duplicate entry or Update/delete rows not found errors.
- Replica pointing to a different master than expected (check Master_Host in SHOW SLAVE STATUS).
Check application and VIP routing
From an application host, verify where writes are going:
$ mysql -h db-vip -e "SHOW VARIABLES LIKE 'server_id';"
If the same VIP or load balancer sometimes returns different server_id values while both allow writes, you likely have a split-brain or at least a split-primary configuration.
Check cluster or HA manager status
For HA stacks on RHEL/Rocky Linux (Pacemaker/Corosync, keepalived, etc.), check that only one node owns the shared resource (VIP, floating IP, shared storage). For example, with Pacemaker:
$ pcs status
Look for:
- VIP running on more than one node
- Resource state flapping between nodes
Never trust MySQL alone to tell you about split-brain. Always correlate MySQL state with network routing and HA manager status.
Operational best practice
Confirming a split-brain incident
Once you suspect split-brain, you need to confirm whether data has diverged.
Step 1: Freeze the topology
First priority is to stop making things worse. Choose one node as the candidate primary and reduce writes elsewhere.
- Disable application writes to all but one node (change connection strings, disable VIPs, or set read_only=ON).
- In an emergency, you may temporarily stop MySQL on suspect nodes:
# WARNING: Stopping MySQL is disruptive. Do this only with change control.
$ sudo systemctl stop mysqld
Make sure only one node is writable:
mysql> SHOW VARIABLES LIKE 'read_only';
mysql> SHOW VARIABLES LIKE 'super_read_only';
On replicas or suspect nodes, set:
mysql> SET GLOBAL super_read_only = ON;
Step 2: Compare binary log positions
On each node, collect:
mysql> SHOW MASTER STATUS\G
mysql> SHOW SLAVE STATUS\G
Record:
- Binary log file and position
- Executed_Gtid_Set (if using GTID)
- Master_UUID and Retrieved_Gtid_Set (if GTID)
If using GTID, compare the sets. Divergence looks like this:
Node A Executed_Gtid_Set: 1111-1111-1111:1-100
Node B Executed_Gtid_Set: 2222-2222-2222:1-80
Here, both nodes have executed different GTID ranges from different sources, indicating conflicting histories.
Step 3: Spot data-level differences
For critical tables, run quick checksums to estimate divergence. You can use pt-table-checksum (Percona Toolkit) or simple checksums on representative tables:
mysql> SELECT COUNT(*), SUM(CRC32(CONCAT_WS('#', col1, col2)))
FROM important_table;
If counts or checksums differ between nodes, data has diverged.
Strategies to resolve split-brain
Resolution means choosing an authoritative data source and carefully resynchronising others. There is no generic automatic fix; you must decide what data to keep or discard.
Strategy 1: Pick a winner, rebuild others
This is the safest and most common approach.
- Choose the node with the most complete and trusted data as the new primary.
- Back it up (physical backup with xtrabackup or logical with mysqldump).
- Wipe and rebuild other nodes from this primary.
Example (on a replica to be rebuilt):
# WARNING: This destroys all data on the replica. Ensure you have backups.
$ sudo systemctl stop mysqld
$ rm -rf /var/lib/mysql/*
$ # Restore from backup or fresh clone
$ sudo systemctl start mysqld
Then configure replication from the chosen primary:
mysql> CHANGE MASTER TO
MASTER_HOST='primary-host',
MASTER_USER='repl',
MASTER_PASSWORD='secret',
MASTER_AUTO_POSITION=1;
mysql> START SLAVE;
Strategy 2: Manually merge data
Sometimes you cannot simply discard writes from the losing node. In this case:
- Export conflicting tables from the losing node.
- Compare with the winning node using checksums or diff tools.
- Prepare merge scripts (INSERT IGNORE, REPLACE, or application-level reconciliation).
Example export on losing node:
$ mysqldump --single-transaction db important_table > losing-important.sql
Then on the winning node, load into a staging schema, compare, and merge carefully. This is time-consuming but sometimes necessary.
Strategy 3: GTID-based reconciliation (advanced)
If GTID is enabled and divergence is mild, you can sometimes reattach a node by skipping conflicting transactions. This is dangerous and must be used sparingly.
On the replica:
mysql> STOP SLAVE;
mysql> SET GLOBAL gtid_purged = '...'; -- only if reinitialising
mysql> START SLAVE;
Or skip specific transactions with SET GTID_NEXT and empty transactions. Always document exactly what you skip and why.
Preventing split-brain in MySQL HA designs
Prevention is far cheaper than recovery. Focus on fencing, quorum, and clear write ownership.
Design for single-writer semantics
Prefer designs where only one node is ever writable:
- One primary, multiple replicas
- Group Replication single-primary mode
- InnoDB Cluster with a single writable primary
Even in active-active multi-DC, restrict each application to a local primary and avoid circular writes where possible.
Use strong fencing and STONITH
In cluster managers (Pacemaker/Corosync), configure STONITH (Shoot The Other Node In The Head) to ensure a failed primary is powered off or fenced before a new primary is promoted.
On RHEL/Rocky Linux, this typically involves:
- A fencing agent that can power-cycle nodes (IPMI, iDRAC, etc.).
- Cluster configuration that refuses promotion without successful fencing.
Configure proper quorum
Avoid even-numbered clusters without a tiebreaker. For 2-node clusters, add:
- A third lightweight arbitrator (e.g. Galera arbitrator in Galera-based systems).
- A quorum device (e.g. QDevice in Corosync).
Ensure that a minority partition cannot continue serving writes.
Protect the VIP or load balancer
The VIP or load balancer must never point at two writable nodes.
- Use health checks that verify read_only = OFF and cluster state, not just TCP port 3306.
- Ensure failback is controlled; avoid rapid bouncing between nodes.
- Keep configuration under version control and review changes.
Monitor for anomalies
Add alerts for:
- More than one node with super_read_only = OFF
- Multiple nodes reporting themselves as primary in the HA manager
- Replication errors (duplicate key, row not found)
- VIP or load balancer serving different backends unexpectedly
Operational runbook example
Having a simple, written runbook reduces panic during incidents. An outline:
- Detect: Alert triggers on multiple writable nodes.
- Stabilise: Freeze writes to all but one node, or stop MySQL on secondary suspects.
- Assess: Collect SHOW MASTER STATUS, SHOW SLAVE STATUS, GTID sets, and basic checksums.
- Decide: Choose a winner node based on data completeness and business rules.
- Rebuild: Recreate other nodes from the winner or manually merge where required.
- Review: Fix root causes in HA configuration, quorum, or networking.
Conclusion
Split-brain in MySQL/InnoDB is primarily an HA design and operations problem, not a database feature issue. Detect it early by correlating MySQL state with network and HA manager status, resolve it by choosing a clear data authority and rebuilding others, and prevent recurrence with strong fencing, proper quorum, and strict single-writer semantics. Investing in these controls saves far more time and risk than any one-off recovery exercise.
This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.


Leave a Reply