InnoDB Cluster gives you a tightly integrated HA stack: MySQL Server with Group Replication, MySQL Shell, and MySQL Router. When it works, it is clean and predictable. When it fails, error messages can be confusing and spread across components.
This guide walks through common InnoDB Cluster errors, how to recognise them, and step-by-step ways to fix them. It assumes you are comfortable with MySQL basics and have shell access to your RHEL/Rocky Linux hosts.
1. Understand the InnoDB Cluster Pieces
Before troubleshooting, be clear about the components involved:
┌─────────────┐      ┌───────────────────┐
│    MySQL    │◀────▶│ Group Replication │
│  instances  │      └───────────────────┘
└──────┬──────┘
       │ managed by
       ▼
┌─────────────┐      ┌────────────┐
│ MySQL Shell │◀────▶│   Router   │
└─────────────┘      └────────────┘
- MySQL Server runs Group Replication.
- MySQL Shell (dba.*) manages the cluster.
- MySQL Router sends application traffic to the right instance.
Always identify which layer is failing before changing anything.
2. Basic Health Checks
Start with simple checks on each node.
2.1 MySQL service and ports
sudo systemctl status mysqld
sudo ss -lntp | grep 3306
Confirm:
- mysqld is active (running).
- Port 3306 is listening on the correct IP (often 0.0.0.0 for cluster nodes).
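The listening check above can be scripted when you need to run it across several nodes. The following is a minimal sketch, assuming you have captured the text output of `ss -lntp` (for example over SSH); the function name `listening_addresses` is hypothetical and not part of any MySQL tooling.

```python
def listening_addresses(ss_output: str, port: int) -> list:
    """Return the local addresses that are in LISTEN state on the given port.

    Assumes the column layout printed by `ss -lntp`, where the fourth
    whitespace-separated field is "local-address:port".
    """
    addresses = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a listening socket.
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        # rpartition keeps IPv6 forms such as "[::]:3306" intact.
        address, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            addresses.append(address)
    return addresses
```

An empty list means nothing is listening on that port. A result such as 127.0.0.1 only would suggest that other cluster members cannot reach the instance.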
2.2 Shell view of the cluster
mysqlsh --uri clusteradmin@primary-host:3306 --js
dba.getCluster().status({extended: 1})
Look for:
- status: ONLINE vs OFFLINE/UNREACHABLE.
- Number of ONLINE members vs expected.
- Group name and UUID identical on all members.
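If you export the status as JSON, the "ONLINE members vs expected" comparison can be automated. This is a rough sketch, assuming the status document has already been loaded into a Python dict and that members are listed under `defaultReplicaSet.topology` with a `status` field; verify that layout against the output of your own Shell version. The function name `summarize_status` is hypothetical.

```python
def summarize_status(status: dict, expected_members: int) -> dict:
    """Count ONLINE members in a cluster status document.

    Assumption: members live under status["defaultReplicaSet"]["topology"],
    keyed by "host:port", each with a "status" string.
    """
    topology = status["defaultReplicaSet"]["topology"]
    states = {name: member.get("status", "UNKNOWN") for name, member in topology.items()}
    not_online = {name: state for name, state in states.items() if state != "ONLINE"}
    online = len(states) - len(not_online)
    return {
        "online": online,
        "expected": expected_members,
        "healthy": online == expected_members,
        "not_online": not_online,
    }
```

The `not_online` entry tells you which members to inspect first.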
2.3 Log locations
- MySQL error log: /var/log/mysqld.log (or the configured log_error).
- Router log: typically /var/log/mysqlrouter/mysqlrouter.log for package installs (check logging_folder in mysqlrouter.conf).
- Shell output: the terminal session you used to configure the cluster.
Most cluster problems will show a clear message in at least one of these.
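To pull the Group Replication messages out of a busy error log, a small filter helps. The sketch below assumes the MySQL 8.0 error log line style, in which an `[ERROR]` label is followed by `Plugin group_replication reported: '...'`; the helper name `group_replication_errors` is hypothetical.

```python
import re

# Matches error-level lines emitted by the Group Replication plugin and
# captures the quoted message text.
GR_ERROR = re.compile(r"\[ERROR\].*Plugin group_replication reported: '(?P<msg>.*)'")


def group_replication_errors(log_text: str) -> list:
    """Return the message text of every Group Replication error line."""
    messages = []
    for line in log_text.splitlines():
        match = GR_ERROR.search(line)
        if match:
            messages.append(match.group("msg"))
    return messages
```

Read the log file yourself and pass its contents in, for example `group_replication_errors(open("/var/log/mysqld.log").read())`, adjusting the path to your log_error setting.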
3. Group Replication Fails to Start
A common issue when bootstrapping or restarting nodes is Group Replication refusing to start.
3.1 Typical error messages
ERROR: Plugin group_replication reported: 'local address "ip:port" already in use'
ERROR: Plugin group_replication reported: 'Group name is malformed'
ERROR: Plugin group_replication reported: 'The member has more executed transactions than those present in the group'
3.2 Step-by-step checks
- Verify mandatory settings
  On each node, check:
  SHOW VARIABLES LIKE 'server_id';
  SHOW VARIABLES LIKE 'gtid_mode';
  SHOW VARIABLES LIKE 'enforce_gtid_consistency';
  SHOW VARIABLES LIKE 'binlog_checksum';
  SHOW VARIABLES LIKE 'log_slave_updates';
  Best practice values:
  - server_id: unique per node.
  - gtid_mode = ON.
  - enforce_gtid_consistency = ON.
  - log_slave_updates = ON (named log_replica_updates from MySQL 8.0.26).
  - binlog_checksum = NONE on versions before MySQL 8.0.21; later versions of Group Replication support checksums.
- Check group_replication options
  SHOW VARIABLES LIKE 'group_replication%';
  Confirm:
  - group_replication_group_name is a valid UUID and identical on all nodes.
  - group_replication_local_address is unique per node.
  - group_replication_group_seeds lists all nodes with correct IP:port.
- Port conflicts
  If you see local address already in use, check the Group Replication port (default 33061):
  sudo ss -lntp | grep 33061
  Adjust group_replication_local_address or stop the conflicting service.
- GTID divergence
  If the node has extra transactions not in the group, Shell usually suggests using clone or incremental recovery. The safest approach is:
  - Provision the node from a healthy member using the CLONE plugin or a fresh backup.
  - Only use group_replication_allow_local_disjoint_gtids_join in controlled lab scenarios, not in production. The variable was removed in MySQL 8.0.4, so it is not available on current releases.
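The settings checks in this section lend themselves to a small validator. The sketch below is illustrative only: it assumes you have already collected each node's `SHOW VARIABLES` values into a dict of strings, and the function name `validate_nodes` and the message wording are hypothetical.

```python
import uuid

# Values every member is expected to share, per the checklist above.
REQUIRED = {"gtid_mode": "ON", "enforce_gtid_consistency": "ON"}
# Settings that must differ on every member.
UNIQUE_KEYS = ("server_id", "group_replication_local_address")


def validate_nodes(nodes: dict) -> list:
    """Return a list of problems; nodes maps node name -> {variable: value}."""
    problems = []
    for name, variables in nodes.items():
        for key, wanted in REQUIRED.items():
            if str(variables.get(key, "")).upper() != wanted:
                problems.append(f"{name}: {key} should be {wanted}")
    for key in UNIQUE_KEYS:
        values = [variables.get(key) for variables in nodes.values()]
        if len(set(values)) != len(values):
            problems.append(f"{key} must be unique per node")
    group_names = {v.get("group_replication_group_name") for v in nodes.values()}
    if len(group_names) != 1:
        problems.append("group_replication_group_name differs between nodes")
    else:
        try:
            # Note: uuid.UUID is more lenient than MySQL (it also accepts
            # forms without hyphens), so this is only a first-pass check.
            uuid.UUID(next(iter(group_names)))
        except (ValueError, TypeError, AttributeError):
            problems.append("group_replication_group_name is not a valid UUID")
    return problems
```

An empty list means the collected values pass these particular checks; it does not replace the checks that MySQL Shell performs itself.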
4. Node Stuck in RECOVERING or ERROR
Sometimes a node joins but never becomes ONLINE.
4.1 Recognising the problem
In MySQL Shell:
dba.getCluster().status()
You may see:
"status": "RECOVERING"
or
"status": "ERROR"
4.2 Common causes
- Network issues between members (firewall, DNS, routing).
- Insufficient privileges for the replication user.
- Clone process failing due to disk space or permissions.
4.3 Step-by-step diagnostics
- Network reachability
  ping -c 3 other-node
  nc -vz other-node 3306
  nc -vz other-node 33061
  On RHEL/Rocky Linux, check firewalld:
  sudo firewall-cmd --list-all
  # If needed (example, adjust to your policy):
  # sudo firewall-cmd --permanent --add-port=3306/tcp
  # sudo firewall-cmd --permanent --add-port=33061/tcp
  # sudo firewall-cmd --reload
- Replication user privileges
  From a primary member:
  SHOW GRANTS FOR 'repl'@'%';
  Ensure at least:
  GRANT REPLICATION SLAVE, BACKUP_ADMIN ON *.* TO 'repl'@'%';
  Adjust the user name/host to your configuration.
- Clone failures
  Check the error log of the recovering node for messages like:
  ERROR [MY-013172] [Server] Plugin clone reported: 'Clone failed ...'
  Verify:
  - Enough free disk space for a full copy of the data.
  - File system permissions on the data directory.
  - Connectivity back to the donor node.
- Re-adding the node
  If the node is badly out of sync, the cleanest path is often:
  - Stop MySQL on the broken node.
  - Remove or move aside the data directory (only after you are sure you can reprovision).
  - Provision from a fresh backup or CLONE from a healthy member.
  - Use MySQL Shell cluster.rejoinInstance() or cluster.addInstance().
Warning: Removing data directories or reinitialising a node is destructive. Only do this when you are certain the node does not contain unique data and the cluster has a healthy primary.
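The replication user privilege check from the diagnostics above can also be scripted. This is a minimal sketch, assuming you pass in the rows returned by `SHOW GRANTS` as plain strings; it only looks at global (`ON *.*`) grants, and the name `missing_privileges` is hypothetical.

```python
import re

# The privileges this guide asks for on the replication user.
REQUIRED_PRIVS = frozenset({"REPLICATION SLAVE", "BACKUP_ADMIN"})

# Captures the privilege list of a global grant, e.g.
# "GRANT REPLICATION SLAVE ON *.* TO ..."
GLOBAL_GRANT = re.compile(r"GRANT (.+?) ON \*\.\* TO ")


def missing_privileges(grant_rows, required=REQUIRED_PRIVS) -> set:
    """Return the required privileges that no global grant row provides."""
    granted = set()
    for row in grant_rows:
        match = GLOBAL_GRANT.match(row)
        if match:
            granted.update(p.strip().upper() for p in match.group(1).split(","))
    return set(required) - granted
```

An empty set means both privileges were found. Schema-level grants are deliberately ignored because these privileges are granted globally.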
5. MySQL Router Connection Problems
Applications often see cluster issues as connection failures or read/write errors via Router.
5.1 Typical symptoms
- Application cannot connect to Router host/port.
- Writes failing with READ_ONLY errors.
- Random disconnects during failover.
5.2 Check Router status
sudo systemctl status mysqlrouter
sudo ss -lntp | grep mysqlrouter
Inspect the Router log for errors about metadata or backend reachability.
5.3 Validate metadata
Router uses the InnoDB Cluster metadata schema. If you change instance hostnames or ports manually, metadata can become stale.
From MySQL Shell on the primary:
dba.getCluster().status({extended: 1})
Ensure the addresses match what Router expects in its mysqlrouter.conf. If not, you may need to:
- Update the instance addresses with Shell (dba.rebootClusterFromCompleteOutage() sometimes rewrites metadata).
- Rebootstrap Router using mysqlrouter --bootstrap against the updated metadata.
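Comparing the addresses by eye gets error-prone with more than a few instances. The sketch below assumes the cluster status is available as a dict whose members sit under `defaultReplicaSet.topology` with an `address` field, and that you have collected the addresses Router is configured with into a list yourself; `address_drift` is a hypothetical helper name.

```python
def address_drift(status: dict, router_addresses) -> dict:
    """Compare cluster member addresses with the ones Router knows about.

    Assumption: each topology entry carries an "address" of the form
    "host:port". router_addresses is whatever list you extracted from
    your Router configuration.
    """
    topology = status["defaultReplicaSet"]["topology"]
    cluster = {member["address"] for member in topology.values()}
    router = set(router_addresses)
    return {
        # Router still points at something the cluster no longer lists.
        "unknown_to_cluster": sorted(router - cluster),
        # The cluster has a member Router has not been told about.
        "missing_from_router": sorted(cluster - router),
    }
```

Two empty lists mean both sides agree. Anything else points to stale metadata or a Router that needs to be bootstrapped again.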
5.4 Read-only issues
If writes fail with ERROR 1290 (HY000): The MySQL server is running with the --read-only option:
- Check Router is using the RW port for writes (not the RO port).
- Verify which instance Router thinks is the primary.
- Confirm the primary is not accidentally set to read_only=ON or super_read_only=ON.
6. Handling Primary Failover and Split-Brain Risks
InnoDB Cluster aims to avoid split-brain, but misconfigured networks or manual interventions can still cause trouble.
6.1 Quick topology view
dba.getCluster().status()
# On each node
SELECT * FROM performance_schema.replication_group_members\G
SELECT MEMBER_ROLE, MEMBER_STATE FROM performance_schema.replication_group_members;
All ONLINE members should agree on who the PRIMARY is.
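Checking that agreement across nodes can be sketched as follows. The example assumes single-primary mode and that you queried MEMBER_HOST, MEMBER_ROLE and MEMBER_STATE from `performance_schema.replication_group_members` on each node (MEMBER_HOST is an extra column compared with the query above); the function names are hypothetical.

```python
def primary_views(views: dict) -> dict:
    """Map each queried node to the ONLINE primaries it reports.

    views maps the node you queried to its rows, each row being a
    (member_host, member_role, member_state) tuple.
    """
    return {
        node: sorted(
            host for host, role, state in rows
            if role == "PRIMARY" and state == "ONLINE"
        )
        for node, rows in views.items()
    }


def agree_on_primary(views: dict) -> bool:
    """True only if every node reports the same single ONLINE primary."""
    distinct = {tuple(primaries) for primaries in primary_views(views).values()}
    return len(distinct) == 1 and len(next(iter(distinct))) == 1
```

If `agree_on_primary` returns False, print `primary_views(views)` to see which nodes disagree before touching anything.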
6.2 If multiple primaries appear
Typical causes:
- Network partition where two subsets of nodes think the others are dead.
- Manual START GROUP_REPLICATION on an isolated node.
Best practices:
- Identify the partition that has the latest and most correct data (usually the majority partition that served traffic).
- Shut down MySQL on the minority partition nodes to stop further divergence.
- Later, reprovision minority nodes from the authoritative partition using CLONE or backup/restore.
Do not try to manually merge diverged data sets; rely on a single source of truth and reinitialise others.
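To decide which partition is authoritative, it helps to see exactly which transactions a node has that the other side lacks. On a live server the built-in GTID_SUBTRACT() function answers this directly; the sketch below does the same comparison offline on two `gtid_executed` strings. It assumes the classic `uuid:1-5:7` notation (tagged GTIDs are not handled) and expands ranges into sets, so it is only suitable for modest ranges. The function names are hypothetical.

```python
def parse_gtid_set(text: str) -> dict:
    """Map each source UUID to the set of transaction numbers it covers."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        # "uuid:1-5:7" -> source "uuid", intervals ["1-5", "7"]
        source, *intervals = part.split(":")
        transactions = result.setdefault(source.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            transactions.update(range(int(start), int(end or start) + 1))
    return result


def extra_transactions(candidate: str, authoritative: str) -> dict:
    """Return transactions present in candidate but not in authoritative."""
    cand = parse_gtid_set(candidate)
    auth = parse_gtid_set(authoritative)
    extra = {}
    for source, transactions in cand.items():
        difference = transactions - auth.get(source, set())
        if difference:
            extra[source] = sorted(difference)
    return extra
```

A non-empty result for a minority node confirms it has diverged and should be reprovisioned rather than rejoined.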
7. Configuration Hygiene and Best Practices
Many InnoDB Cluster issues trace back to configuration drift. A few habits help prevent recurring errors.
7.1 Keep consistent my.cnf across nodes
- Use configuration management (Ansible, etc.) to deploy my.cnf.
- Standardise on the same innodb_buffer_pool_size, log settings, and character set where possible.
- Ensure all Group Replication and GTID settings match.
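Drift between nodes can be detected by diffing the `[mysqld]` sections of their option files. The following is a rough sketch using Python's standard `configparser`; it assumes simple `key = value` files and does not handle `!include` or `!includedir` directives. The helper names `mysqld_options` and `find_drift` are hypothetical.

```python
import configparser


def mysqld_options(cnf_text: str) -> dict:
    """Parse the [mysqld] section of an option file into a dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # options such as "skip-name-resolve" have no value
        strict=False,         # tolerate repeated sections or options
        interpolation=None,   # treat "%" literally
    )
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return {}
    # Option files treat dashes and underscores in names as equivalent.
    return {key.replace("-", "_"): value for key, value in parser.items("mysqld")}


def find_drift(configs: dict, keys) -> dict:
    """Return {option: {node: value}} for options whose values differ.

    configs maps a node name to the text of its option file.
    """
    parsed = {node: mysqld_options(text) for node, text in configs.items()}
    drift = {}
    for key in keys:
        values = {node: options.get(key) for node, options in parsed.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift
```

Settings that are meant to differ, such as server_id, should be left out of `keys`.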
7.2 Avoid manual changes to replication internals
- Do not manually change gtid_executed or gtid_purged unless following a documented procedure.
- Prefer MySQL Shell operations (dba.configureInstance, cluster.addInstance) to manual CHANGE REPLICATION SOURCE TO for cluster members.
7.3 Monitor proactively
- Track Group Replication metrics (e.g. performance_schema.replication_group_member_stats).
- Alert on state changes: ONLINE → ERROR / RECOVERING.
- Monitor disk space, I/O latency, and network latency between nodes.
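The state-change alert can be as simple as comparing two snapshots of member states. This is a minimal sketch, assuming your monitoring job stores the previous poll as a dict of member name to state. As a design choice it flags any move away from ONLINE, including a member disappearing from the list, not only ERROR and RECOVERING. The name `state_change_alerts` is hypothetical.

```python
def state_change_alerts(previous: dict, current: dict) -> list:
    """Return one message per member that was ONLINE and no longer is.

    Both arguments map a member name to its MEMBER_STATE string.
    """
    alerts = []
    for member, old_state in previous.items():
        # A member missing from the new snapshot is treated as a change too.
        new_state = current.get(member, "MISSING")
        if old_state == "ONLINE" and new_state != "ONLINE":
            alerts.append(f"{member}: {old_state} -> {new_state}")
    return alerts
```

Feed the returned messages into whatever alerting channel you already use.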
8. Conclusion
InnoDB Cluster issues almost always leave evidence in logs, status views, or metadata. When something breaks, first identify which layer is at fault, then follow a structured checklist: verify core settings, confirm network reachability, check GTID alignment, and rely on Shell-managed operations rather than ad-hoc fixes. With consistent configuration and basic monitoring, most cluster errors become predictable and recoverable rather than mysterious outages.
This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.

