InnoDB Cluster Troubleshooting Guide: Common Errors

InnoDB Cluster gives you a tightly integrated HA stack: MySQL Server with Group Replication, MySQL Shell, and MySQL Router. When it works, it is clean and predictable. When it fails, error messages can be confusing and spread across components.

This guide walks through common InnoDB Cluster errors, how to recognise them, and step-by-step ways to fix them. It assumes you are comfortable with MySQL basics and have shell access to your RHEL/Rocky Linux hosts.

1. Understand the InnoDB Cluster Pieces

Before troubleshooting, be clear about the components involved:

┌─────────────┐      ┌─────────────────┐
│  MySQL      │◀────▶│ Group Replication│
│  instances  │      └─────────────────┘
└─────┬───────┘
      │  managed by
      ▼
┌─────────────┐      ┌────────────┐
│ MySQL Shell │◀────▶│   Router   │
└─────────────┘      └────────────┘

MySQL Server runs Group Replication.
MySQL Shell (dba.*) manages the cluster.
MySQL Router sends application traffic to the right instance.

Always identify which layer is failing before changing anything.

2. Basic Health Checks

Start with simple checks on each node.

2.1 MySQL service and ports

sudo systemctl status mysqld
sudo ss -lntp | grep 3306

Confirm:

mysqld is active (running).
Port 3306 is listening on an address the other members and Router can reach (often 0.0.0.0 or the host's cluster IP, not only 127.0.0.1).
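
The second check can be scripted. This is a minimal sketch, assuming the usual `ss -lnt` column layout (state in column 1, local address:port in column 4); `listens_externally` is a hypothetical helper name, not part of any MySQL tooling.

```shell
# Succeeds if `ss -lnt` output on stdin shows a listener on port $1
# that is bound to something other than a loopback address.
listens_externally() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")                    # last piece is the port
            addr = substr($4, 1, length($4) - length(parts[n]) - 1)
            if (parts[n] == port && addr !~ /^(127\.|\[::1\])/) found = 1
        }
        END { exit !found }
    '
}

# Usage on a cluster node:
#   ss -lnt | listens_externally 3306 || echo "3306 is not reachable from other hosts"
```

A listener bound only to 127.0.0.1 passes a plain `grep 3306` but fails this check, which is the case that usually breaks cluster traffic.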

2.2 Shell view of the cluster

mysqlsh --uri clusteradmin@primary-host:3306 --js

dba.getCluster().status({extended: 1})

Look for:

Each member's status: ONLINE, as opposed to RECOVERING, OFFLINE, UNREACHABLE, or (MISSING).
Number of ONLINE members versus the number you expect.
The same Group Replication group name (a UUID) on all members.

2.3 Log locations

MySQL error log: /var/log/mysqld.log (or configured log_error).
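Router log: typically /var/log/mysqlrouter/mysqlrouter.log for package installs, or log/mysqlrouter.log under the directory given to --directory at bootstrap.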
Shell output: the terminal session you used to configure the cluster.

Most cluster problems will show a clear message in at least one of these.
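
To pull the relevant lines out of a busy error log, a small filter helps. The sketch below assumes the MySQL 8.0 log wording "Plugin group_replication reported" and "Plugin clone reported" seen elsewhere in this guide; `recent_plugin_errors` is a hypothetical helper name.

```shell
# Print the most recent error-level lines written by the Group Replication
# or clone plugins. $1 = log file, $2 = number of lines to show (default 20).
recent_plugin_errors() {
    grep -E 'ERROR.*Plugin (group_replication|clone) reported' "$1" | tail -n "${2:-20}"
}

# Usage:
#   sudo cat /var/log/mysqld.log > /tmp/mysqld.log.copy
#   recent_plugin_errors /tmp/mysqld.log.copy 10
```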

3. Group Replication Fails to Start

A common issue when bootstrapping or restarting nodes is Group Replication refusing to start.

3.1 Typical error messages

ERROR: Plugin group_replication reported: 'local address "ip:port" already in use'
ERROR: Plugin group_replication reported: 'Group name is malformed'
ERROR: Plugin group_replication reported: 'This member has more executed transactions than those present in the group'

3.2 Step-by-step checks

Verify mandatory settings

On each node, check:
```
SHOW VARIABLES LIKE 'server_id';
SHOW VARIABLES LIKE 'gtid_mode';
SHOW VARIABLES LIKE 'enforce_gtid_consistency';
SHOW VARIABLES LIKE 'binlog_checksum';
SHOW VARIABLES LIKE 'log_%_updates';  -- log_replica_updates from 8.0.26, log_slave_updates before
```
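Required values:
- server_id: non-zero and unique per node.
- gtid_mode = ON.
- enforce_gtid_consistency = ON.
- log_slave_updates = ON (named log_replica_updates from MySQL 8.0.26).
- binlog_checksum = NONE on MySQL 8.0.20 and earlier; from 8.0.21 Group Replication supports binary log checksums, so the default CRC32 is fine.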
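
The comparison against these values can be automated per node. This is a sketch only: `check_gr_settings` is a hypothetical helper that reads tab-separated name/value pairs as printed by `mysql -N -B`, and it deliberately skips binlog_checksum because the right value depends on the server version.

```shell
# Reads "name<TAB>value" lines on stdin and prints one line per setting
# that differs from what Group Replication needs. Non-zero exit on any mismatch.
check_gr_settings() {
    awk -F '\t' '
        BEGIN {
            want["gtid_mode"] = "ON"
            want["enforce_gtid_consistency"] = "ON"
            want["log_replica_updates"] = "ON"
            want["log_slave_updates"] = "ON"
        }
        ($1 in want) && toupper($2) != want[$1] {
            printf "%s is %s, expected %s\n", $1, $2, want[$1]; bad = 1
        }
        $1 == "server_id" && $2 + 0 == 0 {
            print "server_id is 0, must be a unique non-zero value"; bad = 1
        }
        END { exit bad + 0 }
    '
}

# Usage (credentials assumed to come from ~/.my.cnf or a login path):
#   mysql -N -B -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
#     ('server_id','gtid_mode','enforce_gtid_consistency',
#      'log_replica_updates','log_slave_updates')" | check_gr_settings
```

Uniqueness of server_id still has to be compared across nodes; the helper only catches the unset case.
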

Check group_replication options
```
SHOW VARIABLES LIKE 'group_replication%';
```
Confirm:
- group_replication_group_name is a valid UUID and identical on all nodes.
- group_replication_local_address is unique per node.
- group_replication_group_seeds lists the nodes using the same IP:port values they set in group_replication_local_address (not necessarily the SQL port 3306).
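
A malformed or mismatched group name is easy to catch before starting Group Replication. The sketch below uses two hypothetical helpers, `is_uuid` and `same_group_name`; it checks only the textual 8-4-4-4-12 hexadecimal layout and exact equality, which is an assumption about how strict you want to be.

```shell
# Succeeds if $1 has the 8-4-4-4-12 hexadecimal layout of a UUID.
is_uuid() {
    printf '%s\n' "$1" | grep -Eq '^[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}$'
}

# Succeeds if every argument is the same valid UUID, i.e. all nodes agree.
same_group_name() {
    [ "$#" -gt 0 ] || return 1
    first=$1
    for name in "$@"; do
        is_uuid "$name" && [ "$name" = "$first" ] || return 1
    done
}

# Usage: pass the value of group_replication_group_name collected from each node.
#   same_group_name "$name_node1" "$name_node2" "$name_node3" || echo "group names differ"
```
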

Port conflicts

If you see local address already in use, check the Group Replication port (default 33061):
```
sudo ss -lntp | grep 33061
```
Adjust group_replication_local_address or stop the conflicting service.

GTID divergence

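If the node has transactions that the group does not, Shell flags the errant GTIDs and usually offers clone as the recovery method. The safest approach is:
- Provision the node from a healthy member using the CLONE plugin or a fresh backup, which discards the node's extra transactions.
- Do not force the join. The group_replication_allow_local_disjoint_gtids_join override existed only in MySQL 5.7 and early 8.0 releases and was removed in 8.0.4; it belongs in controlled lab scenarios, never in production.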
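
Before reprovisioning, it is worth confirming which transactions are actually errant. A minimal sketch, assuming the `mysql` client can reach both hosts with credentials from ~/.my.cnf or a login path; `errant_gtids` is a hypothetical helper built on the server's GTID_SUBTRACT() function.

```shell
# Print the GTIDs that the joining node has executed but the group has not.
# $1 = joining node host, $2 = host of a healthy group member.
# An empty result means the node carries no extra transactions.
errant_gtids() {
    joiner_set=$(mysql -h "$1" -N -B -e "SELECT @@GLOBAL.gtid_executed") || return 1
    mysql -h "$2" -N -B -e "SELECT GTID_SUBTRACT('$joiner_set', @@GLOBAL.gtid_executed)"
}

# Usage:
#   errant_gtids broken-node healthy-node
```

If the output is non-empty, inspect those transactions in the joiner's binary log before discarding them, since they exist nowhere else.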

4. Node Stuck in RECOVERING or ERROR

Sometimes a node joins but never becomes ONLINE.

4.1 Recognising the problem

In MySQL Shell:

dba.getCluster().status()

You may see:

"status": "RECOVERING"

"status": "ERROR"

4.2 Common causes

Network issues between members (firewall, DNS, routing).
Insufficient privileges for the replication user.
Clone process failing due to disk space or permissions.

4.3 Step-by-step diagnostics

Network reachability

ping other-node
nc -vz other-node 3306
nc -vz other-node 33061

On RHEL/Rocky Linux, check firewalld:

sudo firewall-cmd --list-all
# If needed (example, adjust to your policy):
# sudo firewall-cmd --permanent --add-port=3306/tcp
# sudo firewall-cmd --permanent --add-port=33061/tcp
# sudo firewall-cmd --reload
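
With more than two members, checking every host and port by hand gets tedious. The loop below is a sketch that assumes the SQL port 3306 and the Group Replication port 33061 used in this guide, and an `nc` that supports `-z` and `-w`; `check_member_ports` is a hypothetical helper name.

```shell
# Try every host on both ports and report the pairs that cannot be reached.
# Hosts are passed as arguments; exit status is non-zero if any check fails.
check_member_ports() {
    failed=0
    for host in "$@"; do
        for port in 3306 33061; do
            if ! nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
                echo "UNREACHABLE $host:$port"
                failed=1
            fi
        done
    done
    return "$failed"
}

# Usage, run from each member in turn:
#   check_member_ports node1 node2 node3
```

Run it from every member, not just one: firewall rules are often asymmetric, and Group Replication needs each node to reach all the others.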

Replication user privileges

From a primary member:
```
SHOW GRANTS FOR 'repl'@'%';
```
Ensure at least the following (BACKUP_ADMIN is needed when distributed recovery uses clone):
```
GRANT REPLICATION SLAVE, BACKUP_ADMIN ON *.* TO 'repl'@'%';
```
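Adjust the user name/host to your configuration. Clusters built with MySQL Shell normally use recovery accounts that Shell creates itself (named mysql_innodb_cluster_<server_id>), so check those if no 'repl' user exists.

Clone failures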

Check the error log of the recovering node for messages like:
```
ERROR [MY-013172] [Server] Plugin clone reported: 'Clone failed ...'
```
Verify:
- Enough free disk space for a full copy of the data.
- File system permissions on the data directory.
- Connectivity back to the donor node.
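
The disk space check can be done before the clone starts. This is a sketch under assumptions: the data directory is /var/lib/mysql, the donor's size is measured with `du` over SSH, and `has_free_kib` is a hypothetical helper.

```shell
# Succeeds if the filesystem holding path $1 has at least $2 KiB available.
has_free_kib() {
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail" -ge "$2" ]
}

# Usage on the recovering node (host name and path are examples):
#   donor_kib=$(ssh donor-host "sudo du -sk /var/lib/mysql" | cut -f1)
#   has_free_kib /var/lib/mysql "$donor_kib" || echo "not enough room for a clone"
```
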

Re-adding the node

If the node is badly out of sync, the cleanest path is often:
1. Stop MySQL on the broken node.
2. Move aside the data directory (delete it only after you are sure you can reprovision).
3. Reinitialise the instance, or restore it from a fresh backup of a healthy member.
4. From MySQL Shell, use cluster.rejoinInstance() if the restored instance kept its identity; otherwise remove the stale entry with cluster.removeInstance() using the force option and add the node again with cluster.addInstance(), choosing clone as the recovery method.

Warning: Removing data directories or reinitialising a node is destructive. Only do this when you are certain the node does not contain unique data and the cluster has a healthy primary.

5. MySQL Router Connection Problems

Applications often see cluster issues as connection failures or read/write errors via Router.

5.1 Typical symptoms

Application cannot connect to Router host/port.
Writes failing with READ_ONLY errors.
Random disconnects during failover.

5.2 Check Router status

sudo systemctl status mysqlrouter
sudo ss -lntp | grep mysqlrouter

Inspect the Router log for errors about metadata or backend reachability.

5.3 Validate metadata

Router uses the InnoDB Cluster metadata schema. If you change instance hostnames or ports manually, metadata can become stale.

From MySQL Shell on the primary:

dba.getCluster().status({extended: 1})

Ensure the addresses match what Router expects in its mysqlrouter.conf. If not, you may need to:

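Reconcile the metadata with Shell: cluster.rescan() compares the metadata with the live group, and removing and re-adding an instance registers its new address. dba.rebootClusterFromCompleteOutage() is meant for restarting a cluster after every member has stopped, not for routine metadata repair.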
Rebootstrap Router using mysqlrouter --bootstrap against the updated metadata.

5.4 Read-only issues

If writes fail with ERROR 1290 (HY000): The MySQL server is running with the --read-only option:

Check the application connects to Router's read-write port (6446 by default for the classic protocol) and not the read-only port (6447 by default).
Verify which instance Router thinks is the primary.
Confirm the primary is not accidentally set to read_only=ON or super_read_only=ON.
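
The last check is quick to script across all members. A minimal sketch, assuming the `mysql` client can log in to each host with stored credentials; `report_read_only` is a hypothetical helper.

```shell
# Print "<host> writable", "<host> read-only" or "<host> unknown" for each host.
# super_read_only = 1 is normal on secondaries; on the primary it blocks writes.
report_read_only() {
    for host in "$@"; do
        flag=$(mysql -h "$host" -N -B -e "SELECT @@GLOBAL.super_read_only") || flag=""
        case "$flag" in
            0) echo "$host writable" ;;
            1) echo "$host read-only" ;;
            *) echo "$host unknown" ;;
        esac
    done
}

# Usage:
#   report_read_only node1 node2 node3
```

In a healthy single-primary cluster exactly one host should report writable, and it should be the one Shell lists as PRIMARY.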

6. Handling Primary Failover and Split-Brain Risks

InnoDB Cluster aims to avoid split-brain, but misconfigured networks or manual interventions can still cause trouble.

6.1 Quick topology view

dba.getCluster().status()

# On each node
SELECT * FROM performance_schema.replication_group_members\G
SELECT MEMBER_ROLE, MEMBER_STATE FROM performance_schema.replication_group_members;

All ONLINE members should agree on who the PRIMARY is.
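
That agreement can be verified mechanically once each node's view has been collected. The sketch below assumes one tab-separated line per node, with the node name followed by the primary it reports; `primaries_agree` is a hypothetical helper, and nodes that returned nothing are ignored rather than treated as disagreement.

```shell
# Reads "node<TAB>primary-as-seen-by-node" lines on stdin and succeeds only
# when all non-empty answers name the same single primary.
primaries_agree() {
    [ "$(cut -f2 | sort -u | grep -c .)" = "1" ]
}

# Usage (host names are examples):
#   for h in node1 node2 node3; do
#     printf '%s\t%s\n' "$h" "$(mysql -h "$h" -N -B -e \
#       "SELECT MEMBER_HOST FROM performance_schema.replication_group_members
#        WHERE MEMBER_ROLE = 'PRIMARY'")"
#   done | primaries_agree || echo "members disagree about the primary"
```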

6.2 If multiple primaries appear

Typical causes:

Network partition where two subsets of nodes think the others are dead.
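Manually bootstrapping a second group (START GROUP_REPLICATION with group_replication_bootstrap_group=ON) on an isolated node, or forcing quorum on a minority partition.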

Best practices:

Identify the partition that has the latest and most correct data (usually the majority partition that served traffic).
Shut down MySQL on the minority partition nodes to stop further divergence.
Later, reprovision minority nodes from the authoritative partition using CLONE or backup/restore.

Do not try to manually merge diverged data sets; rely on a single source of truth and reinitialise others.

7. Configuration Hygiene and Best Practices

Many InnoDB Cluster issues trace back to configuration drift. A few habits help prevent recurring errors.

7.1 Keep consistent my.cnf across nodes

Use configuration management (Ansible, etc.) to deploy my.cnf.
Standardise on the same innodb_buffer_pool_size, log settings, and character set where possible.
Ensure all Group Replication and GTID settings match.

7.2 Avoid manual changes to replication internals

Do not manually change gtid_executed or gtid_purged unless following a documented procedure.
Prefer MySQL Shell operations (dba.configureInstance, cluster.addInstance) to manual CHANGE REPLICATION SOURCE TO for cluster members.

7.3 Monitor proactively

Track Group Replication metrics (e.g. performance_schema.replication_group_member_stats).
Alert on state changes: ONLINE → ERROR / RECOVERING.
Monitor disk space, I/O latency, and network latency between nodes.
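
A state-change alert can start as a small filter over replication_group_members. This is a sketch, not a monitoring product: `alert_on_member_state` is a hypothetical helper that reads tab-separated host/state lines and takes the expected member count as an optional argument.

```shell
# Reads "host<TAB>state" lines on stdin. Prints an alert for each member that
# is not ONLINE and, if $1 is given, when fewer than $1 members are visible.
alert_on_member_state() {
    awk -F '\t' -v expected="${1:-0}" '
        $2 != "ONLINE" { printf "ALERT: %s is %s\n", $1, $2; bad = 1 }
        END {
            if (NR < expected) {
                printf "ALERT: only %d of %d members visible\n", NR, expected
                bad = 1
            }
            exit bad + 0
        }
    '
}

# Usage from cron or a monitoring agent, for a three-node cluster:
#   mysql -N -B -e "SELECT MEMBER_HOST, MEMBER_STATE
#     FROM performance_schema.replication_group_members" | alert_on_member_state 3
```

The non-zero exit status lets the calling scheduler or agent decide how to notify.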

8. Conclusion

InnoDB Cluster issues almost always leave evidence in logs, status views, or metadata. When something breaks, first identify which layer is at fault, then follow a structured checklist: verify core settings, confirm network reachability, check GTID alignment, and rely on Shell-managed operations rather than ad-hoc fixes. With consistent configuration and basic monitoring, most cluster errors become predictable and recoverable rather than mysterious outages.

This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.