Monitoring a MySQL InnoDB Cluster is less about pretty dashboards and more about knowing which signals matter for availability, performance, and data safety. This guide walks through what to watch, how to collect it, and how to reason about cluster health.
What Is MySQL InnoDB Cluster Monitoring Really About?
InnoDB Cluster (Group Replication + MySQL Router + MySQL Shell) gives you HA, but only if you detect problems early. Effective monitoring focuses on three areas:
- Cluster topology: which nodes are primary/secondary, who can vote, who can serve reads.
- Replication health: write consistency, lag, conflicts, and failover readiness.
- Resource usage: CPU, memory, I/O, and InnoDB internals that affect throughput and latency.
Think in terms of questions you want answered quickly:
- Is the cluster writable and consistent?
- Which node is the primary right now?
- Are any nodes unhealthy, lagging, or about to run out of resources?
- Would a failover succeed now?
Step 1: Establish a Baseline Cluster View
Start by making it trivial to see which nodes exist and their roles. A simple mental model helps:
[Client] --> [MySQL Router] --> [Primary Node]
                            |--> [Secondary Node 1]
                            |--> [Secondary Node 2]
Use MySQL Shell on any cluster member:
mysqlsh --uri root@primary-host:3306 --js
mysql-js> var cluster = dba.getCluster()
mysql-js> cluster.status({extended: 1})
Key things to check in the output:
- status is OK (or at least OK_NO_TOLERANCE).
- All expected instances appear under defaultReplicaSet.topology.
- memberRole shows exactly one PRIMARY and the rest SECONDARY.
- memberState is ONLINE for all members that should be serving.
Automate this check via a script that parses cluster.status() JSON output and alerts on:
- Missing members.
- No primary.
- Members stuck in RECOVERING or ERROR states.
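The checks above can be sketched in a few lines of Python, assuming the output of cluster.status() has already been captured as JSON and parsed into a dictionary. The field names follow the layout described above; `expected_members` and the set of "bad" states are illustrative assumptions, not an official list.

```python
# Member states treated as "not serving". This set is an assumption for
# illustration; extend it to match what your Shell version reports.
BAD_STATES = {"RECOVERING", "ERROR", "OFFLINE", "UNREACHABLE", "(MISSING)"}


def check_cluster_status(status, expected_members):
    """Return a list of alert strings for a parsed cluster.status() document."""
    alerts = []
    topology = status.get("defaultReplicaSet", {}).get("topology", {})

    # Missing members: expected addresses that are absent from the topology.
    for name in sorted(set(expected_members) - set(topology)):
        alerts.append(f"missing member: {name}")

    # Single-primary mode expects exactly one PRIMARY.
    primaries = sorted(n for n, m in topology.items()
                       if m.get("memberRole") == "PRIMARY")
    if not primaries:
        alerts.append("no primary")
    elif len(primaries) > 1:
        alerts.append("multiple primaries: " + ", ".join(primaries))

    # Members stuck in a non-serving state. Prefer memberState (extended
    # output) and fall back to the per-instance status field.
    for name, member in sorted(topology.items()):
        state = member.get("memberState", member.get("status", "UNKNOWN"))
        if state in BAD_STATES:
            alerts.append(f"{name} is {state}")
    return alerts
```

An empty list means none of the three conditions was detected; anything else can be forwarded to whatever alerting channel is already in place.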
Step 2: Monitor Group Replication Core Metrics
Group Replication is the heart of InnoDB Cluster. If it is unhealthy, your HA story fails. Focus on:
- Member state and role.
- Replication throughput and conflicts.
- Network and flow control behaviour.
Key replication status queries
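On each node, enable any Performance Schema consumers you want for statement and stage history. The Group Replication tables queried below are populated without them, and changes made with UPDATE do not survive a restart, so set the matching performance-schema-consumer-* options in the server configuration if you want them to persist: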
UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME IN ('events_statements_history_long', 'events_stages_history_long');
Then query replication-related tables:
SELECT *
FROM performance_schema.replication_group_members\G
SELECT *
FROM performance_schema.replication_group_member_stats\G
Important fields:
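- MEMBER_STATE: should be ONLINE.
- MEMBER_ROLE: PRIMARY or SECONDARY.
- COUNT_TRANSACTIONS_IN_QUEUE: transactions waiting for conflict detection (certification).
- COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: transactions received from the group and waiting to be applied (lag signal).
- RECEIVED_TRANSACTION_SET (from performance_schema.replication_connection_status): useful to compare between nodes for divergence.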
Detecting lag and replication pressure
Group Replication does not expose a single “seconds behind” metric like classic async replication. Instead, treat lag as:
- Growing COUNT_TRANSACTIONS_IN_QUEUE or COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE.
- Increased commit latency on the primary.
- Rising network and flow control metrics.
Monitor the queues and counters that drive flow control in replication_group_member_stats:
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_APPLIED,
       COUNT_TRANSACTIONS_LOCAL_PROPOSED,
       COUNT_TRANSACTIONS_LOCAL_ROLLBACK
FROM performance_schema.replication_group_member_stats;
Alert when queues stay non-zero for more than a short window, or when rollback counts spike, which may indicate conflicts or overload.
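One way to turn that rule into code is to keep a short history of samples per member and evaluate it on every poll. The sketch below assumes queue sizes are sampled at a fixed interval and that the rollback counter is cumulative; the function names, the `factor` of 3 and the `min_delta` of 10 are illustrative placeholders, not recommended values.

```python
def sustained_nonzero(samples, min_consecutive):
    """True if the last `min_consecutive` queue samples are all above zero."""
    if len(samples) < min_consecutive:
        return False
    return all(value > 0 for value in samples[-min_consecutive:])


def rollback_spike(counter_samples, factor=3.0, min_delta=10):
    """True if the latest increase of a cumulative counter is unusually large.

    The latest delta is compared with the average of the earlier deltas.
    """
    deltas = [b - a for a, b in zip(counter_samples, counter_samples[1:])]
    if len(deltas) < 2:
        return False  # not enough history to form a baseline
    latest, earlier = deltas[-1], deltas[:-1]
    baseline = sum(earlier) / len(earlier)
    return latest >= min_delta and latest > factor * baseline
```

With a 10-second polling interval, `sustained_nonzero(queue_samples, 6)` would flag a queue that has not drained for a full minute. A server restart resets the counters and produces a negative delta, which this sketch simply treats as "no spike".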
Step 3: Track InnoDB Engine Health
Cluster-level health is meaningless if individual nodes are I/O bound or thrashing memory. Focus on:
- Buffer pool efficiency.
- Redo and flush pressure.
- Row lock contention.
Buffer pool and I/O
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%';
SHOW GLOBAL STATUS LIKE 'Innodb_data_reads';
SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';
Useful derived metrics:
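- Read hit ratio: 1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests, that is, how rarely a logical read request has to go to disk.
- Dirty pages fraction: Innodb_buffer_pool_pages_dirty / Innodb_buffer_pool_pages_total; a persistently high value suggests flush pressure.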
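As a minimal sketch, both derived metrics can be computed from a dictionary of SHOW GLOBAL STATUS values. How the dictionary is obtained (client library, exporter, and so on) is left open, and the function name is an assumption for illustration.

```python
def buffer_pool_metrics(status):
    """Derive hit ratio and dirty-page fraction from SHOW GLOBAL STATUS values.

    `status` maps variable names to values (strings or integers).
    """
    read_requests = int(status["Innodb_buffer_pool_read_requests"])  # logical reads
    disk_reads = int(status["Innodb_buffer_pool_reads"])             # reads that hit disk
    pages_total = int(status["Innodb_buffer_pool_pages_total"])
    pages_dirty = int(status["Innodb_buffer_pool_pages_dirty"])

    # Guard against division by zero on a freshly started server.
    hit_ratio = 1.0 - disk_reads / read_requests if read_requests else None
    dirty_fraction = pages_dirty / pages_total if pages_total else None
    return {"hit_ratio": hit_ratio, "dirty_fraction": dirty_fraction}
```

Because the read counters are cumulative since server start, computing the ratio over the difference between two samples usually reflects current behaviour better than the lifetime value.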
Pair these with OS metrics on RHEL/Rocky Linux:
sar -d 1 10 # disk utilisation and await
vmstat 1 10 # run queue, swap, I/O
iostat -x 1 10 # per-device latency and utilisation
Locking and contention
Contention can cause timeouts and replication stalls. Monitor:
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';
And look at active transactions when needed:
SELECT *
FROM information_schema.innodb_trx
ORDER BY trx_started;
Alert on sustained increases in lock waits or long-running transactions that threaten to block DDL or important writes.
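A long-running-transaction check can be sketched as a filter over rows read from information_schema.innodb_trx. The example assumes each row has been fetched as a dictionary with `trx_id` and a `trx_started` datetime; the 60-second default is a placeholder to tune per workload.

```python
from datetime import datetime, timedelta


def long_running(transactions, now, max_age=timedelta(seconds=60)):
    """Return (trx_id, age) pairs for transactions older than max_age, oldest first."""
    aged = [(t["trx_id"], now - t["trx_started"]) for t in transactions]
    too_old = [pair for pair in aged if pair[1] > max_age]
    return sorted(too_old, key=lambda pair: pair[1], reverse=True)
```

Passing `now` explicitly keeps the function easy to test and avoids mixing the monitoring host's clock with the database server's clock by accident.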
Step 4: Watch Router and Client Connectivity
MySQL Router is often overlooked in monitoring, yet it is the entry point for most applications. You want to know:
- Is the Router process up?
- Is it connected to a healthy cluster primary?
- Are connections being dropped or refused?
Basic Router checks
On RHEL/Rocky Linux, ensure systemd monitoring is in place:
systemctl status mysqlrouter
journalctl -u mysqlrouter -n 100
Enable metrics where available (for example, via the Router REST API, if your Router version provides it, or via its logs) and capture:
- Connection counts per route.
- Backend selection decisions (which node is primary at any moment).
- Errors resolving metadata or contacting metadata servers.
At minimum, log-scrape for events like “metadata cache update failed”, repeated connection retries, or backend failures.
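A log scraper along those lines might look like the sketch below. The patterns are hypothetical: exact message wording differs between Router versions, so they should be checked against real log output before being relied on.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; adjust them to the messages
# your Router version actually writes.
PATTERNS = {
    "metadata_failure": re.compile(r"metadata cache update failed", re.IGNORECASE),
    "connect_retry": re.compile(r"retry", re.IGNORECASE),
    "backend_failure": re.compile(r"can't connect|connection refused", re.IGNORECASE),
}


def scan_router_log(lines):
    """Count how many log lines match each named pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feeding it the lines returned by the journalctl command above and alerting when any count exceeds a small threshold per interval gives a basic signal even when no structured metrics are available.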
Step 5: Implement Periodic Health Checks
Monitoring is more than collecting metrics; you need active checks that answer the question “can we safely read and write right now?” A simple health check pattern:
- Resolve the current primary via Router or metadata.
- Run a lightweight write test to a dedicated health database/table.
- Verify the write is visible on at least one secondary within a tolerance window.
Example health table:
CREATE DATABASE IF NOT EXISTS cluster_health;
USE cluster_health;
CREATE TABLE IF NOT EXISTS heartbeat (
id INT PRIMARY KEY AUTO_INCREMENT,
ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
node VARCHAR(64) NOT NULL
) ENGINE=InnoDB;
Application-side health check (pseudo-logic):
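-- On the primary: write a heartbeat row
INSERT INTO cluster_health.heartbeat (node) VALUES ('health-check');
-- Record the id of the new row (and the GTID if needed)
SELECT LAST_INSERT_ID();
-- On a secondary: poll until that id appears, and fail after N seconds
SELECT id, ts FROM cluster_health.heartbeat WHERE id = <recorded id>;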
This verifies:
- Writes are accepted by the primary.
- Replication is progressing.
- Routers and load balancers are pointing to the right nodes.
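The write-then-verify loop can be sketched independently of any particular database driver. In the example below, `write_heartbeat` and `read_heartbeat` are hypothetical callables supplied by the caller: the first inserts a row through the primary and returns its id, the second reports whether that id is visible on a secondary.

```python
import time


def replication_health_check(write_heartbeat, read_heartbeat,
                             timeout=5.0, interval=0.2):
    """Write a heartbeat and wait for it to become visible on a secondary.

    Returns the number of seconds until the row was seen, or None if it
    did not appear within `timeout` seconds.
    """
    started = time.monotonic()
    row_id = write_heartbeat()           # fails loudly if the primary rejects writes
    while True:
        if read_heartbeat(row_id):       # row has replicated
            return time.monotonic() - started
        if time.monotonic() - started >= timeout:
            return None                  # tolerance window exceeded
        time.sleep(interval)
```

Returning the measured delay rather than a plain boolean lets the same check double as a coarse replication-latency metric.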
Operational principle: Health checks should prove that your assumptions about topology, replication, and routing still hold, not just that ports are open.
Step 6: Alerting and Dashboards
Once you know what to measure, define clear thresholds and aggregate views.
Practical alert suggestions
- Cluster topology: no primary, member_state != ONLINE, quorum loss.
- Replication: sustained non-zero transaction queues, spike in rollbacks, frequent member re-joins.
- Resources: high disk utilisation, low free memory, rising InnoDB I/O waits.
- Router: process down, repeated backend selection failures.
- Backups: backup job failures or excessive backup window duration.
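Suggestions like these are easier to review when they are kept as data rather than scattered through scripts. The sketch below shows one possible shape; the metric names, severities and thresholds (60 seconds, 85 percent) are invented placeholders, not recommendations.

```python
# Each rule: (severity, description, predicate over a metrics dictionary).
# All metric names and thresholds here are illustrative assumptions.
RULES = [
    ("critical", "no primary", lambda m: m["primary_count"] == 0),
    ("critical", "member not ONLINE", lambda m: m["members_not_online"] > 0),
    ("warning", "applier queue backlog", lambda m: m["queue_nonzero_seconds"] >= 60),
    ("warning", "disk utilisation high", lambda m: m["disk_util_percent"] >= 85),
    ("critical", "router down", lambda m: not m["router_up"]),
]


def evaluate(metrics, rules=RULES):
    """Return (severity, description) for every rule whose predicate fires."""
    return [(severity, name) for severity, name, predicate in rules
            if predicate(metrics)]
```

Keeping the rule list in one place also makes it straightforward to tighten or relax thresholds after a failover rehearsal.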
Dashboard layout idea
Organise dashboards by concern, not by data source:
- Top-level: cluster summary, primary node, member states, high-level QPS and latency.
- Replication: queues, flow control, commit latency, conflicts.
- InnoDB internals: buffer pool, redo, I/O, locks.
- Infrastructure: CPU, memory, disk, network per node.
Step 7: Test Failover and Observability Together
An InnoDB Cluster that has never been failover-tested is effectively unmonitored. Schedule controlled failover tests:
- Trigger a planned switchover using MySQL Shell (cluster.setPrimaryInstance()).
- Observe metrics during and after the event.
- Verify alerts fire as expected (primary change, role changes, temporary state transitions).
- Confirm application behaviour: retries, reconnects, and latencies.
Use these tests to refine which metrics are genuinely useful and which alerts are noisy.
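To compare what monitoring saw with what actually happened during a test, it helps to record which node was primary at each poll and summarise the changes afterwards. The sketch below assumes samples of the form (timestamp in seconds, primary address or None when no primary was visible); the function name and output format are illustrative.

```python
def primary_changes(samples):
    """Summarise primary changes from time-ordered (timestamp, primary) samples.

    Returns one dictionary per change with the old and new primary, the time
    the new primary was first seen, and the gap during which none was visible.
    """
    events = []
    last_primary = None
    lost_at = None
    for ts, primary in samples:
        if primary is None:
            # Remember when the known primary first disappeared.
            if last_primary is not None and lost_at is None:
                lost_at = ts
            continue
        if last_primary is not None and primary != last_primary:
            gap = ts - lost_at if lost_at is not None else 0
            events.append({"old": last_primary, "new": primary,
                           "at": ts, "gap": gap})
        last_primary = primary
        lost_at = None
    return events
```

The measured gap is only as precise as the polling interval, so it is best read as an upper bound on the window without a visible primary rather than an exact failover time.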
This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.
Effective MySQL InnoDB Cluster monitoring is about predictable behaviour under failure: understand your topology, watch Group Replication health, track InnoDB internals, and continuously test failover. With targeted metrics, practical health checks, and rehearsed procedures, you can detect issues early and keep your HA deployment reliable.

