Understanding Quorum and Voting in MySQL Group Replication

Quorum and voting are at the core of how MySQL Group Replication keeps your data consistent and your cluster highly available. Understanding them is essential if you run production MySQL with Group Replication or InnoDB Cluster.

What quorum means in Group Replication

Group Replication is a distributed system. Each server (member) participates in decisions about which transactions are accepted and which members are allowed to serve writes. Quorum is the minimum number of members that must agree to move forward safely.

For most deployments you should think in terms of a simple rule:

A majority of members must be healthy and communicating for the group to accept writes.
If the group loses majority, it must stop writes to avoid split-brain.

With N members, the minimum majority is:

majority = floor(N / 2) + 1

Examples:

3 members: majority = 2
5 members: majority = 3
4 members: majority = 3 (tolerates only 1 failure, the same as 3 members)
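
The formula can be checked with a few lines of Python. This is only an illustration of the arithmetic; the function name majority is chosen here and is not part of MySQL.

```python
def majority(group_size: int) -> int:
    """Return the smallest member count that is a strict majority of the group."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return group_size // 2 + 1  # integer division implements floor(N / 2)


for n in (3, 4, 5):
    print(f"{n} members: majority = {majority(n)}")
```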

How voting works in Group Replication

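Group Replication uses a group communication layer and a Paxos-based consensus protocol under the hood. Each member has one vote. When a transaction reaches commit, its changes are broadcast to the group, and a majority of members must agree on its position in the global order before any member can certify and commit it. This applies in both single-primary and multi-primary mode. Agreement on ordering does not mean the change has been applied everywhere: with the default consistency settings, other members may apply the transaction slightly later than the member that received it.

At a high level:

Client sends a transaction to the primary.
At commit, the primary broadcasts the transaction's write set to the group.
A majority of members agree on the transaction's position in the global order.
Each member certifies the transaction against concurrent changes; if it passes, the primary commits it and the other members queue it to be applied.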

The same voting principle is used when deciding which members form the primary partition after a network issue.

Quorum in common failure scenarios

3-node cluster example

Consider three MySQL instances in a single-primary Group Replication setup:

   [A] --- [B] --- [C]

Normal state: all three are healthy and connected. Majority is 2.

Loss of one node: if C fails, A and B are 2 of 3 members, which meets the majority of 2, so the group keeps running. If the primary is one of the survivors, it stays primary; if the failed node was the primary, the remaining two elect a new one.
Network split 1+2: if A is isolated and B+C can talk, B+C hold 2 of 3 votes and continue as the primary partition; A loses quorum and must not accept writes.
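
The two scenarios above follow the same check, which can be sketched in Python. This is a simplified model, not Group Replication's implementation: partitions are plain sets of member names, and partition_with_quorum is a hypothetical helper name.

```python
def partition_with_quorum(partitions, group_size):
    """Return the one partition holding a majority of group_size, or None."""
    needed = group_size // 2 + 1
    for members in partitions:
        if len(members) >= needed:
            return members  # at most one partition can reach the threshold
    return None


# Network split 1+2 from the example: A alone, B and C together.
survivors = partition_with_quorum([{"A"}, {"B", "C"}], group_size=3)
print(sorted(survivors))  # ['B', 'C']
```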

This is why 3 nodes is the minimum practical size for a production group.

4-node cluster and why even numbers are tricky

With four members, majority is 3. A 2+2 split is the problem case:

   [A] --- [B]    [C] --- [D]

In a perfect 2+2 split, neither side has 3 votes. Correct behaviour is that both sides stop writes. This is safe but not highly available. You paid for 4 servers and in this scenario you cannot write on either side.

For this reason, and because 4 members tolerate only 1 failure (the same as 3), most production deployments use 3 or 5 members instead of 4.
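
To see which group sizes are exposed to this kind of stalemate, the sketch below lists every two-way split in which neither side reaches a majority. It is a counting exercise only; deadlocked_splits is a name invented for this example.

```python
def deadlocked_splits(group_size):
    """Two-way splits (a, b), a <= b, where neither side holds a majority."""
    needed = group_size // 2 + 1
    return [
        (a, group_size - a)
        for a in range(1, group_size // 2 + 1)
        if group_size - a < needed  # the larger side still falls short
    ]


for n in (3, 4, 5, 6):
    print(n, deadlocked_splits(n))
```

Only even group sizes produce a non-empty list, which is the arithmetic behind preferring odd member counts.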

Step-by-step: designing a safe topology

Step 1: Choose the number of members

General guidance:

3 members: common for most workloads. Can tolerate 1 failure.
5 members: for higher fault-tolerance (can tolerate 2 failures) at the cost of more replication overhead.
Avoid 2 members: cannot form a majority on failure; requires additional mechanisms (like an arbitrator) which Group Replication does not provide natively.
Avoid 4 members: tolerates only 1 failure, the same as 3, and a 2+2 split leaves no majority on either side.
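
The guidance above comes down to how many members can fail before the rest no longer form a majority. A minimal sketch of that calculation, assuming simultaneous failures and using an illustrative function name:

```python
def tolerated_failures(group_size: int) -> int:
    """Members that can fail at once while the rest still form a majority."""
    majority = group_size // 2 + 1
    return group_size - majority


# A Group Replication group can have at most 9 members.
for n in range(2, 10):
    print(f"{n} members: tolerates {tolerated_failures(n)} failure(s)")
```

Adding a fourth member to a three-member group does not raise the number of tolerated failures; only the fifth member does.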

Step 2: Plan network placement

Quorum is about connectivity as much as node count. Place nodes so that the most likely failure pattern still leaves a majority in the most useful location.

For three nodes across two data centres, prefer 2+1 placement, with 2 in the primary DC and 1 in the secondary.
For three availability zones, 1+1+1 works well, but be aware that loss of any two zones will stop writes.

Example: 3 nodes, 2 DCs (DC1 is primary site)

DC1: [A] [B]
DC2: [C]

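If DC2 is lost, A+B still hold a 2-of-3 majority and writes continue. If DC1 is lost, only C remains (1 of 3) and it cannot accept writes. This is the intended behaviour: C cannot tell whether DC1 has failed or is merely unreachable, so it must assume the majority may still be running.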
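
A placement can be checked before deployment by removing each site in turn and testing whether the remaining members still form a majority. The sketch below assumes a site fails as a whole; the dictionary layout and the function name quorum_after_site_loss are illustrative.

```python
def quorum_after_site_loss(placement):
    """Map each site to True if the group keeps a majority when that site is lost."""
    total = sum(len(nodes) for nodes in placement.values())
    needed = total // 2 + 1
    return {
        site: (total - len(nodes)) >= needed
        for site, nodes in placement.items()
    }


two_dcs = {"DC1": ["A", "B"], "DC2": ["C"]}
print(quorum_after_site_loss(two_dcs))  # {'DC1': False, 'DC2': True}
```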

Step 3: Decide single-primary vs multi-primary

Group Replication can run in:

single-primary mode: one writable primary, others are read-only.
multi-primary mode: all members are writable, with conflict detection.

Quorum rules apply in both modes, but operationally single-primary is simpler:

Easier to reason about which node is the writer during failover.
Less risk of conflicts, especially with non-deterministic operations.

Most production deployments that prioritise correctness and simplicity use single-primary.

How Group Replication chooses the primary partition

When a network issue occurs, the group can split into several partitions. Only one partition is allowed to continue as the primary partition and accept writes.

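The selection is based on:

Partition size: the partition must contain a majority of the members in the last agreed group view.
Uniqueness: two disjoint partitions cannot both contain a majority, so at most one partition qualifies and no further tie-break is needed.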

From an operator perspective, the key rules are:

Only one partition with majority will remain writable.
Partitions without majority must not be used for writes, even if they are technically reachable.

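Never force writes on a minority partition; this can create split-brain and data divergence that is extremely hard to fix. The group_replication_force_members variable exists for recovering from a permanent loss of majority, but it should only be used after confirming that every excluded member is really shut down.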

Operational best practice

Monitoring quorum and membership

To operate Group Replication safely, monitor which members are ONLINE and which partition is primary. Useful queries include:

SELECT *
FROM performance_schema.replication_group_members
ORDER BY MEMBER_ID;

SELECT *
FROM performance_schema.replication_group_member_stats
WHERE CHANNEL_NAME = 'group_replication_applier';

Key fields to watch:

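MEMBER_STATE: ONLINE, RECOVERING, OFFLINE, ERROR, or UNREACHABLE. UNREACHABLE means the local failure detector suspects the member, and is the first sign that quorum may be at risk.
MEMBER_ROLE: PRIMARY or SECONDARY (in single-primary mode).
COUNT(*) of ONLINE members: should equal your expected group size; alert as soon as it drops.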
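
A monitoring script can turn the rows of replication_group_members into a few alertable values. The sketch below assumes the rows have already been fetched as dictionaries keyed by column name. As a conservative simplification it treats only ONLINE members as healthy, and the function name and return keys are made up for this example.

```python
def summarize_membership(rows, expected_size):
    """Summarize one member's view of the group for alerting purposes."""
    online = [r for r in rows if r["MEMBER_STATE"] == "ONLINE"]
    needed = len(rows) // 2 + 1  # majority of the members in the current view
    return {
        "online": len(online),
        "majority_needed": needed,
        "has_majority": len(online) >= needed,
        "below_expected": len(online) < expected_size,
        "primaries": [r["MEMBER_HOST"] for r in online if r["MEMBER_ROLE"] == "PRIMARY"],
    }


# Hypothetical rows: db3 has become unreachable.
rows = [
    {"MEMBER_HOST": "db1", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "PRIMARY"},
    {"MEMBER_HOST": "db2", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "SECONDARY"},
    {"MEMBER_HOST": "db3", "MEMBER_STATE": "UNREACHABLE", "MEMBER_ROLE": "SECONDARY"},
]
print(summarize_membership(rows, expected_size=3))
```

In this example the group still has a majority, but below_expected is true, which is the condition worth alerting on before a second failure removes quorum.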

Configuration points related to quorum

Core Group Replication settings

Base configuration elements (simplified example):

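[mysqld]
# Must be unique for every member
server_id=1
log_bin=binlog
binlog_format=ROW
# Group Replication requires GTID-based replication
gtid_mode=ON
enforce_gtid_consistency=ON
# XXHASH64 is the default in MySQL 8.0; the variable is deprecated in later
# 8.0 releases, so omit this line if your version no longer accepts it
transaction_write_set_extraction=XXHASH64
plugin_load_add='group_replication.so'

# Group Replication
loose-group_replication_group_name='aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'
loose-group_replication_start_on_boot=OFF
# This member's own address for group communication
loose-group_replication_local_address='10.0.0.1:33061'
loose-group_replication_group_seeds='10.0.0.1:33061,10.0.0.2:33061,10.0.0.3:33061'
# Set to ON only once, on one member, when creating the group
loose-group_replication_bootstrap_group=OFF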

These settings do not directly set quorum size; quorum is derived from the number of members in the current group view. However, they are essential for stable membership and communication.

Flow control and performance vs safety

Quorum is about safety, but you must also consider performance. If one or more members are very slow, they can affect commit latency. Some relevant settings:

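group_replication_flow_control_mode: QUOTA (the default) or DISABLED. QUOTA throttles writers when members fall too far behind in certifying or applying transactions.
group_replication_member_expel_timeout: the extra time, in seconds, that the group waits after a member is first suspected of failure (following an initial 5-second detection period) before expelling it. Until it is expelled, an unreachable member still counts towards the group size used for quorum.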

Expelling a member reduces the group size, and the majority threshold is recalculated from the new size. For example, in a 3-node group, if one member is expelled, the group size becomes 2 and the majority stays at 2, so the group keeps running but cannot tolerate any further failure. You should monitor expulsions, as frequent changes in group size can indicate instability.

Operational best practices for quorum and voting

Prefer odd numbers of members (3 or 5) to avoid ambiguous splits.
Do not run production with only 2 members unless you fully understand the limitations and accept that loss of one node means loss of writes.
Monitor membership and alert if the number of ONLINE members drops below expected.
Automate failover carefully: ensure your orchestrator or router only promotes members in the majority partition.
Test failure scenarios (node crash, network partition, DC failure) in a non-production environment regularly.

Example: reasoning about a failure

Given a 3-node single-primary group: A (primary), B, C. All are ONLINE.

Network issue isolates A from B and C. B and C can still talk to each other.
B and C form a 2-member partition with majority (2/3). A is alone (1/3).
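Once A has been expelled, B and C elect a new primary. Among members running the same MySQL version, the one with the highest group_replication_member_weight wins, with ties broken by the lowest server_uuid.
A loses quorum and must not accept writes. Transactions sent to A block rather than commit; setting group_replication_unreachable_majority_timeout makes A give up after that many seconds and move to the ERROR state.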
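
The election step in this scenario can be approximated in a few lines. This is a simplified sketch: it assumes all candidates run the same MySQL version, so only the member weight and the server UUID matter. The dictionary keys and the UUID values are invented for the example.

```python
def pick_primary(candidates):
    """Highest weight wins; ties go to the lexicographically lowest UUID."""
    return min(candidates, key=lambda m: (-m["weight"], m["uuid"]))


# Hypothetical surviving members B and C with equal weights.
survivors = [
    {"name": "C", "weight": 50, "uuid": "1b2c0000-0000-0000-0000-000000000003"},
    {"name": "B", "weight": 50, "uuid": "0a1b0000-0000-0000-0000-000000000002"},
]
print(pick_primary(survivors)["name"])  # B
```

Raising the weight of a member is the usual way to make the outcome predictable instead of leaving it to UUID order.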

From an application perspective:

Your router or proxy should route writes to the new primary in the B+C partition.
Health checks must ensure that A is not treated as writable while isolated.
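
A health check that looks only at MEMBER_ROLE is not enough, because an isolated former primary can still report itself as PRIMARY while its peers show as UNREACHABLE. The sketch below combines role, state, and majority, using the membership rows as seen from the node being checked. The function name and row layout are assumptions for illustration, and counting only ONLINE members is a deliberately conservative choice.

```python
def safe_to_route_writes(view, host):
    """True only if host is an ONLINE primary whose own view holds a majority."""
    needed = len(view) // 2 + 1
    online = sum(1 for r in view if r["MEMBER_STATE"] == "ONLINE")
    me = next((r for r in view if r["MEMBER_HOST"] == host), None)
    return (
        me is not None
        and me["MEMBER_STATE"] == "ONLINE"
        and me["MEMBER_ROLE"] == "PRIMARY"
        and online >= needed
    )


# The view reported by isolated node A: it still believes it is primary.
view_from_a = [
    {"MEMBER_HOST": "A", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "PRIMARY"},
    {"MEMBER_HOST": "B", "MEMBER_STATE": "UNREACHABLE", "MEMBER_ROLE": "SECONDARY"},
    {"MEMBER_HOST": "C", "MEMBER_STATE": "UNREACHABLE", "MEMBER_ROLE": "SECONDARY"},
]
print(safe_to_route_writes(view_from_a, "A"))  # False
```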

This example shows how quorum and voting protect you from split-brain, but only if your application and routing layer respect the group state.

This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.

Understanding quorum and voting in MySQL Group Replication lets you design safer topologies, reason about failure modes, and avoid split-brain. With the right number of nodes, careful placement, and proper monitoring, you can achieve predictable, robust high availability for your MySQL workloads.