Backing Up and Restoring an InnoDB Cluster Using MySQL Enterprise Backup

Undo logs are at the heart of how InnoDB delivers transactions, MVCC, and crash recovery. Understanding them helps you debug locking issues, tune long-running workloads, and avoid mysterious tablespace growth.

1. What undo logs are and why they exist

InnoDB undo logs record how to reverse changes made by a transaction. They are used for:

Rollback: undoing changes when a transaction rolls back.
MVCC: providing older row versions to consistent reads (e.g. REPEATABLE READ).
Crash recovery: rolling back incomplete transactions after a crash.

Conceptually, for each modified row, InnoDB writes an undo record describing the previous state. When a query needs an older version, InnoDB follows the undo chain backwards.

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Row v3      │◄────►│  Row v2      │◄────►│  Row v1      │
│ (current)    │      │ (undo log)   │      │ (undo log)   │
└──────────────┘      └──────────────┘      └──────────────┘

Each version is linked by transaction IDs and undo log pointers, allowing InnoDB to reconstruct the correct snapshot for each transaction.

2. Where undo logs live

Modern MySQL typically uses separate undo tablespaces:

innodb_undo_tablespaces: number of undo tablespaces.
innodb_undo_directory: directory for undo tablespace files (if not using system tablespace).

To check your current configuration:

SHOW VARIABLES LIKE 'innodb_undo%';

To see the physical files on a RHEL/Rocky Linux server (adjust paths as appropriate):

cd /var/lib/mysql
ls -1 ib_undo*

Each undo tablespace contains multiple undo log segments, and each segment belongs to a rollback segment that tracks active undo logs.

3. Step-by-step: what happens on UPDATE and COMMIT

3.1 UPDATE lifecycle

Transaction starts
```
START TRANSACTION;
```
Row is read and locked
```
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
```
InnoDB:
- Locates the row (id = 42).
- Acquires the appropriate lock (record + gap/next-key as needed).
Undo record is written
Before modifying the clustered index record, InnoDB writes an undo record with the previous values (e.g. old balance, old secondary index entries).
Data page is updated
The row is updated in the buffer pool, referencing the undo record as the previous version.
Redo log is written
A redo entry is written for both the data change and the undo log change. This allows crash recovery to restore consistency.

3.2 COMMIT lifecycle

Transaction marks commit
```
COMMIT;
```
InnoDB marks the transaction as committed in its transaction system.
Undo records become reusable (eventually)
Undo records cannot be deleted immediately; active transactions may still need older versions for consistent reads. Instead, they are marked as candidates for purge.
Purge thread cleans up later
A background purge process traverses undo logs, removes obsolete versions, and frees undo space.

4. MVCC and consistent reads using undo logs

Under REPEATABLE READ (default in many MySQL builds), a transaction sees a consistent snapshot of data taken at its first read. Undo logs are how InnoDB reconstructs that snapshot.

Consider two sessions:

Session A:                        Session B:
──────────                        ──────────
START TRANSACTION;                
SELECT balance                    
FROM accounts                     
WHERE id = 42;                    
                                  UPDATE accounts
                                  SET balance = balance - 100
                                  WHERE id = 42;
                                  COMMIT;

SELECT balance                    
FROM accounts                     
WHERE id = 42;

Session A will see the same balance in both selects, even though Session B committed a change in between. InnoDB does this by:

Remembering the snapshot transaction ID at the first read.
For each row, following undo records backwards until it finds a version visible to that snapshot.

More concurrent updates → more undo records → longer undo chains → more work to reconstruct snapshots.

5. The purge process: cleaning old versions

Purge is a background process that:

Scans undo logs for committed transactions whose row versions are no longer needed by any active snapshot.
Removes obsolete row versions and frees undo space.
Deletes secondary index entries for deleted rows.

If purge cannot keep up, you may see:

Growing undo tablespaces.
Longer undo chains → slower consistent reads.
Replication lag (slaves applying many row changes while purge is behind).

Key variables to inspect:

SHOW VARIABLES LIKE 'innodb_purge%';

Commonly relevant:

innodb_purge_threads: number of purge worker threads.
innodb_purge_batch_size: how many undo entries each purge batch processes.

Increasing purge threads can help on write-heavy systems, but test carefully; too many threads may increase contention.

6. Monitoring undo activity and issues

6.1 Inspecting undo tablespace size

On the filesystem (example path):

cd /var/lib/mysql
ls -lh ib_undo*

Inside MySQL, use INFORMATION_SCHEMA or performance_schema where available. For example:

SELECT FILE_NAME, TOTAL_EXTENTS, FREE_EXTENTS
FROM INFORMATION_SCHEMA.INNODB_TABLESPACES
WHERE FILE_NAME LIKE '%undo%';

If undo tablespaces grow and never shrink, it usually indicates sustained high write activity combined with long-running transactions or slow purge.

6.2 Detecting long-running transactions

Long transactions prevent purge from discarding old versions. On a busy system, this is a common root cause of undo bloat.

SELECT trx_id, trx_started, trx_state, trx_mysql_thread_id,
       trx_query
FROM INFORMATION_SCHEMA.INNODB_TRX
ORDER BY trx_started;

Look for transactions that have been running for minutes or hours while the workload is write-heavy.

6.3 Checking history length

Some versions expose undo history length via INNODB_METRICS or status output. One typical approach:

SHOW ENGINE INNODB STATUS\G

Look for lines mentioning history list length. A consistently increasing value indicates purge is not keeping up.

7. Best practices for healthy undo logs

7.1 Keep transactions short

Commit as soon as business logic allows; avoid holding transactions open across user think time or network calls.
Avoid long-running SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE unless truly necessary.
Batch work into smaller transactions where possible (e.g. 1k–10k rows per batch instead of millions).

7.2 Avoid unnecessary read locks

Use plain SELECT for reporting queries; do not add FOR UPDATE unless you plan to update those rows.
Consider using READ COMMITTED isolation for certain workloads to reduce undo pressure, if your consistency requirements allow it.

7.3 Tune purge carefully

On a write-heavy system, you may need more purge capacity.

Example (requires restart, adjust for your environment):

[mysqld]
innodb_purge_threads = 4
innodb_purge_batch_size = 300

Guidelines:

Increase innodb_purge_threads gradually (e.g. from 1 to 2, then 4) and monitor CPU and I/O.
Do not set extremely high batch sizes; it can cause bursts of I/O and latency spikes.

7.4 Plan undo tablespace layout

Use multiple undo tablespaces (innodb_undo_tablespaces) for better concurrency and manageability.
Place undo tablespaces on storage with good random I/O performance (same class as data files).
Monitor growth; if you see one-off spikes due to maintenance jobs, schedule those during low-traffic windows.

7.5 Be careful with bulk operations

Large DELETE or UPDATE operations can generate massive undo and stress purge.

Use batched deletes/updates with explicit limits:

-- WARNING: Potentially destructive. Test on non-production first.

SET @batch_size = 10000;
SET @rows_affected = 1;

WHILE @rows_affected > 0 DO
  DELETE FROM events
  WHERE created_at < NOW() - INTERVAL 90 DAY
  LIMIT @batch_size;
  SET @rows_affected = ROW_COUNT();
END WHILE;

Alternatively, perform bulk changes via partition management (dropping partitions) where appropriate.

8. Undo logs and replication

Undo logs are not replicated directly, but they influence replication behaviour:

On row-based replication, the slave applies row changes, generating its own undo logs.
If the slave has long-running transactions (e.g. reporting queries), purge may lag, increasing undo usage.
Heavy purge activity on the master can increase I/O load, indirectly affecting binary log flushing and replication delay.

Best practices:

Keep analytical/reporting workloads off the primary where possible (use replicas or dedicated analytics nodes).
Monitor INNODB_TRX on replicas as well; long reads there can also cause undo growth.
Ensure replicas have similar purge and undo configuration to avoid surprises.

9. Practical checklist

When you see growing undo tablespaces or MVCC-related performance issues, walk through this checklist:

Identify long transactions
- Query INFORMATION_SCHEMA.INNODB_TRX and resolve or terminate runaway sessions.
Inspect history length
- Use SHOW ENGINE INNODB STATUS to see if history list length is growing.
Review workload patterns
- Look for large bulk operations, periodic jobs, or new application features that increased write volume.
Evaluate purge configuration
- Consider modest increases to innodb_purge_threads and tune batch size.
Plan long-term layout
- Ensure undo tablespaces are on appropriate storage and sized for peak workloads.

This article offers general technical guidance. Validate all configurations in a safe environment before applying them to production.