3.4. Logging

3.4. LOGGING

To ensure consistency and durability, MongoDB employs a journaling approach and periodically triggers checkpoints at defined intervals. Crashes between checkpoints, however, may result in a loss of writes that have not yet been written to disk. Write-ahead logging (WAL) provides a way to make these writes durable.

3.4.1. Write-Ahead Logging (WAL)

Write-Ahead Logging (WAL) is a family of techniques designed to provide atomicity and durability (two of the ACID properties) in database systems. WAL involves wrapping information about the current write operation and storing it durably before confirming the write to the client application. A log sequence number (LSN) is usually associated with each logged write to establish the happen-before relation between logs. Log records are stored in a memory log buffer and are later synchronously written to non-volatile storage by the WAL protocol. Upon failure, a data recovery scheme like ARIES replays all logs in LSN order to reconstruct the state of the database immediately prior to the crash.

Name	Definition
`alloc_lsn`	Next log record allocation
`ckpt_lsn`	Last checkpoint
`sync_lsn`	Last record synced to disk
`write_lsn`	End of the last record written to the operating system
`write_start_lsn`	Start of the last record written to the operating system

WiredTiger uses B-trees to store data in volatile memory. A snapshot is a consistent, durable view of these B-trees, which is periodically written out to disk. Starting from version 3.6, MongoDB configures WiredTiger to create checkpoints (i.e., write the snapshot data to disk) every 60 seconds. To provide durability in the event of failure between checkpoints, WiredTiger uses WAL to maintain on-disk journal files.

WiredTiger creates one log record for each client-initiated write operation. A log record wraps all internal write operations to WiredTiger’s in-memory data structures caused by the application-initiated write. A log record consists of a 16-byte header and data. The first 4 bytes of the header contain the length of the record, and this length is always a multiple of 4 bytes.

MongoDB configures WiredTiger to buffer all log records up to 128 KB in an in-memory data structure called the slot. Slots are synchronously flushed to non-volatile storage every 100 milliseconds or upon a full-sync write, whichever comes first. A full-sync write is a write operation that requires its journal record to be flushed to non-volatile storage before returning, ensuring that the written data survives a crash. This provides the strictest durability, in contrast to non-sync writes where the data is recorded in a buffer in memory but is not guaranteed to be immediately written to non-volatile storage. After a full-sync write is issued by the client to the query executor, all records in the slot buffer must be synchronized to non-volatile storage to commit the write.

3.4.2. Checkpoints

In addition to journaling, MongoDB periodically initiates a checkpoint process at specific intervals (every 60 seconds) or when the volume of log data written reaches a threshold (2 GB).

Write Dirty Leaf Pages:
- Instead of overwriting existing data, new versions of dirty leaf pages are written into free space on disk.
Write Internal Pages, Including the Root:
- The internal pages, along with the root page, are written. Importantly, the old checkpoint remains valid until this process is complete.
Sync the File:
- The file containing the new pages is synced to ensure all changes are properly written to disk.
Update Metadata with New Root Address:
- The new root page’s address is written to the metadata. Once the metadata is durable, pages from old checkpoints can be freed.

After every checkpoint, all journal records whose writes happened before the checkpoint are automatically garbage-collected, freeing space for future journal records. In this sense, the journal can be thought of as a circular buffer for write operations. WiredTiger keeps track of the current checkpoint and the current starting point for WAL in the journal file.

3.4.3. Recovery

When recovering from a crash, WiredTiger first looks in the data files for the identifier of the last checkpoint, then searches the journal files for the record that matches this identifier. WiredTiger then reads log records one by one from the journal file:

It scans through the header to obtain metadata, such as the size of the entire record.
It reads through the data part of the log record according to the metadata.
After reading each record, it immediately applies it and continues to the next record until all records are consumed.

This structured approach ensures that MongoDB can recover reliably and efficiently from crashes, maintaining the integrity and durability of the data.