Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

track_and_verify_wals_in_manifest with sync=true creates hundreds of duplicate WalAddition entries #13192

Open
evanj opened this issue Dec 6, 2024 · 0 comments

Comments

@evanj
Copy link
Contributor

evanj commented Dec 6, 2024

Expected behavior

With track_and_verify_wals_in_manifest=true a WalAddition should only be recorded in the MANIFEST once when it is synced, no matter what write options we use.

Actual behavior

When writing with write_options.sync=true, the WalAddition is duplicated hundreds of times in the MANIFEST. This causes MANIFEST to get very large and causes additional useless disk IO. Raw output from ldb manifest_dump is at the bottom of this issue.

Steps to reproduce the behavior

I have created a unit test in a branch. See DBTestTrackWalCountRecords in db_basic_test.cc here: https://github.com/evanj/rocksdb/blob/evan.jones/test-track-with-sync/db/db_basic_test.cc#L4149-L4184

Workarounds

  • Don't set track_and_verify_wals_in_manifest=true. However, the RocksDB wiki recommends setting this to true: "We recommend to set track_and_verify_wals_in_manifest to true for production".
  • Set atomic_flush=true, which handles this better (it only duplicates the record twice instead of hundreds of times, see below).

Additional details

I attempted to debug this problem, and while I think I understand it, I do not know how to fix it.

My test executes many writes so RocksDB switches to a new WAL 3 times (4 total WAL files). It then counts the number of WalAddition records in MANIFEST. It tries the following options, with both atomic_flush=true and atomic_flush=false.

  • Writes with sync=false
  • Writes with sync=false followed by FlushWAL(true)
  • Writes with sync=true

The results are the following:

  • sync=false; atomic_flush=false : 0 records in MANIFEST (CORRECT): Nothing is flushed, so I think this is right.
  • sync=false; atomic_flush=true : 3 records in MANIFEST (CORRECT): Strictly speaking I'm not sure these records are needed, but this is correct.
  • sync=false + FlushWal(true); atomic_flush=*: 3 records in MANIFEST (CORRECT)
  • sync=true; atomic_flush=false: 194 records in MANIFEST (ERROR): This should be 3. Each record is replicated many times with the exact same values (see below)
  • sync=true; atomic_flush=true: 6 records in MANIFEST (ERROR): This should be 3. This is not as bad, since it only does this twice (see below)

My (maybe incorrect) analysis of the two incorrect cases:

sync=true; atomic_flush=false:

In the "correct" case when we use FlushWal(true), the WAL file is closed by SyncWalImpl, then removed from DBImpl::logs_ by MarkLogsSynced. Since the log is removed from logs_, it only gets added to the MANIFEST once.

In this "incorrect" case, the new log file is added to logs_ in SwitchMemtable, but the file itself is closed by a background task: BGWorkFlush which eventually calls FindObsoleteFiles, which actually closes the file. Util FindObsoleteFiles closes the WAL file, each write finds two files in logs_, so each call to Put then calls MarkLogsSynced, which decides it needs to add the WAL file to synced_wals here:

rocksdb/db/db_impl/db_impl.cc

Lines 1860 to 1864 in 1095810

// Inactive WAL
if (immutable_db_options_.track_and_verify_wals_in_manifest &&
wal.GetPreSyncSize() > 0) {
synced_wals->AddWal(wal.number, WalMetadata(wal.GetPreSyncSize()));
}

I'm not sure how to fix this. Maybe we need SwitchMemtable to close the log file, so this code in MarkLogsSynced only gets executed once? Or maybe we need this code in MarkLogsSynced to close the log file?

sync=true; atomic_flush=true: I think the problem is that the MANIFEST is updated FIRST by the Put(sync=true), then it is updated AGAIN by SyncWalImpl, which is called because atomic_flush=true. This is similar to the atomic_flush=false case, but less bad. I think if we fix the above case, it should fix this also.

Example MANIFEST dumps

sync=true; atomic_flush=false

Notice that the WalAdditions repeat many times. This is truncated output from the unit test, showing the MANIFEST records switching from one WAL to the next.

... many copies removed ...
{"EditNumber": 64, "PrevLogNumber": 0, "NextFileNumber": 10, "LastSeq": 0, "WalAdditions": [{"LogNumber": 4, "SyncedSizeInBytes": 57947}], "ColumnFamily": 0}
{"EditNumber": 65, "PrevLogNumber": 0, "NextFileNumber": 10, "LastSeq": 0, "WalAdditions": [{"LogNumber": 4, "SyncedSizeInBytes": 57947}], "ColumnFamily": 0}
{"EditNumber": 66, "PrevLogNumber": 0, "NextFileNumber": 10, "LastSeq": 0, "WalAdditions": [{"LogNumber": 4, "SyncedSizeInBytes": 57947}], "ColumnFamily": 0}
{"EditNumber": 67, "LogNumber": 8, "PrevLogNumber": 0, "NextFileNumber": 10, "LastSeq": 1880, "AddedFiles": [{"Level": 0, "FileNumber": 9, "FileSize": 33043, "SmallestIKey": "'k0' seq:1, type:1", "LargestIKey": "'k999' seq:1000, type:1", "OldestAncesterTime": 1733497012, "FileCreationTime": 1733497013, "EpochNumber": 1, "FileChecksum": "", "FileChecksumFuncName": "Unknown", "TailSize": 1228, "UserDefinedTimestampsPersisted": 1}], "ColumnFamily": 0}
{"EditNumber": 68, "PrevLogNumber": 0, "NextFileNumber": 10, "MinLogNumberToKeep": 8, "LastSeq": 1880, "WalDeletion": {"LogNumber": 8}, "ColumnFamily": 0}
{"EditNumber": 69, "PrevLogNumber": 0, "NextFileNumber": 10, "LastSeq": 1880, "WalAdditions": [{"LogNumber": 4, "SyncedSizeInBytes": 57947}], "ColumnFamily": 0}
{"EditNumber": 70, "PrevLogNumber": 0, "NextFileNumber": 12, "LastSeq": 1880, "WalAdditions": [{"LogNumber": 8, "SyncedSizeInBytes": 59616}], "ColumnFamily": 0}
{"EditNumber": 71, "PrevLogNumber": 0, "NextFileNumber": 12, "LastSeq": 1880, "WalAdditions": [{"LogNumber": 8, "SyncedSizeInBytes": 59616}], "ColumnFamily": 0}
{"EditNumber": 72, "PrevLogNumber": 0, "NextFileNumber": 12, "LastSeq": 1880, "WalAdditions": [{"LogNumber": 8, "SyncedSizeInBytes": 59616}], "ColumnFamily": 0}
{"EditNumber": 73, "PrevLogNumber": 0, "NextFileNumber": 12, "LastSeq": 1880, "WalAdditions": [{"LogNumber": 8, "SyncedSizeInBytes": 59616}], "ColumnFamily": 0}
... many copies removed ...

sync=true; atomic_flush=true

Notice that the WalAdditions are all exactly duplicated. This is the complete output from the unit test.

{"EditNumber": 0, "DB ID": "c3a81eee-cce5-41f9-b944-93b40f0d74ba", "ColumnFamily": 0}
{"EditNumber": 1, "ColumnFamily": 0}
{"EditNumber": 2, "Comparator": "leveldb.BytewiseComparator", "ColumnFamily": 0}
{"EditNumber": 3, "LogNumber": 0, "LastSeq": 0, "ColumnFamily": 0}
{"EditNumber": 4, "PrevLogNumber": 0, "NextFileNumber": 6, "LastSeq": 0, "ColumnFamily": 0}
{"EditNumber": 5, "PrevLogNumber": 0, "NextFileNumber": 9, "LastSeq": 0, "WalAdditions": [{"LogNumber": 4, "SyncedSizeInBytes": 57947}], "ColumnFamily": 0}
{"EditNumber": 6, "PrevLogNumber": 0, "NextFileNumber": 9, "LastSeq": 0, "WalAdditions": [{"LogNumber": 4, "SyncedSizeInBytes": 57947}], "ColumnFamily": 0}
{"EditNumber": 7, "PrevLogNumber": 0, "NextFileNumber": 9, "LastSeq": 0, "WalAdditions": [{"LogNumber": 4, "SyncedSizeInBytes": 57947}], "ColumnFamily": 0}
{"EditNumber": 8, "LogNumber": 8, "PrevLogNumber": 0, "NextFileNumber": 10, "LastSeq": 1880, "AddedFiles": [{"Level": 0, "FileNumber": 9, "FileSize": 33043, "SmallestIKey": "'k0' seq:1, type:1", "LargestIKey": "'k999' seq:1000, type:1", "OldestAncesterTime": 1733496195, "FileCreationTime": 1733496218, "EpochNumber": 1, "FileChecksum": "", "FileChecksumFuncName": "Unknown", "TailSize": 1228, "UserDefinedTimestampsPersisted": 1}], "ColumnFamily": 0}
{"EditNumber": 9, "PrevLogNumber": 0, "NextFileNumber": 10, "MinLogNumberToKeep": 8, "LastSeq": 1880, "WalDeletion": {"LogNumber": 8}, "ColumnFamily": 0}
{"EditNumber": 10, "PrevLogNumber": 0, "NextFileNumber": 11, "LastSeq": 1880, "WalAdditions": [{"LogNumber": 8, "SyncedSizeInBytes": 59616}], "ColumnFamily": 0}
{"EditNumber": 11, "PrevLogNumber": 0, "NextFileNumber": 11, "LastSeq": 1880, "WalAdditions": [{"LogNumber": 8, "SyncedSizeInBytes": 59616}], "ColumnFamily": 0}
{"EditNumber": 12, "LogNumber": 10, "PrevLogNumber": 0, "NextFileNumber": 12, "LastSeq": 3743, "AddedFiles": [{"Level": 0, "FileNumber": 11, "FileSize": 34166, "SmallestIKey": "'k1880' seq:1881, type:1", "LargestIKey": "'k3742' seq:3743, type:1", "OldestAncesterTime": 1733496195, "FileCreationTime": 1733496219, "EpochNumber": 2, "FileChecksum": "", "FileChecksumFuncName": "Unknown", "TailSize": 1251, "UserDefinedTimestampsPersisted": 1}], "ColumnFamily": 0}
{"EditNumber": 13, "PrevLogNumber": 0, "NextFileNumber": 12, "MinLogNumberToKeep": 10, "LastSeq": 3743, "WalDeletion": {"LogNumber": 10}, "ColumnFamily": 0}
{"EditNumber": 14, "PrevLogNumber": 0, "NextFileNumber": 13, "LastSeq": 3743, "WalAdditions": [{"LogNumber": 10, "SyncedSizeInBytes": 59840}], "ColumnFamily": 0}
{"EditNumber": 15, "PrevLogNumber": 0, "NextFileNumber": 13, "LastSeq": 3743, "WalAdditions": [{"LogNumber": 10, "SyncedSizeInBytes": 59840}], "ColumnFamily": 0}
{"EditNumber": 16, "LogNumber": 12, "PrevLogNumber": 0, "NextFileNumber": 14, "LastSeq": 5613, "AddedFiles": [{"Level": 0, "FileNumber": 13, "FileSize": 34286, "SmallestIKey": "'k3743' seq:3744, type:1", "LargestIKey": "'k5612' seq:5613, type:1", "OldestAncesterTime": 1733496218, "FileCreationTime": 1733496220, "EpochNumber": 3, "FileChecksum": "", "FileChecksumFuncName": "Unknown", "TailSize": 1249, "UserDefinedTimestampsPersisted": 1}], "ColumnFamily": 0}
{"EditNumber": 17, "PrevLogNumber": 0, "NextFileNumber": 14, "MinLogNumberToKeep": 12, "LastSeq": 5613, "WalDeletion": {"LogNumber": 12}, "ColumnFamily": 0}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant