Bug #54288

closed

rocksdb: Corruption: missing start of fragmented record

Added by Igor Fedotov about 2 years ago. Updated almost 2 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a kernel panic one of our OSDs is unable to start and shows:
rocksdb: [db/db_impl_open.cc:518] db/099134.log: dropping 1728 bytes; Corruption: missing start of fragmented record(2)

Actions #1

Updated by Igor Fedotov about 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Igor Fedotov

Investigation revealed that the RocksDB embedded in the Ceph Octopus release implicitly substitutes the default wal_recovery_mode value (kPointInTimeRecovery = 2) with the less reliable kTolerateCorruptedTailRecords (= 0) if recycle_log_file_num is non-zero. And a non-zero recycle_log_file_num is what we get by default in the bluestore_rocksdb_options parameter.
The relevant RocksDB code is in the SanitizeOptions() function in db/db_impl_open.cc:

if (result.recycle_log_file_num &&
    (result.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery ||
     result.wal_recovery_mode == WALRecoveryMode::kAbsoluteConsistency)) {
  // kPointInTimeRecovery is indistinguishable from
  // kTolerateCorruptedTailRecords in recycle mode since we define
  // the "end" of the log as the first corrupt record we encounter.
  // kAbsoluteConsistency doesn't make sense because even a clean
  // shutdown leaves old junk at the end of the log file.
  result.wal_recovery_mode = WALRecoveryMode::kTolerateCorruptedTailRecords;
}

One can see the actual values for both wal_recovery_mode and recycle_log_file_num by inspecting OSD log:
2022-02-14T11:58:58.846+0000 7f1fec229100 4 rocksdb: Options.wal_recovery_mode: 0
2022-02-14T11:58:58.846+0000 7f1fec229100 4 rocksdb: Options.recycle_log_file_num: 4
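
For reference, the numeric values printed above map to RocksDB's WALRecoveryMode enum (as defined in include/rocksdb/options.h):

enum class WALRecoveryMode : char {
  kTolerateCorruptedTailRecords = 0x00,  // what we end up with while recycle_log_file_num != 0
  kAbsoluteConsistency = 0x01,
  kPointInTimeRecovery = 0x02,           // RocksDB default
  kSkipAnyCorruptedRecords = 0x03,
};

So "Options.wal_recovery_mode: 0" above means the OSD is running with the less reliable kTolerateCorruptedTailRecords mode.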

And resetting the latter does the trick indeed: wal_recovery_mode stays at its default of 2, which permits the OSD to recover from the data corruption and proceed with startup:


2022-02-14T22:18:39.763+0000 7f8173b22100 4 rocksdb: Options.recycle_log_file_num: 0
2022-02-14T22:18:39.763+0000 7f8173b22100 4 rocksdb: Options.wal_recovery_mode: 2
...
2022-02-14T22:18:40.199+0000 7f8173b22100 3 rocksdb: [db/db_impl_open.cc:518] /persistent/db/099134.log: dropping 1728 bytes; Corruption: missing start of fragmented record(2)
2022-02-14T22:18:40.199+0000 7f8173b22100 4 rocksdb: [db/db_impl_open.cc:810] Point in time recovered to log #99134 seq #5023211213
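
For illustration, one possible way to apply this workaround (a sketch, not taken from this ticket) is to override bluestore_rocksdb_options for the OSDs with recycle_log_file_num set to 0 while keeping the rest of the release's default option string unchanged. The <other default rocksdb options> placeholder below stands for that remaining default string and has to be filled in for the release in use:

# hypothetical sketch: keep all other default options, change only recycle_log_file_num
ceph config set osd bluestore_rocksdb_options "<other default rocksdb options>,recycle_log_file_num=0"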

Actions #2

Updated by Igor Fedotov about 2 years ago

Interestingly, RocksDB from Pacific onward deals with the inconsistency between wal_recovery_mode and recycle_log_file_num a bit differently. This is how it looks in Quincy/master:
if (result.recycle_log_file_num &&
    (result.wal_recovery_mode ==
         WALRecoveryMode::kTolerateCorruptedTailRecords ||
     result.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery ||
     result.wal_recovery_mode == WALRecoveryMode::kAbsoluteConsistency)) {
  // - kTolerateCorruptedTailRecords is inconsistent with recycle log file
  //   feature. WAL recycling expects recovery success upon encountering a
  //   corrupt record at the point where new data ends and recycled data
  //   remains at the tail. However, `kTolerateCorruptedTailRecords` must fail
  //   upon encountering any such corrupt record, as it cannot differentiate
  //   between this and a real corruption, which would cause committed updates
  //   to be truncated -- a violation of the recovery guarantee.
  // - kPointInTimeRecovery and kAbsoluteConsistency are incompatible with
  //   recycle log file feature temporarily due to a bug found introducing a
  //   hole in the recovered data
  //   (https://github.com/facebook/rocksdb/pull/7252#issuecomment-673766236).
  //   Besides this bug, we believe the features are fundamentally compatible.
  result.recycle_log_file_num = 0;
}

So we implicitly get recycle_log_file_num set to zero in recent Ceph releases, which apparently confirms that we can use the same approach in Octopus.

Actions #3

Updated by Igor Fedotov about 2 years ago

  • Status changed from Triaged to Fix Under Review
Actions #4

Updated by Igor Fedotov about 2 years ago

  • Pull request ID set to 45040
Actions #5

Updated by Igor Fedotov about 2 years ago

Just to note: I've just seen another cluster suffering from this issue; it was recoverable through the proposed approach.

Actions #7

Updated by Igor Fedotov almost 2 years ago

  • Status changed from Fix Under Review to Resolved
Actions #8

Updated by Igor Fedotov almost 2 years ago

  • Status changed from Resolved to Rejected