Bug #54288

closed

rocksdb: Corruption: missing start of fragmented record

Added by Igor Fedotov about 2 years ago. Updated almost 2 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a kernel panic one of our OSDs is unable to start and shows:
rocksdb: [db/db_impl_open.cc:518] db/099134.log: dropping 1728 bytes; Corruption: missing start of fragmented record(2)

Actions #1

Updated by Igor Fedotov about 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Igor Fedotov

Investigation revealed that the RocksDB embedded in the Ceph Octopus release implicitly substitutes the default wal_recovery_mode value (kPointInTimeRecovery = 2) with the less reliable kTolerateCorruptedTailRecords (= 0) if recycle_log_file_num is non-zero. And a non-zero recycle_log_file_num is what we get by default in the bluestore_rocksdb_options parameter.
The relevant RocksDB code is in the SanitizeOptions() function in db/db_impl_open.cc:

if (result.recycle_log_file_num &&
    (result.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery ||
     result.wal_recovery_mode == WALRecoveryMode::kAbsoluteConsistency)) {
  // kPointInTimeRecovery is indistinguishable from
  // kTolerateCorruptedTailRecords in recycle mode since we define
  // the "end" of the log as the first corrupt record we encounter.
  // kAbsoluteConsistency doesn't make sense because even a clean
  // shutdown leaves old junk at the end of the log file.
  result.wal_recovery_mode = WALRecoveryMode::kTolerateCorruptedTailRecords;
}

One can see the actual values for both wal_recovery_mode and recycle_log_file_num by inspecting OSD log:
2022-02-14T11:58:58.846+0000 7f1fec229100 4 rocksdb: Options.wal_recovery_mode: 0
2022-02-14T11:58:58.846+0000 7f1fec229100 4 rocksdb: Options.recycle_log_file_num: 4
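
For reference, the numeric values printed above map to RocksDB's WALRecoveryMode enum (as defined in include/rocksdb/options.h):

enum class WALRecoveryMode : char {
  kTolerateCorruptedTailRecords = 0x00,  // what we end up with while recycle_log_file_num != 0
  kAbsoluteConsistency = 0x01,
  kPointInTimeRecovery = 0x02,           // RocksDB default
  kSkipAnyCorruptedRecords = 0x03,
};

So "Options.wal_recovery_mode: 0" above means the OSD is running with the less reliable kTolerateCorruptedTailRecords mode.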

And resetting the latter does the trick indeed: wal_recovery_mode stays at its default of 2, which permits the OSD to recover from the data corruption and proceed with startup:


2022-02-14T22:18:39.763+0000 7f8173b22100 4 rocksdb: Options.recycle_log_file_num: 0
2022-02-14T22:18:39.763+0000 7f8173b22100 4 rocksdb: Options.wal_recovery_mode: 2
...
2022-02-14T22:18:40.199+0000 7f8173b22100 3 rocksdb: [db/db_impl_open.cc:518] /persistent/db/099134.log: dropping 1728 bytes; Corruption: missing start of fragmented record(2)
2022-02-14T22:18:40.199+0000 7f8173b22100 4 rocksdb: [db/db_impl_open.cc:810] Point in time recovered to log #99134 seq #5023211213
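
For illustration, one possible way to apply this workaround (a sketch, not taken from this ticket) is to override bluestore_rocksdb_options for the OSDs with recycle_log_file_num set to 0 while keeping the rest of the release's default option string unchanged. The <other default rocksdb options> placeholder below stands for that remaining default string and has to be filled in for the release in use:

# hypothetical sketch: keep all other default options, change only recycle_log_file_num
ceph config set osd bluestore_rocksdb_options "<other default rocksdb options>,recycle_log_file_num=0"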

Actions #2

Updated by Igor Fedotov about 2 years ago

Interestingly, RocksDB from Pacific onward deals with the inconsistency between wal_recovery_mode and recycle_log_file_num a bit differently. This is how it looks in Quincy/master:
if (result.recycle_log_file_num &&
    (result.wal_recovery_mode ==
         WALRecoveryMode::kTolerateCorruptedTailRecords ||
     result.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery ||
     result.wal_recovery_mode == WALRecoveryMode::kAbsoluteConsistency)) {
  // - kTolerateCorruptedTailRecords is inconsistent with recycle log file
  //   feature. WAL recycling expects recovery success upon encountering a
  //   corrupt record at the point where new data ends and recycled data
  //   remains at the tail. However, `kTolerateCorruptedTailRecords` must fail
  //   upon encountering any such corrupt record, as it cannot differentiate
  //   between this and a real corruption, which would cause committed updates
  //   to be truncated -- a violation of the recovery guarantee.
  // - kPointInTimeRecovery and kAbsoluteConsistency are incompatible with
  //   recycle log file feature temporarily due to a bug found introducing a
  //   hole in the recovered data
  //   (https://github.com/facebook/rocksdb/pull/7252#issuecomment-673766236).
  //   Besides this bug, we believe the features are fundamentally compatible.
  result.recycle_log_file_num = 0;
}

So we implicitly get recycle_log_file_num set to zero in recent Ceph releases, which apparently confirms that we can use the same approach in Octopus.

Actions #3

Updated by Igor Fedotov about 2 years ago

  • Status changed from Triaged to Fix Under Review
Actions #4

Updated by Igor Fedotov about 2 years ago

  • Pull request ID set to 45040
Actions #5

Updated by Igor Fedotov about 2 years ago

Just to note: I've just seen another cluster suffering from this issue; it was recoverable through the proposed approach.

Actions #7

Updated by Igor Fedotov almost 2 years ago

  • Status changed from Fix Under Review to Resolved
Actions #8

Updated by Igor Fedotov almost 2 years ago

  • Status changed from Resolved to Rejected