Bug #48002

open

Compaction error: Corruption: block checksum mismatch:

Added by Jamin Collins over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I appear to have run into https://tracker.ceph.com/issues/37282 again.

Same AMD-based host:

$ grep model /proc/cpuinfo | tail -n 1
model name    : AMD Ryzen 7 3700X 8-Core Processor

$ uname -r
5.8.8-arch1-1
$ ceph --version
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
2020-10-26 08:55:52.458 7f707099b640  3 rocksdb: [db/db_impl_compaction_flush.cc:2659] Compaction error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873
2020-10-26 08:55:52.458 7f707099b640  4 rocksdb: (Original Log Time 2020/10/26-08:55:52.460238) [db/compaction_job.cc:751] [default] compacted to: files[6 3 27 41 0 0 0] max score 0.97, MB/sec: 326.9 rd, 82.8 wr, level 1, files in(6, 3) out(2) MB in(126.2, 134.2) out(66.0), read-write-amplify(2.6) write-amplify(0.5) Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873, records in: 1041295, records dropped: 185 output_compression: NoCompression

2020-10-26 08:55:52.458 7f707099b640  4 rocksdb: (Original Log Time 2020/10/26-08:55:52.460255) EVENT_LOG_v1 {"time_micros": 1603724152460248, "job": 3, "event": "compaction_finished", "compaction_time_micros": 835288, "compaction_time_cpu_micros": 378136, "output_level": 1, "num_output_files": 2, "total_output_size": 92272126, "num_input_records": 288841, "num_output_records": 288656, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [6, 3, 27, 41, 0, 0, 0]}
2020-10-26 08:55:52.458 7f707099b640  2 rocksdb: [db/db_impl_compaction_flush.cc:2209] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873, Accumulated background error counts: 1

I have left the OSD in this state (for now) in case any additional data needs to be gathered.

I don't know whether it is related, but the OSD failure seems to coincide with the host's logging volume filling up:

Oct 26 01:06:38 langhus-1 ceph-osd[310129]: 2020-10-26 01:06:38.676 7f2c921ec640 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873 code = 2 Rocksdb transaction:
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: Put( Prefix = P key = 0x00000000001c68ca'._info' Value size = 964)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f2c921ec640 time 2020-10-26 01:06:38.680185
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 11016: FAILED ceph_assert(r == 0)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x5603f293a777]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: 2020-10-26 01:06:38.676 7f2c921ec640 -1 /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f2c921ec640 time 2020-10-26 01:06:38.680185
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 11016: FAILED ceph_assert(r == 0)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x5603f293a777]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: *** Caught signal (Aborted) **
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  in thread 7f2c921ec640 thread_name:bstore_kv_sync
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (()+0x140f0) [0x7f2c9f2830f0]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (gsignal()+0x145) [0x7f2c9ed7d615]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (abort()+0x116) [0x7f2c9ed66862]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x5603f293a7d2]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  8: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  9: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: 2020-10-26 01:06:38.679 7f2c921ec640 -1 *** Caught signal (Aborted) **
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  in thread 7f2c921ec640 thread_name:bstore_kv_sync
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (()+0x140f0) [0x7f2c9f2830f0]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (gsignal()+0x145) [0x7f2c9ed7d615]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (abort()+0x116) [0x7f2c9ed66862]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x5603f293a7d2]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  8: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  9: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: problem writing to /var/log/ceph/ceph-osd.0.log: (28) No space left on device
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:     -2> 2020-10-26 01:06:38.676 7f2c921ec640 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873 code = 2 Rocksdb transaction:
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: Put( Prefix = P key = 0x00000000001c68ca'._info' Value size = 964)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:     -1> 2020-10-26 01:06:38.676 7f2c921ec640 -1 /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f2c921ec640 time 2020-10-26 01:06:38.680185
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 11016: FAILED ceph_assert(r == 0)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x5603f293a777]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:      0> 2020-10-26 01:06:38.679 7f2c921ec640 -1 *** Caught signal (Aborted) **
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  in thread 7f2c921ec640 thread_name:bstore_kv_sync
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (()+0x140f0) [0x7f2c9f2830f0]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (gsignal()+0x145) [0x7f2c9ed7d615]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (abort()+0x116) [0x7f2c9ed66862]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x5603f293a7d2]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  8: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  9: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: problem writing to /var/log/ceph/ceph-osd.0.log: (28) No space left on device

I keep the system logs (/var/log) on their own partition. Sadly, they were among the first things I removed while attempting to get the OSD restarted.
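Both the 01:06:38 crash and the 08:55:52 compaction error in the logs above report the same block (db/029531.sst, offset 20331766, size 3873) with the same expected and actual checksums. The following is a minimal sketch of how that could be checked across a larger set of log lines. The function name `corrupt_blocks` and the regular expression are my own assumptions, written to match the message text exactly as it appears in this report; they are not part of Ceph or RocksDB.

```python
import re

# Assumed pattern, derived only from the message text quoted in this report.
CORRUPTION_RE = re.compile(
    r"block checksum mismatch: expected (?P<expected>\d+), got (?P<got>\d+)\s+"
    r"in (?P<file>\S+) offset (?P<offset>\d+) size (?P<size>\d+)"
)


def corrupt_blocks(lines):
    """Return the distinct (file, offset, size, expected, got) tuples found."""
    found = set()
    for line in lines:
        for m in CORRUPTION_RE.finditer(line):
            found.add((
                m["file"],
                int(m["offset"]),
                int(m["size"]),
                int(m["expected"]),
                int(m["got"]),
            ))
    return found


# Two lines copied from the logs in this report: the compaction error
# and the earlier submit_common error.
sample = [
    "2020-10-26 08:55:52.458 7f707099b640  3 rocksdb: "
    "[db/db_impl_compaction_flush.cc:2659] Compaction error: Corruption: "
    "block checksum mismatch: expected 2320317871, got 95494035  "
    "in db/029531.sst offset 20331766 size 3873",
    "Oct 26 01:06:38 langhus-1 ceph-osd[310129]: 2020-10-26 01:06:38.676 "
    "7f2c921ec640 -1 rocksdb: submit_common error: Corruption: "
    "block checksum mismatch: expected 2320317871, got 95494035  "
    "in db/029531.sst offset 20331766 size 3873 code = 2 Rocksdb transaction:",
]

blocks = corrupt_blocks(sample)
print(blocks)
```

A single tuple in the result means every reported mismatch points at one block of one SST file, which suggests one damaged region rather than corruption spread across the database.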
