Bug #48002

Compaction error: Corruption: block checksum mismatch:

Added by Jamin Collins over 3 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I appear to have run into https://tracker.ceph.com/issues/37282 again.

Same AMD-based host:

$ grep model /proc/cpuinfo | tail -n 1
model name    : AMD Ryzen 7 3700X 8-Core Processor

$ uname -r
5.8.8-arch1-1
$ ceph --version
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
2020-10-26 08:55:52.458 7f707099b640  3 rocksdb: [db/db_impl_compaction_flush.cc:2659] Compaction error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873
2020-10-26 08:55:52.458 7f707099b640  4 rocksdb: (Original Log Time 2020/10/26-08:55:52.460238) [db/compaction_job.cc:751] [default] compacted to: files[6 3 27 41 0 0 0] max score 0.97, MB/sec: 326.9 rd, 82.8 wr, level 1, files in(6, 3) out(2) MB in(126.2, 134.2) out(66.0), read-write-amplify(2.6) write-amplify(0.5) Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873, records in: 1041295, records dropped: 185 output_compression: NoCompression

2020-10-26 08:55:52.458 7f707099b640  4 rocksdb: (Original Log Time 2020/10/26-08:55:52.460255) EVENT_LOG_v1 {"time_micros": 1603724152460248, "job": 3, "event": "compaction_finished", "compaction_time_micros": 835288, "compaction_time_cpu_micros": 378136, "output_level": 1, "num_output_files": 2, "total_output_size": 92272126, "num_input_records": 288841, "num_output_records": 288656, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [6, 3, 27, 41, 0, 0, 0]}
2020-10-26 08:55:52.458 7f707099b640  2 rocksdb: [db/db_impl_compaction_flush.cc:2209] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873, Accumulated background error counts: 1

I have left the OSD in this state (for now) in case any additional data needs to be gathered.

I don't know whether it is related at all, but the OSD failure seems to coincide with the host's logging volume filling up:

Oct 26 01:06:38 langhus-1 ceph-osd[310129]: 2020-10-26 01:06:38.676 7f2c921ec640 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873 code = 2 Rocksdb transaction:
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: Put( Prefix = P key = 0x00000000001c68ca'._info' Value size = 964)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f2c921ec640 time 2020-10-26 01:06:38.680185
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 11016: FAILED ceph_assert(r == 0)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x5603f293a777]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: 2020-10-26 01:06:38.676 7f2c921ec640 -1 /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f2c921ec640 time 2020-10-26 01:06:38.680185
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 11016: FAILED ceph_assert(r == 0)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x5603f293a777]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: *** Caught signal (Aborted) **
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  in thread 7f2c921ec640 thread_name:bstore_kv_sync
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (()+0x140f0) [0x7f2c9f2830f0]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (gsignal()+0x145) [0x7f2c9ed7d615]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (abort()+0x116) [0x7f2c9ed66862]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x5603f293a7d2]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  8: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  9: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: 2020-10-26 01:06:38.679 7f2c921ec640 -1 *** Caught signal (Aborted) **
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  in thread 7f2c921ec640 thread_name:bstore_kv_sync
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (()+0x140f0) [0x7f2c9f2830f0]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (gsignal()+0x145) [0x7f2c9ed7d615]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (abort()+0x116) [0x7f2c9ed66862]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x5603f293a7d2]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  8: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  9: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: problem writing to /var/log/ceph/ceph-osd.0.log: (28) No space left on device
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:     -2> 2020-10-26 01:06:38.676 7f2c921ec640 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2320317871, got 95494035  in db/029531.sst offset 20331766 size 3873 code = 2 Rocksdb transaction:
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: Put( Prefix = P key = 0x00000000001c68ca'._info' Value size = 964)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:     -1> 2020-10-26 01:06:38.676 7f2c921ec640 -1 /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f2c921ec640 time 2020-10-26 01:06:38.680185
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: /build/ceph/src/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 11016: FAILED ceph_assert(r == 0)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x5603f293a777]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:      0> 2020-10-26 01:06:38.679 7f2c921ec640 -1 *** Caught signal (Aborted) **
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  in thread 7f2c921ec640 thread_name:bstore_kv_sync
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  1: (()+0x140f0) [0x7f2c9f2830f0]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  2: (gsignal()+0x145) [0x7f2c9ed7d615]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  3: (abort()+0x116) [0x7f2c9ed66862]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x5603f293a7d2]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  5: (()+0x4ea961) [0x5603f293a961]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  6: (BlueStore::_kv_sync_thread()+0x1309) [0x5603f2f8a2f9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x5603f2fb17cd]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  8: (()+0x93e9) [0x7f2c9f2783e9]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  9: (clone()+0x43) [0x7f2c9ee40293]
Oct 26 01:06:38 langhus-1 ceph-osd[310129]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 01:06:38 langhus-1 ceph-osd[310129]: problem writing to /var/log/ceph/ceph-osd.0.log: (28) No space left on device

I have the system logs (/var/log) on their own partition. Sadly, those logs were among the first things I removed while attempting to get the OSD restarted.

Actions #1

Updated by Igor Fedotov over 3 years ago

Could you please provide the output of ceph-bluestore-tool's bluefs-bdev-sizes command? I'm wondering whether this OSD has any spillover.

Additionally, for the OSDs running on the same host (I presume you haven't restarted them for a while, have you?), please check the bluestore_reads_with_retries performance counter. Are there any non-zero values?
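
Something along these lines (socket paths and permissions may vary by deployment) should dump that counter for every OSD with an admin socket on the host:

$ for sock in /var/run/ceph/ceph-osd.*.asok; do \
      echo "$sock"; \
      sudo ceph daemon "$sock" perf dump | jq .bluestore.bluestore_reads_with_retries; \
  done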

I'm also wondering whether this failing OSD is exactly the same (in terms of underlying disks) as it was the previous time. In other words, are the "failing" disk(s) the same?

Actions #2

Updated by Jamin Collins over 3 years ago

Additionally, for the OSDs running on the same host (I presume you haven't restarted them for a while, have you?), please check the bluestore_reads_with_retries performance counter. Are there any non-zero values?

I restarted all OSDs, MONs, and MGRs yesterday. However, that was done several hours before this happened, because the cluster wasn't fully recovering from a storage node having been offline. The restarts did get the cluster to continue recovery, but I'm guessing that recovery also led to the log volume filling up, which appears to coincide with the OSD failure.

The other two OSDs on this host (5 & 10) both report 0 retries:

$ sudo ceph daemon osd.5 perf dump | jq .bluestore.bluestore_reads_with_retries
0

$ sudo ceph daemon osd.10 perf dump | jq .bluestore.bluestore_reads_with_retries
0

Could you please provide the output of ceph-bluestore-tool's bluefs-bdev-sizes command? I'm wondering whether this OSD has any spillover.

$ sudo ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-0
inferring bluefs devices from bluestore path
 slot 2 /var/lib/ceph/osd/ceph-0/block -> /dev/dm-0
 slot 1 /var/lib/ceph/osd/ceph-0/block.db -> /dev/dm-7
1 : device size 0x2800000000 : own 0x[2000~27ffffe000] = 0x27ffffe000 : using 0x11bcfe000(4.4 GiB)
2 : device size 0x3a381400000 : own 0x[1bf1f510000~25429d0000] = 0x25429d0000 : using 0x0(0 B)

I'm also wondering whether this failing OSD is exactly the same (in terms of underlying disks) as it was the previous time. In other words, are the "failing" disk(s) the same?

Yes, I believe it is the exact same drive as before (the serial numbers match). However, it should be noted that this drive has been working in the cluster without issue between the previous report and this one.

$ sudo hdparm -i /dev/sda

/dev/sda:

 Model=ST4000VX007-2DT166, FwRev=CV11, SerialNo=WDH1FSZY
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=7814037168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode
Actions #3

Updated by Igor Fedotov over 3 years ago

So there is no spillover to the main (HDD) device; hence the issue is most likely not related to that device.

Anything suspicious about the DB device? Have you run smartctl against it?
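
For example, something like the following (assuming the DB device is /dev/sdc) would show the overall health assessment plus the attributes worth checking, such as reallocated sectors, wear level, and media errors:

$ sudo smartctl -a /dev/sdc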

Actions #4

Updated by Jamin Collins over 3 years ago

Nothing suspicious about it either. It's the DB device for all the OSDs on that host and is the same as in the previous report.

$ sudo hdparm -i /dev/sdc

/dev/sdc:

 Model=Samsung SSD 850 EVO mSATA 1TB, FwRev=32101030, SerialNo=S33FNX0J100209D
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=yes: disabled (255) WriteCache=disabled
 Drive conforms to: Unspecified:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode
SMART overall-health self-assessment test result: PASSED

What about the error coinciding precisely with the log volume filling? Any chance that's the issue?
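
For what it's worth, here is roughly how I would check whether journald is what filled the partition, and cap it going forward (assuming systemd-journald writes the bulk of /var/log; the 500M limit is just an example):

$ journalctl --disk-usage
$ df -h /var/log
$ sudo journalctl --vacuum-size=500M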

Actions #5

Updated by Jamin Collins over 3 years ago

Also, should I continue to keep the OSD in its failed state? Is there any information that can be retrieved from it to help pin down the issue?

Actions #6

Updated by Igor Fedotov over 3 years ago

Jamin Collins wrote:

What about the error coinciding precisely with the log volume filling? Any chance that's the issue?

Hard to say, but it certainly might have been the initial trigger.

Potentially, the first OSD crash could have been triggered by the log volume filling up. As far as I understand, we can't verify this since we don't have the corresponding logs. That crash could in turn have corrupted RocksDB, which would mean there is a bug around DB data consistency during disk writes, since an unexpected process crash should not be able to break DB consistency.
On the other hand, the fact that you hit the same issue before (or at least a similar one, since there were additionally csum failures on the main device that time) suggests that the full log volume might be unrelated.

Actions #7

Updated by Igor Fedotov over 3 years ago

Jamin Collins wrote:

Also, should I continue to keep the OSD in its failed state? Is there any information that can be retrieved from it to help pin down the issue?

At the moment I don't see anything else one can retrieve from this daemon, but I'd suggest keeping it for an additional 2-3 days, just in case...

Actions #8

Updated by Igor Fedotov over 3 years ago

Jamin Collins wrote:

Nothing suspicious about it either. It's the DB device for all the OSDs on that host and is the same as in the previous report.

Model=Samsung SSD 850 EVO mSATA 1TB, FwRev=32101030, SerialNo=S33FNX0J100209D

Ooh... a consumer-grade SSD... I'm not saying that's the root cause of the data corruption, but generally it is very bad practice to use such drives in Ceph, particularly for the DB. Personally, I saw a dramatic performance drop in my lab while attempting to use a Samsung 960 Pro: due to the lack of power-loss protection it barely handles the direct synchronized writes that RocksDB/BlueFS use heavily. The ceph-users mailing list has plenty of relevant information discouraging the use of such drives as well.

I'm wondering whether you have similar drive(s) in the other nodes?

Actions #9

Updated by Jamin Collins over 3 years ago

At the moment I don't see anything else one can retrieve from this daemon, but I'd suggest keeping it for an additional 2-3 days, just in case...

Will do.

Ooh... a consumer-grade SSD... I'm not saying that's the root cause of the data corruption, but generally it is very bad practice to use such drives in Ceph,

You indicated the same in the previous report. All the nodes have consumer-grade SSDs, and the others are from less reputable manufacturers. Only this node has experienced this issue.

Actions #10

Updated by Jamin Collins over 3 years ago

I'm planning to zap and rebuild the OSD (osd.0) this weekend. Please let me know if there's any information you'd like from it beforehand.
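
For reference, I expect the rebuild to be roughly the usual ceph-volume sequence, something like the sketch below (the DB LV name is a placeholder and the exact steps may differ):

$ sudo systemctl stop ceph-osd@0
$ sudo ceph osd destroy 0 --yes-i-really-mean-it
$ sudo ceph-volume lvm zap --destroy /dev/sda              # data drive
$ sudo ceph-volume lvm zap <vg>/<db-lv>                    # block.db LV on /dev/sdc
$ sudo ceph-volume lvm create --osd-id 0 --data /dev/sda --block.db <vg>/<db-lv>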

Actions #11

Updated by Jamin Collins over 3 years ago

I will start the recreation of osd.0 tomorrow (roughly 10 hours from now) and will check this bug report before doing so.

Actions #12

Updated by Jamin Collins over 3 years ago

Starting the rebuild of osd.0.
