Bug #49285

OSD Crash: Compaction error: Corruption: block checksum mismatch

Added by Chris K about 3 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1): 3f2f4fea687916dff0a05a59abd2ae83e75f4c927654b7e877235e79df538049
Crash signature (v2):

Description

We're encountering periodic OSD crashes during test runs on some new hardware.

For this report I'm using osd.101 as a case study; other OSDs have crashed the same way. Since December 18 there have been 193 crash events, 65 of which match this stack signature. Another 56 have identical backtraces, though they occurred on kernel 4.15.0-128 and ceph version 15.2.8. Ignoring the version differences, it's safe to say that 121 of the 193 OSD crash events are the same.
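
For reference, counts like the ones above can be pulled from the cluster's crash module; a minimal sketch, assuming the default crash archive (the crash ID below is a placeholder, not one from this cluster):

    # List recorded crashes; each entry carries a timestamp-prefixed crash ID and the entity
    ceph crash ls

    # Show the full report (including the backtrace) for one crash,
    # to compare its stack signature against the one in this ticket
    ceph crash info <crash-id>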

  • this is new hardware, still in POC at the vendor's PG lab
  • storage is currently backed by a Dell PERC H745 RAID controller in passthrough mode
  • OS: Ubuntu 18.04.5, kernel 5.4.0-58-generic
  • ceph version 15.2.5
  • journalctl/syslog indicate no underlying physical or SCSI-level events leading up to the crashes or since
  • I have a bluestore fsck dump too, but even compressed it's 3.2 GB. I can extract or filter whatever you need out of that (a command sketch for producing such a dump follows this list).
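
A dump like the one mentioned in the last bullet is typically produced along these lines; this is a sketch only, and the OSD path shown is the default layout rather than anything confirmed for this cluster:

    # Stop the OSD, then run a deep fsck against its BlueStore data directory
    # (default osd path layout; adjust for the actual deployment)
    systemctl stop ceph-osd@101
    ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-101 \
        --log-file /tmp/osd-101-fsck.log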

Relevant excerpt from ceph-osd.101.log:

2021-02-08T16:05:48.536+0000 7fe119dd1700  4 rocksdb: [db/db_impl_write.cc:1436] reusing log 583 from recycle list

2021-02-08T16:05:48.536+0000 7fe119dd1700  4 rocksdb: [db/db_impl_write.cc:1470] [default] New memtable created with log file: #587. Immutable memtables: 0.

2021-02-08T16:05:48.536+0000 7fe11ffe3700  4 rocksdb: (Original Log Time 2021/02/08-16:05:48.542478) [db/db_impl_compaction_flush.cc:2073] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
2021-02-08T16:05:48.536+0000 7fe11ffe3700  4 rocksdb: [db/flush_job.cc:322] [default] [JOB 53] Flushing memtable with next log file: 587

2021-02-08T16:05:48.536+0000 7fe11ffe3700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1612800348542553, "job": 53, "event": "flush_started", "num_memtables": 1, "num_entries": 240551, "num_deletes": 52903, "total_data_size": 254303985, "memory_usage": 260062416, "flush_reason": "Write Buffer Full"}
2021-02-08T16:05:48.536+0000 7fe11ffe3700  4 rocksdb: [db/flush_job.cc:351] [default] [JOB 53] Level-0 flush table #588: started
2021-02-08T16:05:48.924+0000 7fe11ffe3700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1612800348931502, "cf_name": "default", "job": 53, "event": "table_file_creation", "file_number": 588, "file_size": 33771379, "table_properties": {"data_size": 32964927, "index_size": 599182, "filter_size": 206405, "raw_key_size": 4888852, "raw_average_key_size": 59, "raw_value_size": 31001902, "raw_average_value_size": 375, "num_data_blocks": 8045, "num_entries": 82510, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2021-02-08T16:05:48.924+0000 7fe11ffe3700  4 rocksdb: [db/flush_job.cc:392] [default] [JOB 53] Level-0 flush table #588: 33771379 bytes OK
2021-02-08T16:05:48.928+0000 7fe11ffe3700  4 rocksdb: [db/db_impl_files.cc:194] adding log 585 to recycle list

2021-02-08T16:05:48.928+0000 7fe11ffe3700  4 rocksdb: (Original Log Time 2021/02/08-16:05:48.931578) [db/memtable_list.cc:380] [default] Level-0 commit table #588 started
2021-02-08T16:05:48.928+0000 7fe11ffe3700  4 rocksdb: (Original Log Time 2021/02/08-16:05:48.932159) [db/memtable_list.cc:428] [default] Level-0 commit table #588: memtable #1 done
2021-02-08T16:05:48.928+0000 7fe11ffe3700  4 rocksdb: (Original Log Time 2021/02/08-16:05:48.932179) EVENT_LOG_v1 {"time_micros": 1612800348932173, "job": 53, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [4, 4, 34, 0, 0, 0, 0], "immutable_memtables": 0}
2021-02-08T16:05:48.928+0000 7fe11ffe3700  4 rocksdb: (Original Log Time 2021/02/08-16:05:48.932236) [db/db_impl_compaction_flush.cc:201] [default] Level summary: files[4 4 34 0 0 0 0] max score 1.00

2021-02-08T16:05:48.928+0000 7fe120fe5700  4 rocksdb: [db/compaction_job.cc:1645] [default] [JOB 54] Compacting 4@0 + 4@1 files to L1, score 1.00
2021-02-08T16:05:48.928+0000 7fe120fe5700  4 rocksdb: [db/compaction_job.cc:1649] [default] Compaction start summary: Base version 53 Base level 0, inputs: [588(32MB) 586(32MB) 584(33MB) 582(33MB)], [565(65MB) 566(65MB) 568(35MB) 569(53MB)]

2021-02-08T16:05:48.928+0000 7fe120fe5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1612800348932582, "job": 54, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [588, 586, 584, 582], "files_L1": [565, 566, 568, 569], "score": 1, "input_data_size": 368973830}

2021-02-08T16:05:50.036+0000 7fe120fe5700  3 rocksdb: [db/db_impl_compaction_flush.cc:2660] *Compaction error: Corruption: block checksum mismatch*: expected 3989109363, got 2324967102  in db/000586.sst offset 33582463 size 601701
2021-02-08T16:05:50.036+0000 7fe120fe5700  4 rocksdb: (Original Log Time 2021/02/08-16:05:50.040429) [db/compaction_job.cc:768] [default] compacted to: files[4 4 34 0 0 0 0] max score 0.81, MB/sec: 333.1 rd, 0.0 wr, level 1, files in(4, 4) out(0) MB in(131.6, 220.3) out(0.0), read-write-amplify(2.7) write-amplify(0.0) Corruption: block checksum mismatch: expected 3989109363, got 2324967102  in db/000586.sst offset 33582463 size 601701, records in: 930768, records dropped: 0 output_compression: NoCompression

2021-02-08T16:05:50.036+0000 7fe120fe5700  4 rocksdb: (Original Log Time 2021/02/08-16:05:50.040483) EVENT_LOG_v1 {"time_micros": 1612800350040458, "job": 54, "event": "compaction_finished", "compaction_time_micros": 1107766, "compaction_time_cpu_micros": 12232, "output_level": 1, "num_output_files": 0, "total_output_size": 0, "num_input_records": 0, "num_output_records": 0, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 4, 34, 0, 0, 0, 0]}
2021-02-08T16:05:50.036+0000 7fe120fe5700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 3989109363, got 2324967102  in db/000586.sst offset 33582463 size 601701, Accumulated background error counts: 1
2021-02-08T16:05:50.076+0000 7fe119dd1700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 3989109363, got 2324967102  in db/000586.sst offset 33582463 size 601701 code = 2 Rocksdb transaction:
Put( Prefix = P key = 0x0000000000000a29'.0000017132.00000000000000011671' Value size = 207)
Put( Prefix = P key = 0x0000000000000a29'._fastinfo' Value size = 186)
Put( Prefix = P key = 0x0000000000000a29'.can_rollback_to' Value size = 12)
Put( Prefix = P key = 0x0000000000000a29'.rollback_info_trimmed_to' Value size = 12)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00000000'x' Value size = 529)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00080000'x' Value size = 531)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00100000'x' Value size = 531)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00180000'x' Value size = 531)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00200000'x' Value size = 531)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00280000'x' Value size = 531)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00300000'x' Value size = 531)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff6f00380000'x' Value size = 82)
Put( Prefix = O key = 0x88800000000000002837196f06213130'000000165.000000cb!='0xfffffffffffffffeffffffffffffffff'o' Value size = 485)
Merge( Prefix = b key = 0x000001b0c9800000 Value size = 16)
Merge( Prefix = T key = 0x0000000000000028 Value size = 40)
2021-02-08T16:05:50.088+0000 7fe119dd1700 -1 /build/ceph-15.2.5/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fe119dd1700 time 2021-02-08T16:05:50.083093+0000
/build/ceph-15.2.5/src/os/bluestore/BlueStore.cc: 11469: FAILED ceph_assert(r == 0)

 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x154) [0x55f84e601f92]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55f84e60216d]
 3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x55f84eb3e7d0]
 4: (BlueStore::_kv_sync_thread()+0xbdd) [0x55f84eb90cfd]
 5: (BlueStore::KVSyncThread::entry()+0xd) [0x55f84ebb5cbd]
 6: (()+0x76db) [0x7fe12d5226db]
 7: (clone()+0x3f) [0x7fe12c2c271f]

2021-02-08T16:05:50.096+0000 7fe119dd1700 -1 *** Caught signal (Aborted) **
 in thread 7fe119dd1700 thread_name:bstore_kv_sync

 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)
 1: (()+0x12980) [0x7fe12d52d980]
 2: (gsignal()+0xc7) [0x7fe12c1dffb7]
 3: (abort()+0x141) [0x7fe12c1e1921]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x55f84e601fe3]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55f84e60216d]
 6: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x55f84eb3e7d0]
 7: (BlueStore::_kv_sync_thread()+0xbdd) [0x55f84eb90cfd]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55f84ebb5cbd]
 9: (()+0x76db) [0x7fe12d5226db]
 10: (clone()+0x3f) [0x7fe12c2c271f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

osd101.perf.dump.json - ceph tell osd.101 perf dump (29.6 KB) Chris K, 02/12/2021 07:54 PM

osd-101-crash-reports.tar.gz - 5 most recent crash json dumps, on 5 different OSDs (5.94 KB) Chris K, 02/12/2021 07:54 PM

ceph-osd.101.log.2.gz - OSD log that contains the example crash event (line 7575) (226 KB) Chris K, 02/12/2021 07:54 PM

ceph-osd.99.log_witherror.gz - osd 99 multiple corruption events on same dev (422 KB) Chris K, 03/04/2021 09:40 PM

History

#1 Updated by Igor Fedotov about 3 years ago

Are these failures transient? I mean, is the affected OSD able to start up successfully after a while?

#2 Updated by Igor Fedotov about 3 years ago

Igor Fedotov wrote:

Are these failures transient? I mean, is the affected OSD able to start up successfully after a while?

Well, I presume the OSD is able to restart...
A more precise question would be: if the same OSD fails again, does it show the same .sst file name, offset, and expected/got checksums in lines like the following?
2021-02-08T16:05:50.036+0000 7fe120fe5700 2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 3989109363, got 2324967102 in db/000586.sst offset 33582463 size 601701, Accumulated background error counts: 1
2021-02-08T16:05:50.076+0000 7fe119dd1700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 3989109363, got 2324967102 in db/000586.sst offset 33582463 size 601701 code = 2 Rocksdb transaction:

#3 Updated by Chris K about 3 years ago

Hello Igor,

These OSDs can be restarted, and it seems they resync and fall back in line without trouble.

I'll reduce our OSD count from 100 to 20 and try to get the same OSD to fail twice.

#4 Updated by Igor Fedotov about 3 years ago

I'm wondering whether you have swap enabled on your OSD nodes. And perhaps excessive RSS memory usage (well above the configured osd_memory_target) for the ceph-osd processes?
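
A minimal sketch of how both of these can be checked, using standard tooling and the Ceph config interface (the OSD id below is illustrative):

    # Resident set size of each ceph-osd process on the node
    ps -C ceph-osd -o pid,rss,cmd

    # Configured memory target for a given OSD (illustrative id)
    ceph config get osd.101 osd_memory_target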

#5 Updated by Chris K about 3 years ago

Yes, both nodes are using swap. It's only an 8 GB swap file on the / filesystem, though, backed by a RAID 10 array provided by the PERC.

#6 Updated by Igor Fedotov about 3 years ago

Chris K wrote:

Yes, both nodes are using swap. It's only an 8 GB swap file on the / filesystem, though, backed by a RAID 10 array provided by the PERC.

Can you see whether swap is actually in use?
Are you observing high RSS memory usage from the ceph-osd processes?

Would it be possible to turn swap off and check whether that helps?
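
A sketch of these checks with generic Linux commands; nothing below is specific to this cluster:

    # Is swap in active use on the node?
    swapon --show
    free -h

    # Per-process swap usage for the OSD daemons (VmSwap from /proc)
    for p in $(pgrep ceph-osd); do
        echo "pid $p: $(grep VmSwap /proc/$p/status)"
    done

    # Disable swap for the test; re-enable afterwards with `swapon -a`
    swapoff -a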

#7 Updated by Chris K about 3 years ago

Hello Igor, sorry for the long wait.

We were able to get a duplicate crash on the same OSD; the OSD log is attached. The corruption reports refer to different files in db/*.sst.

Since this crash, we've made some substantial changes:
  • replaced the PERC with an HBA (a Dell HBA355i)
  • updated the OS to Ubuntu 20.04.5 (due to compatibility with the HBA)
  • moved up to the latest Ceph release, 15.2.8

We'll begin testing shortly and I'll report back with what happens.

#8 Updated by Chris K about 3 years ago

After moving off the PERC onto an HBA and upgrading the OS and Ceph, we haven't seen this problem reappear. With so many changes at once, it's hard to say what the cause was. My bet is on the PERC configured with 50x RAID 0 virtual disks; that's definitely a design edge case.

Thank you for your support, Igor; we appreciate it!

#9 Updated by Neha Ojha about 3 years ago

  • Status changed from New to Closed
