Bug #37282

rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2

Added by Jeff Smith almost 2 years ago. Updated about 1 month ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I have an OSD that will not start. It keeps crashing, and I am not sure where to go from here. Unfortunately, it happened right after 2 other drives died. This means I have PGs down and cannot access the files in cephfs.

# /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph
starting osd.8 at - osd_data /var/lib/ceph/osd/ceph-8 /var/lib/ceph/osd/ceph-8/journal
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f4c4aea3700 time 2018-11-15 17:28:00.093400
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0)
2018-11-15 17:28:00.091 7f4c4aea3700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2133069443, got 3635521166  in db/002194.sst offset 30843842 size 4614o code = 2 Rocksdb transaction:
Put( Prefix = P key = 0x00000000005543dd'.can_rollback_to' Value size = 12)
Put( Prefix = P key = 0x00000000005543dd'.rollback_info_trimmed_to' Value size = 12)
Put( Prefix = O key = 0x858000000000000015f000000021213dfffffffffffffffeffffffffffffffff'o' Value size = 31)
Put( Prefix = S key = 'nid_max' Value size = 8)
Put( Prefix = S key = 'blobid_max' Value size = 8)
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f4c615c65c2]
 2: (()+0x26c787) [0x7f4c615c6787]
 3: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 5: (()+0x76db) [0x7f4c5fcc06db]
 6: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-11-15 17:28:00.091 7f4c4aea3700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f4c4aea3700 time 2018-11-15 17:28:00.093400
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0)

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f4c615c65c2]
 2: (()+0x26c787) [0x7f4c615c6787]
 3: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 5: (()+0x76db) [0x7f4c5fcc06db]
 6: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -1> 2018-11-15 17:28:00.091 7f4c4aea3700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2133069443, got 3635521166  in db/002194.sst offset 30843842 size 4614o code = 2 Rocksdb transaction:
Put( Prefix = P key = 0x00000000005543dd'.can_rollback_to' Value size = 12)
Put( Prefix = P key = 0x00000000005543dd'.rollback_info_trimmed_to' Value size = 12)
Put( Prefix = O key = 0x858000000000000015f000000021213dfffffffffffffffeffffffffffffffff'o' Value size = 31)
Put( Prefix = S key = 'nid_max' Value size = 8)
Put( Prefix = S key = 'blobid_max' Value size = 8)
     0> 2018-11-15 17:28:00.091 7f4c4aea3700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f4c4aea3700 time 2018-11-15 17:28:00.093400
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0)

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f4c615c65c2]
 2: (()+0x26c787) [0x7f4c615c6787]
 3: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 5: (()+0x76db) [0x7f4c5fcc06db]
 6: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7f4c4aea3700 thread_name:bstore_kv_sync
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (()+0x91a780) [0x55c37e0f8780]
 2: (()+0x12890) [0x7f4c5fccb890]
 3: (gsignal()+0xc7) [0x7f4c5eba6e97]
 4: (abort()+0x141) [0x7f4c5eba8801]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f4c615c6710]
 6: (()+0x26c787) [0x7f4c615c6787]
 7: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 9: (()+0x76db) [0x7f4c5fcc06db]
 10: (clone()+0x3f) [0x7f4c5ec8988f]
2018-11-15 17:28:00.095 7f4c4aea3700 -1 *** Caught signal (Aborted) **
 in thread 7f4c4aea3700 thread_name:bstore_kv_sync

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (()+0x91a780) [0x55c37e0f8780]
 2: (()+0x12890) [0x7f4c5fccb890]
 3: (gsignal()+0xc7) [0x7f4c5eba6e97]
 4: (abort()+0x141) [0x7f4c5eba8801]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f4c615c6710]
 6: (()+0x26c787) [0x7f4c615c6787]
 7: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 9: (()+0x76db) [0x7f4c5fcc06db]
 10: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2018-11-15 17:28:00.095 7f4c4aea3700 -1 *** Caught signal (Aborted) **
 in thread 7f4c4aea3700 thread_name:bstore_kv_sync

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (()+0x91a780) [0x55c37e0f8780]
 2: (()+0x12890) [0x7f4c5fccb890]
 3: (gsignal()+0xc7) [0x7f4c5eba6e97]
 4: (abort()+0x141) [0x7f4c5eba8801]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f4c615c6710]
 6: (()+0x26c787) [0x7f4c615c6787]
 7: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 9: (()+0x76db) [0x7f4c5fcc06db]
 10: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted (core dumped)
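
For triage, the SST file, offset, and size named in the checksum-mismatch message can be pulled out of the log mechanically. The following is a minimal Python sketch and not part of Ceph or RocksDB; the function name `parse_checksum_mismatch` and the regular expression are assumptions inferred only from the message format seen in the log above.

```python
import re

# Pattern inferred from the log line in this report (an assumption);
# other RocksDB versions may format the message differently.
MISMATCH_RE = re.compile(
    r"block checksum mismatch: expected (?P<expected>\d+), got (?P<got>\d+)\s+"
    r"in (?P<file>\S+) offset (?P<offset>\d+) size (?P<size>\d+)"
)


def parse_checksum_mismatch(line):
    """Return a dict describing the mismatch, or None if the line does not match."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    # Convert the numeric fields so they can be compared or used as offsets.
    for key in ("expected", "got", "offset", "size"):
        info[key] = int(info[key])
    return info


# Log line copied verbatim from the report above.
sample = (
    "2018-11-15 17:28:00.091 7f4c4aea3700 -1 rocksdb: submit_common error: "
    "Corruption: block checksum mismatch: expected 2133069443, got 3635521166  "
    "in db/002194.sst offset 30843842 size 4614o code = 2 Rocksdb transaction:"
)
print(parse_checksum_mismatch(sample))
```

Knowing which file and offset failed verification may help when comparing several crashes, for example to see whether the same SST file is reported each time.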

ceph-osd.25.log (101 KB) David Sieger, 02/01/2019 01:50 PM

ceph-osd.0.log.gz (673 KB) Jamin Collins, 01/30/2020 08:20 PM

ceph-osd.5.log.gz (762 KB) Jamin Collins, 01/30/2020 08:20 PM


Related issues

Related to bluestore - Bug #40080: Bitmap allocator return duplicate entries which cause interval_set assert Resolved 05/30/2019
Related to bluestore - Bug #41367: rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 Duplicate 08/21/2019

History

#1 Updated by Igor Fedotov almost 2 years ago

First, I suggest checking the disk drive behind the DB volume for physical errors.

#2 Updated by Jeff Smith almost 2 years ago

I have checked the kernel log and smartctl and do not see any errors.

#3 Updated by Igor Fedotov almost 2 years ago

A somewhat similar issue, which may be useful as recovery guidance:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031595.html

#4 Updated by Josh Durgin almost 2 years ago

  • Status changed from New to Need More Info

#5 Updated by David Sieger over 1 year ago

I might have been bitten by the same issue. The OSD in question has its main data on a spinning drive and its database on a partition of an SSD. A hardware issue has not been completely ruled out; it just looks unlikely as far as I was able to investigate.

I ran ceph-bluestore-tool fsck on the OSD, which resulted in this output:

$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-25
2019-02-01 14:00:16.482736 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x300000
2019-02-01 14:00:16.482753 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: 0x7f8000000000000002df7e2d92217262'.0.5a6ca9.238e1f29.0000000082fc!='0xfffffffffffffffeffffffffffffffff6f00300000'x' is unexpected
2019-02-01 14:00:16.482773 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x380000
2019-02-01 14:00:16.482774 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: 0x7f8000000000000002df7e2d92217262'.0.5a6ca9.238e1f29.0000000082fc!='0xfffffffffffffffeffffffffffffffff6f00380000'x' is unexpected
2019-02-01 14:00:44.644396 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: actual store_statfs(0x49108d0000/0xe8e0c00000, stored 0x9dfe74e1aa/0x9f90320000, compress 0x0/0x0/0x0) != expected store_statfs(0x49108d0000/0xe8e0c00000, stored 0x9dfe34e1aa/0x9f8ff20000, compress 0x0/0x0/0x0)
2019-02-01 14:00:46.974661 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: leaked extent 0xb29b0a0000~400000
fsck success

It did not make any difference, though. Also, I cannot tell if the errors noted by fsck are related to this issue or not.
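
To compare fsck runs across OSDs, the error lines can be grouped by kind. This is a rough Python sketch under stated assumptions: the helper name `summarize_fsck` and the category substrings are hypothetical and taken only from the messages quoted in this comment, so real fsck output may contain other kinds of errors.

```python
from collections import Counter

# Categories are substrings of the fsck messages quoted above (an
# assumption); any other fsck error is counted under "other".
CATEGORIES = ("stray shard", "leaked extent", "store_statfs", "is unexpected")


def summarize_fsck(lines):
    """Count 'fsck error:' lines by coarse category."""
    counts = Counter()
    for line in lines:
        _, sep, message = line.partition("fsck error: ")
        if not sep:
            continue  # not an fsck error line
        for category in CATEGORIES:
            if category in message:
                counts[category] += 1
                break
        else:
            counts["other"] += 1
    return counts


# Three error lines and the final status line, copied from the output above.
fsck_output = [
    "2019-02-01 14:00:16.482736 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x300000",
    "2019-02-01 14:00:16.482773 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x380000",
    "2019-02-01 14:00:46.974661 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: leaked extent 0xb29b0a0000~400000",
    "fsck success",
]
print(dict(summarize_fsck(fsck_output)))
```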

The crash itself looks like this:

     -1> 2019-02-01 12:22:46.111821 7fe079d53700 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = O key = 0x7f80000000000000021600000021213dfffffffffffffffeffffffffffffffff'o' Value size = 30)
      0> 2019-02-01 12:22:46.117761 7fe079d53700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fe079d53700 time 2019-02-01 12:22:46.111884
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 8717: FAILED assert(r == 0)

  ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x562af51e5e90]
  2: (BlueStore::_kv_sync_thread()+0x3482) [0x562af5090162]
  3: (BlueStore::KVSyncThread::entry()+0xd) [0x562af50d701d]
  4: (()+0x7e25) [0x7fe089e12e25]
  5: (clone()+0x6d) [0x7fe088f03bad]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 

I attached a log file of the full startup-and-crash cycle.

#6 Updated by Sage Weil over 1 year ago

  • Priority changed from Normal to High

#7 Updated by Sage Weil over 1 year ago

We're not sure how to proceed without being able to reproduce the crash, and we have never seen this.

1. Would it be possible to provide a copy of the rocksdb portion of your osd? I'm hoping that will export the rocksdb issue, which would then let us hit the same error locally with something like ceph-kvstore-tool. You'd do this with

ceph-bluestore-tool bluefs-export ...

2. Or, could you provide a full image of the osd? This is bigger and obviously isn't possible if the data is sensitive, but if 1 doesn't work hopefully 2 would let us see the problem.

Thanks!

#8 Updated by Radoslaw Zarzynski over 1 year ago

Keeping "needs more info" state.

#9 Updated by Dan van der Ster over 1 year ago

We just saw this on an osd (block.db on ssd, data on hdd). OSD is from a cephfs cluster running 12.2.11.

We're actively converting this cluster from filestore to bluestore; this osd had just been created as bluestore around 08:50 on 2019-04-08 and was still backfilling in its PGs.

The osd started crashing like this:

2019-04-08 14:57:16.223895 7f1df1264700  2 rocksdb: [/builddir/build/BUILD/ceph-12.2.11/src/rocksdb/db/db_impl_compaction_flush.cc:1275] Waiting after background compaction error: Corruption: block checksum mismatch, Accumulated background error counts: 1
2019-04-08 14:57:16.304853 7f1df2a67700 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = O key = 0x7f8000000000000001d840000021213dfffffffffffffffeffffffffffffffff'o' Value size = 29)
2019-04-08 14:57:16.307051 7f1df2a67700 -1 /builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f1df2a67700 time 2019-04-08 14:57:16.304885
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueStore.cc: 8795: FAILED assert(r == 0)

 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55aa5f5d8b20]
 2: (BlueStore::_kv_sync_thread()+0x3482) [0x55aa5f4811c2]
 3: (BlueStore::KVSyncThread::entry()+0xd) [0x55aa5f4c86dd]
 4: (()+0x7dd5) [0x7f1e02b2cdd5]
 5: (clone()+0x6d) [0x7f1e01c1cead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

We zapped the hdd and ssd partition, recreated with the same osd-id, and then during backfilling got this around 12 hours later:

    -2> 2019-04-09 03:21:33.861491 7fa811497700  0 osd.121 pg_epoch: 60035 pg[1.308( v 60035'87922443 lc 59819'87922140 (59734'87920547,60035'87922443] local-lis/les=60034/60035 n=147011 ec=369/369 lis/c 60034/59631 les/c/f 60035/59632/0 60034/60034/60034) [121,142,61] r=0 lpr=60034 pi=[59631,60034)/1 crt=60033'87922442 mlcod 59819'87922140 active+recovering+degraded m=208 mbc={255={(2+0)=36}}] _update_calc_stats ml 36 upset size 3 up 2
    -1> 2019-04-09 03:21:33.887078 7fa80cc8e700 -1 abort: Corruption: Bad table magic number
     0> 2019-04-09 03:21:33.891522 7fa80cc8e700 -1 *** Caught signal (Aborted) **
 in thread 7fa80cc8e700 thread_name:tp_osd_tp

 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)
 1: (()+0xa63b61) [0x55e9ea9a3b61]
 2: (()+0xf5d0) [0x7fa828d735d0]
 3: (gsignal()+0x37) [0x7fa827d94207]
 4: (abort()+0x148) [0x7fa827d958f8]
 5: (RocksDBStore::get(std::string const&, char const*, unsigned long, ceph::buffer::list*)+0x1ce) [0x55e9ea8f3b6e]
 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x548) [0x55e9ea89e3d8]
 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xd9e) [0x55e9ea8b085e]
 8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3a0) [0x55e9ea8b1f90]
 9: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x171) [0x55e9ea495a31]
 10: (PrimaryLogPG::remove_missing_object(hobject_t const&, eversion_t, Context*)+0x70b) [0x55e9ea5a2a5b]
 11: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x9e1) [0x55e9ea5c2461]
 12: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0xfe4) [0x55e9ea5ff834]
 13: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x490) [0x55e9ea607ab0]

We then tried fsck and got this output:

2019-04-09 11:24:29.051127 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x80000
2019-04-09 11:24:29.051138 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00080000'x' is unexpected
2019-04-09 11:24:29.051149 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x100000
2019-04-09 11:24:29.051151 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00100000'x' is unexpected
2019-04-09 11:24:29.051155 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x180000
2019-04-09 11:24:29.051156 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00180000'x' is unexpected
2019-04-09 11:24:29.051160 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x200000
2019-04-09 11:24:29.051160 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00200000'x' is unexpected
2019-04-09 11:24:29.051164 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x280000
2019-04-09 11:24:29.051165 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00280000'x' is unexpected
2019-04-09 11:24:29.051168 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x300000
2019-04-09 11:24:29.051169 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00300000'x' is unexpected
2019-04-09 11:24:29.051172 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x380000
2019-04-09 11:24:29.051172 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00380000'x' is unexpected
2019-04-09 11:24:34.789904 7fdb4be78ec0 -1 abort: Corruption: Bad table magic number
*** Caught signal (Aborted) **
 in thread 7fdb4be78ec0 thread_name:ceph-bluestore-
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)
 1: (()+0x3fd311) [0x5592da925311]
 2: (()+0xf5d0) [0x7fdb40eb85d0]
 3: (gsignal()+0x37) [0x7fdb3f8a1207]
 4: (abort()+0x148) [0x7fdb3f8a28f8]
 5: (RocksDBStore::get(std::string const&, std::string const&, ceph::buffer::list*)+0x1c7) [0x5592da7dc4a7]
 6: (()+0x1fb244) [0x5592da723244]
 7: (()+0x1fa00f) [0x5592da72200f]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x3a3) [0x5592da77e8f3]
 9: (BlueStore::_fsck(bool, bool)+0x1d79) [0x5592da7a2089]
 10: (main()+0x154f) [0x5592da655ddf]
 11: (__libc_start_main()+0xf5) [0x7fdb3f88d3d5]
 12: (()+0x1c4f8f) [0x5592da6ecf8f]
2019-04-09 11:24:34.791061 7fdb4be78ec0 -1 *** Caught signal (Aborted) **
 in thread 7fdb4be78ec0 thread_name:ceph-bluestore-

There are no kernel messages about medium errors. The SMART counters look fine, and a long SMART test is still ongoing.

I posted the bluefs-export to a5597af2-08c6-47a8-a3e9-029b1ac2e7bf

#10 Updated by Sage Weil over 1 year ago

  • Priority changed from High to Urgent

Taking a look at this. It's interesting that it happened twice on the same device(s)... did it occur again after that or did you just skip that device?

#11 Updated by Dan van der Ster over 1 year ago

Same devices (HDD for data and ssd partition for block.db) for both failures.

We have left the osd down since the second failure, so we can do whatever is needed now to help debug.

#12 Updated by Sage Weil over 1 year ago

Dan, that dump appears to have multiple errors:

2019-04-25 09:15:18.766 7f8c0a4b0140  1 rocksdb: do_open column families: [default]
2019-04-25 09:15:18.772 7f8c027fc700  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/table/block_based_table_reader.cc:1159] Encountered error while reading data from properties block Corruption: block checksum mismatch: expected 1627042428, got 2680008382  in 121/db/000307.sst offset 67524299 size 82
2019-04-25 09:15:18.772 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 293 --- Corruption: bad block contents

2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 294 --- Corruption: bad block contents

2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 295 --- NotFound: 

2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 296 --- Corruption: bad block contents

2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 301 --- Corruption: bad block contents

2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 302 --- NotFound: 

ceph-kvstore-tool: /home/sage/src/ceph/src/rocksdb/table/block.cc:731: uint32_t rocksdb::Block::NumRestarts() const: Assertion `size_ >= 2*sizeof(uint32_t)' failed.

Would it be possible to try that OSD one more time, but with debug_bluefs=20? That may give us some clue where the corruption is coming from.

#13 Updated by Dan van der Ster over 1 year ago

would it be possible to try that OSD one more time

do you mean to zap/recreate it and try again or just start it as-is with debug_bluefs=20 ?

#14 Updated by Sage Weil over 1 year ago

Dan van der Ster wrote:

would it be possible to try that OSD one more time

do you mean to zap/recreate it and try again or just start it as-is with debug_bluefs=20 ?

zap/recreate, but with debug turned up

#15 Updated by Dan van der Ster over 1 year ago

Sage Weil wrote:

Dan van der Ster wrote:

would it be possible to try that OSD one more time

do you mean to zap/recreate it and try again or just start it as-is with debug_bluefs=20 ?

zap/recreate, but with debug turned up

ok that's done and it's running now. Last time it took a few hours to crash; I'll update when we have a crash (or when our log dir fills up ;) ).

#16 Updated by Igor Fedotov over 1 year ago

Just to record a similar case and the root cause we discovered:
Our customer running Ceph v12.2.11 complained about the same errors, which prevented the OSD from starting:

7fc4e66eb700 -1 rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
...

Additional investigation discovered several earlier OSD crashes caused by unexpected failures reported during BlueFS flush (triggered by RocksDB compaction). The first one had occurred 3 days before the initially reported failure.

Corresponding log output:
-1 bdev(.../block) _aio_thread got r=-61 ((61) No data available)
-1 .../KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f4cc3be9700 time ... KernelDevice.cc: 392: FAILED assert(0 == "got unexpected error from aio_t::get_return_value. " "This may suggest HW issue. Please check your dmesg!")

dmesg output analysis showed relevant disk write failures:
kernel: sd 1:1:0:7: [sdi] Unaligned partial completion (resid=32, sector_sz=512)
kernel: sd 1:1:0:7: [sdi] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 1:1:0:7: [sdi] tag#9 Sense Key : Medium Error [current]
kernel: sd 1:1:0:7: [sdi] tag#9 Add. Sense: Unrecovered read error
kernel: sd 1:1:0:7: [sdi] tag#9 CDB: Write(10) 2a 00 6c 03 7d 20 00 02 00 00
kernel: blk_update_request: critical medium error, dev sdi, sector 1812167968

The smartctl report is clean, though.
Notably, there is a RAID layer in between, which probably hides the proper SMART information.

Hence we've considered this a HW failure.
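
Since SMART data can look clean while the kernel log still shows failures, it may be worth scanning the kernel log directly. Below is a minimal Python sketch of such a scan; the function name `find_io_errors` and the marker list are assumptions drawn only from the dmesg excerpt above and are far from exhaustive.

```python
# Substrings taken from the dmesg excerpt above (an assumption, not a
# complete list of kernel I/O error messages).
IO_ERROR_MARKERS = (
    "Medium Error",
    "Unrecovered read error",
    "Unaligned partial completion",
    "FAILED Result",
)


def find_io_errors(kernel_log_lines):
    """Return (line, marker) pairs for lines containing a known marker."""
    hits = []
    for line in kernel_log_lines:
        lowered = line.lower()
        for marker in IO_ERROR_MARKERS:
            # Case-insensitive match, so "critical medium error" is caught too.
            if marker.lower() in lowered:
                hits.append((line.rstrip("\n"), marker))
                break
    return hits


# Two lines from the excerpt above plus one unrelated, made-up line.
kernel_log = [
    "kernel: sd 1:1:0:7: [sdi] tag#9 Sense Key : Medium Error [current]",
    "kernel: blk_update_request: critical medium error, dev sdi, sector 1812167968",
    "kernel: usb 1-1: new high-speed USB device",  # hypothetical benign line
]
for line, marker in find_io_errors(kernel_log):
    print(marker, "->", line)
```

The input could be, for example, the saved output of `dmesg` read line by line.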

#17 Updated by Paul Emmerich about 1 year ago

I'm seeing this on 14.2.2. Disk seems healthy.

The OSD in question suffered from https://tracker.ceph.com/issues/40080 before that crash happened

#18 Updated by Sage Weil about 1 year ago

  • Related to Bug #40080: Bitmap allocator return duplicate entries which cause interval_set assert added

#19 Updated by Sage Weil about 1 year ago

  • Subject changed from /build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0) to rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2

#21 Updated by Neha Ojha about 1 year ago

  • Related to Bug #41367: rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 added

#22 Updated by Sage Weil 8 months ago

  • Target version changed from v13.2.2 to v15.0.0

#23 Updated by Jamin Collins 8 months ago

It appears that I'm seeing the same problem with (AFAIK) the most recent version of Ceph:

The OSDs in question are rotational devices fronted by an SSD-backed LVM volume.

Jan 27 07:35:23 langhus-1 systemd[1]: Starting Ceph object storage daemon osd.0...
Jan 27 07:35:23 langhus-1 systemd[1]: Started Ceph object storage daemon osd.0.
Jan 27 07:35:24 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:24.191 7fc46a8eac00 -1 Falling back to public interface
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:26.087 7fc45de45700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2932418700, got 2818836186  in db/001491.sst offset 18135301 size 3865 code = 2 Rocksdb transaction:
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Delete( Prefix = O key = 0x7f7ffffffffffffffcdd000000217363'rub_2.bb!='0xfffffffffffffffeffffffffffffffff'o')
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:26.091 7fc45de45700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: *** Caught signal (Aborted) **
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  in thread 7fc45de45700 thread_name:bstore_kv_sync
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (()+0x14930) [0x7fc46ae82930]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (gsignal()+0x145) [0x7fc46a96bf25]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (abort()+0x12b) [0x7fc46a955897]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  8: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  9: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:26.091 7fc45de45700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  in thread 7fc45de45700 thread_name:bstore_kv_sync
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (()+0x14930) [0x7fc46ae82930]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (gsignal()+0x145) [0x7fc46a96bf25]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (abort()+0x12b) [0x7fc46a955897]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  8: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  9: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:   -481> 2020-01-27 07:35:24.191 7fc46a8eac00 -1 Falling back to public interface
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:     -2> 2020-01-27 07:35:26.087 7fc45de45700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2932418700, got 2818836186  in db/001491.sst offset 18135301 size 3865 code = 2 Rocksdb transaction:
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Delete( Prefix = O key = 0x7f7ffffffffffffffcdd000000217363'rub_2.bb!='0xfffffffffffffffeffffffffffffffff'o')
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:     -1> 2020-01-27 07:35:26.091 7fc45de45700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:      0> 2020-01-27 07:35:26.091 7fc45de45700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  in thread 7fc45de45700 thread_name:bstore_kv_sync
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (()+0x14930) [0x7fc46ae82930]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (gsignal()+0x145) [0x7fc46a96bf25]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (abort()+0x12b) [0x7fc46a955897]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  8: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  9: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:   -716> 2020-01-27 07:35:24.191 7fc46a8eac00 -1 Falling back to public interface
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:   -715> 2020-01-27 07:35:26.087 7fc45de45700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2932418700, got 2818836186  in db/001491.sst offset 18135301 size 3865 code = 2 Rocksdb transaction:
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Delete( Prefix = O key = 0x7f7ffffffffffffffcdd000000217363'rub_2.bb!='0xfffffffffffffffeffffffffffffffff'o')
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:   -714> 2020-01-27 07:35:26.091 7fc45de45700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:   -713> 2020-01-27 07:35:26.091 7fc45de45700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  in thread 7fc45de45700 thread_name:bstore_kv_sync
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  1: (()+0x14930) [0x7fc46ae82930]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  2: (gsignal()+0x145) [0x7fc46a96bf25]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  3: (abort()+0x12b) [0x7fc46a955897]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  5: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  8: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  9: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Main process exited, code=killed, status=6/ABRT
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 3.
Jan 27 07:35:26 langhus-1 systemd[1]: Stopped Ceph object storage daemon osd.0.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
Jan 27 07:35:26 langhus-1 systemd[1]: Failed to start Ceph object storage daemon osd.0.
$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0/ 
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f3612cd5d80 time 2020-01-27 08:12:34.731236
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 2932418700, got 2818836186  in db/001491.sst offset 18135301 size 3865")
 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f36139cca34]
 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c]
 3: (()+0x269a3b) [0x5609d1cdea3b]
 4: (()+0x2574d1) [0x5609d1ccc4d1]
 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc]
 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a]
 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x5609d1d5b25d]
 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1]
 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86]
 10: (main()+0x1274) [0x5609d1c62314]
 11: (__libc_start_main()+0xf3) [0x7f3612f7c153]
 12: (_start()+0x2e) [0x5609d1c869ce]
*** Caught signal (Aborted) **
 in thread 7f3612cd5d80 thread_name:ceph-bluestore-
 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x14930) [0x7f3613481930]
 2: (gsignal()+0x145) [0x7f3612f90f25]
 3: (abort()+0x12b) [0x7f3612f7a897]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f36139ccb0d]
 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c]
 6: (()+0x269a3b) [0x5609d1cdea3b]
 7: (()+0x2574d1) [0x5609d1ccc4d1]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc]
 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a]
 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x5609d1d5b25d]
 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1]
 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86]
 13: (main()+0x1274) [0x5609d1c62314]
 14: (__libc_start_main()+0xf3) [0x7f3612f7c153]
 15: (_start()+0x2e) [0x5609d1c869ce]
2020-01-27 08:12:34.730 7f3612cd5d80 -1 /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f3612cd5d80 time 2020-01-27 08:12:34.731236
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 2932418700, got 2818836186  in db/001491.sst offset 18135301 size 3865")

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f36139cca34]
 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c]
 3: (()+0x269a3b) [0x5609d1cdea3b]
 4: (()+0x2574d1) [0x5609d1ccc4d1]
 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc]
 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a]
 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x5609d1d5b25d]
 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1]
 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86]
 10: (main()+0x1274) [0x5609d1c62314]
 11: (__libc_start_main()+0xf3) [0x7f3612f7c153]
 12: (_start()+0x2e) [0x5609d1c869ce]

2020-01-27 08:12:34.730 7f3612cd5d80 -1 *** Caught signal (Aborted) **
 in thread 7f3612cd5d80 thread_name:ceph-bluestore-

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x14930) [0x7f3613481930]
 2: (gsignal()+0x145) [0x7f3612f90f25]
 3: (abort()+0x12b) [0x7f3612f7a897]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f36139ccb0d]
 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c]
 6: (()+0x269a3b) [0x5609d1cdea3b]
 7: (()+0x2574d1) [0x5609d1ccc4d1]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc]
 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a]
 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x5609d1d5b25d]
 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1]
 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86]
 13: (main()+0x1274) [0x5609d1c62314]
 14: (__libc_start_main()+0xf3) [0x7f3612f7c153]
 15: (_start()+0x2e) [0x5609d1c869ce]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

  -268> 2020-01-27 08:12:34.730 7f3612cd5d80 -1 /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f3612cd5d80 time 2020-01-27 08:12:34.731236
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 2932418700, got 2818836186  in db/001491.sst offset 18135301 size 3865")

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f36139cca34]
 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c]
 3: (()+0x269a3b) [0x5609d1cdea3b]
 4: (()+0x2574d1) [0x5609d1ccc4d1]
 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc]
 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a]
 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x5609d1d5b25d]
 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1]
 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86]
 10: (main()+0x1274) [0x5609d1c62314]
 11: (__libc_start_main()+0xf3) [0x7f3612f7c153]
 12: (_start()+0x2e) [0x5609d1c869ce]

  -267> 2020-01-27 08:12:34.730 7f3612cd5d80 -1 *** Caught signal (Aborted) **
 in thread 7f3612cd5d80 thread_name:ceph-bluestore-

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x14930) [0x7f3613481930]
 2: (gsignal()+0x145) [0x7f3612f90f25]
 3: (abort()+0x12b) [0x7f3612f7a897]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f36139ccb0d]
 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c]
 6: (()+0x269a3b) [0x5609d1cdea3b]
 7: (()+0x2574d1) [0x5609d1ccc4d1]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc]
 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a]
 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x5609d1d5b25d]
 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1]
 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86]
 13: (main()+0x1274) [0x5609d1c62314]
 14: (__libc_start_main()+0xf3) [0x7f3612f7c153]
 15: (_start()+0x2e) [0x5609d1c869ce]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted
Jan 27 07:35:25 langhus-1 systemd[1]: Starting Ceph object storage daemon osd.5...
Jan 27 07:35:25 langhus-1 systemd[1]: Started Ceph object storage daemon osd.5.
Jan 27 07:35:25 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:25.937 7ff1dc81ec00 -1 Falling back to public interface
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:27.494 7ff1cfd79700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2705794548, got 186875627  in db/000956.sst offset 684976 size 53335 code = 2 Rocksdb transaction:
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = O key = 0x7f80000000000000028f00000021213dfffffffffffffffeffffffffffffffff'o' Value size = 29)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7ff1cfd79700 time 2020-01-27 07:35:27.496963
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x55d416bfd845]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:27.494 7ff1cfd79700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7ff1cfd79700 time 2020-01-27 07:35:27.496963
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x55d416bfd845]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: *** Caught signal (Aborted) **
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  in thread 7ff1cfd79700 thread_name:bstore_kv_sync
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (()+0x14930) [0x7ff1dcdb6930]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (gsignal()+0x145) [0x7ff1dc89ff25]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (abort()+0x12b) [0x7ff1dc889897]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x55d416bfd8a0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  8: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  9: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:27.497 7ff1cfd79700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  in thread 7ff1cfd79700 thread_name:bstore_kv_sync
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (()+0x14930) [0x7ff1dcdb6930]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (gsignal()+0x145) [0x7ff1dc89ff25]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (abort()+0x12b) [0x7ff1dc889897]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x55d416bfd8a0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  8: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  9: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:   -483> 2020-01-27 07:35:25.937 7ff1dc81ec00 -1 Falling back to public interface
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:     -3> 2020-01-27 07:35:27.494 7ff1cfd79700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2705794548, got 186875627  in db/000956.sst offset 684976 size 53335 code = 2 Rocksdb transaction:
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = O key = 0x7f80000000000000028f00000021213dfffffffffffffffeffffffffffffffff'o' Value size = 29)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:     -1> 2020-01-27 07:35:27.494 7ff1cfd79700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7ff1cfd79700 time 2020-01-27 07:35:27.496963
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x55d416bfd845]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:      0> 2020-01-27 07:35:27.497 7ff1cfd79700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  in thread 7ff1cfd79700 thread_name:bstore_kv_sync
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (()+0x14930) [0x7ff1dcdb6930]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (gsignal()+0x145) [0x7ff1dc89ff25]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (abort()+0x12b) [0x7ff1dc889897]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x55d416bfd8a0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  8: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  9: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:   -719> 2020-01-27 07:35:25.937 7ff1dc81ec00 -1 Falling back to public interface
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:   -718> 2020-01-27 07:35:27.494 7ff1cfd79700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2705794548, got 186875627  in db/000956.sst offset 684976 size 53335 code = 2 Rocksdb transaction:
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = O key = 0x7f80000000000000028f00000021213dfffffffffffffffeffffffffffffffff'o' Value size = 29)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:   -717> 2020-01-27 07:35:27.494 7ff1cfd79700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7ff1cfd79700 time 2020-01-27 07:35:27.496963
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x55d416bfd845]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:   -716> 2020-01-27 07:35:27.497 7ff1cfd79700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  in thread 7ff1cfd79700 thread_name:bstore_kv_sync
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  1: (()+0x14930) [0x7ff1dcdb6930]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  2: (gsignal()+0x145) [0x7ff1dc89ff25]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  3: (abort()+0x12b) [0x7ff1dc889897]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x55d416bfd8a0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  5: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  6: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  7: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  8: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  9: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Main process exited, code=killed, status=6/ABRT
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Failed with result 'signal'.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Scheduled restart job, restart counter is at 3.
Jan 27 07:35:28 langhus-1 systemd[1]: Stopped Ceph object storage daemon osd.5.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Start request repeated too quickly.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Failed with result 'signal'.
Jan 27 07:35:28 langhus-1 systemd[1]: Failed to start Ceph object storage daemon osd.5.
$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5/
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f5fa5aead80 time 2020-01-27 08:13:04.023081
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 1754497987, got 317490254  in db/000957.sst offset 705635 size 3911")
 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f5fa67e1a34]
 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 3: (()+0x269a3b) [0x563c50805a3b]
 4: (()+0x2574d1) [0x563c507f34d1]
 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 10: (main()+0x1274) [0x563c50789314]
 11: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 12: (_start()+0x2e) [0x563c507ad9ce]
*** Caught signal (Aborted) **
 in thread 7f5fa5aead80 thread_name:ceph-bluestore-
2020-01-27 08:13:04.020 7f5fa5aead80 -1 /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f5fa5aead80 time 2020-01-27 08:13:04.023081
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 1754497987, got 317490254  in db/000957.sst offset 705635 size 3911")

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f5fa67e1a34]
 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 3: (()+0x269a3b) [0x563c50805a3b]
 4: (()+0x2574d1) [0x563c507f34d1]
 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 10: (main()+0x1274) [0x563c50789314]
 11: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 12: (_start()+0x2e) [0x563c507ad9ce]

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x14930) [0x7f5fa6296930]
 2: (gsignal()+0x145) [0x7f5fa5da5f25]
 3: (abort()+0x12b) [0x7f5fa5d8f897]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f5fa67e1b0d]
 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 6: (()+0x269a3b) [0x563c50805a3b]
 7: (()+0x2574d1) [0x563c507f34d1]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 13: (main()+0x1274) [0x563c50789314]
 14: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 15: (_start()+0x2e) [0x563c507ad9ce]
2020-01-27 08:13:04.020 7f5fa5aead80 -1 *** Caught signal (Aborted) **
 in thread 7f5fa5aead80 thread_name:ceph-bluestore-

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x14930) [0x7f5fa6296930]
 2: (gsignal()+0x145) [0x7f5fa5da5f25]
 3: (abort()+0x12b) [0x7f5fa5d8f897]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f5fa67e1b0d]
 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 6: (()+0x269a3b) [0x563c50805a3b]
 7: (()+0x2574d1) [0x563c507f34d1]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 13: (main()+0x1274) [0x563c50789314]
 14: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 15: (_start()+0x2e) [0x563c507ad9ce]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

  -248> 2020-01-27 08:13:04.020 7f5fa5aead80 -1 /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f5fa5aead80 time 2020-01-27 08:13:04.023081
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 1754497987, got 317490254  in db/000957.sst offset 705635 size 3911")

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f5fa67e1a34]
 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 3: (()+0x269a3b) [0x563c50805a3b]
 4: (()+0x2574d1) [0x563c507f34d1]
 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 10: (main()+0x1274) [0x563c50789314]
 11: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 12: (_start()+0x2e) [0x563c507ad9ce]

  -247> 2020-01-27 08:13:04.020 7f5fa5aead80 -1 *** Caught signal (Aborted) **
 in thread 7f5fa5aead80 thread_name:ceph-bluestore-

 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x14930) [0x7f5fa6296930]
 2: (gsignal()+0x145) [0x7f5fa5da5f25]
 3: (abort()+0x12b) [0x7f5fa5d8f897]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f5fa67e1b0d]
 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 6: (()+0x269a3b) [0x563c50805a3b]
 7: (()+0x2574d1) [0x563c507f34d1]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 13: (main()+0x1274) [0x563c50789314]
 14: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 15: (_start()+0x2e) [0x563c507ad9ce]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted
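
To compare which SST files and offsets are reported across the OSD log and the fsck output above, the "block checksum mismatch" messages can be pulled out mechanically. The following is only a minimal sketch: the regular expression is inferred from the messages pasted in this thread, not from the RocksDB source, and the names `parse_mismatch`, `Mismatch` and `affected_files` are hypothetical helpers, not part of any Ceph tool.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern inferred from the log lines in this thread (assumption, not an
# official format): "block checksum mismatch: expected N, got N  in FILE
# offset N size N". The gap before "in" may be one or more spaces.
MISMATCH_RE = re.compile(
    r"block checksum mismatch: expected (\d+), got (\d+)\s+"
    r"in (\S+) offset (\d+) size (\d+)"
)


class Mismatch(NamedTuple):
    expected: int
    got: int
    sst_file: str
    offset: int
    size: int


def parse_mismatch(line: str) -> Optional[Mismatch]:
    """Return the mismatch details found in a log line, or None if absent."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    expected, got, sst_file, offset, size = m.groups()
    return Mismatch(int(expected), int(got), sst_file, int(offset), int(size))


def affected_files(lines: Iterable[str]) -> List[str]:
    """Return the distinct SST file names mentioned in mismatch messages, sorted."""
    files = {m.sst_file for m in map(parse_mismatch, lines) if m is not None}
    return sorted(files)
```

Feeding it the output of `journalctl -u ceph-osd@5` line by line would give one `Mismatch` per reported block, which makes it easier to see whether the same file and offset keep failing or whether the damage is spread out.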

#24 Updated by Jamin Collins 8 months ago

I have a bluefs-export of each OSD. They are a few GB in size; how would you like me to provide them?

#25 Updated by Igor Fedotov 8 months ago

Hi Jamin,
I'm wondering whether you ever had v14.2.3 or v14.2.4 installed on these OSDs?

#26 Updated by Jamin Collins 8 months ago

Igor Fedotov wrote:

Hi Jamin,
I'm wondering whether you ever had v14.2.3 or v14.2.4 installed on these OSDs?

I'm sure at one point I did, but they were very recently completely rebuilt under 14.2.6.

I have been completely removing each OSD from the cluster to rebuild it with an adequately sized SSD volume housed in an LVM volume group.

The process I've been using to remove each OSD is:

  • stop the OSD and disable the systemd OSD process
  • remove the OSD from the crush map (ceph osd crush remove osd.X)
  • wait for the cluster to fully recover
  • remove the OSD auth (ceph auth del osd.X)
  • remove the OSD (ceph osd rm X)
  • remove the LVM volumes and groups
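
The steps above can be sketched as an ordered command list. This is a rough sketch only: the `ceph` subcommands are the ones quoted in the list, the systemd unit name follows the `ceph-osd@5.service` pattern seen in the logs, and `osd_removal_commands` is a hypothetical helper. The LVM cleanup is left out because the volume and group names are not given here, and the function only builds the commands without running them.

```python
from typing import List, Tuple

Command = List[str]


def osd_removal_commands(osd_id: int) -> Tuple[List[Command], List[Command]]:
    """Build the commands for removing one OSD, split around the recovery wait.

    Returns (before_recovery, after_recovery). The caller is expected to wait
    for the cluster to fully recover between the two groups, for example by
    checking `ceph health` by hand.
    """
    unit = f"ceph-osd@{osd_id}"   # systemd unit name, as seen in the logs
    name = f"osd.{osd_id}"        # OSD name as used by the crush/auth commands

    before_recovery = [
        ["systemctl", "stop", unit],
        ["systemctl", "disable", unit],
        ["ceph", "osd", "crush", "remove", name],
    ]
    after_recovery = [
        ["ceph", "auth", "del", name],
        ["ceph", "osd", "rm", str(osd_id)],
    ]
    return before_recovery, after_recovery


if __name__ == "__main__":
    before, after = osd_removal_commands(5)
    for cmd in before:
        print(" ".join(cmd))
    print("# wait here until the cluster has fully recovered")
    for cmd in after:
        print(" ".join(cmd))
```

Keeping the two groups separate makes the recovery wait explicit, so the auth and OSD entries are not deleted while data is still being rebalanced.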

#27 Updated by Jamin Collins 8 months ago

At the time of failure both OSDs had been fully removed and rebuilt under 14.2.6. They had been running this way for several days, roughly five based on the systemd logs:

Jan 22 09:03:04 langhus-1 ceph-osd[1025]: 2020-01-22 09:03:04.853 7f78cb024700 -1 osd.0 45928 set_numa_affinity unable to identify publ>
Jan 27 02:05:22 langhus-1 ceph-osd[1025]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_s>

#28 Updated by Igor Fedotov 8 months ago

Thanks, Jamin.
I was thinking this could have been caused by an earlier DB corruption that only became visible later (see https://tracker.ceph.com/issues/42223). But this hypothesis doesn't hold if you redeployed the OSD (and hence cleaned up all persistent data for this specific instance) instead of doing a pure software upgrade.

#29 Updated by Jamin Collins 8 months ago

This same host has now experienced what looks like similar corruption of its ceph-mon filesystem:

Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 2020-01-29 23:32:07.653 7fb83c8da700 -1 rocksdb: submit_common error: Corruption: block check>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x00353836'80538' Value size = 34430)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_v' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_pn' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h: 324: ceph_abort_msg("failed to write to>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba9>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 2020-01-29 23:32:07.666 7fb83c8da700 -1 /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h:>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h: 324: ceph_abort_msg("failed to write to>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba9>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: *** Caught signal (Aborted) **
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  in thread 7fb83c8da700 thread_name:ms_dispatch
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (()+0x14930) [0x7fb845024930]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (gsignal()+0x145) [0x7fb844b35f25]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (abort()+0x12b) [0x7fb844b1f897]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  15: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  16: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 2020-01-29 23:32:07.669 7fb83c8da700 -1 *** Caught signal (Aborted) **
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  in thread 7fb83c8da700 thread_name:ms_dispatch
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (()+0x14930) [0x7fb845024930]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (gsignal()+0x145) [0x7fb844b35f25]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (abort()+0x12b) [0x7fb844b1f897]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  15: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  16: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:     -2> 2020-01-29 23:32:07.653 7fb83c8da700 -1 rocksdb: submit_common error: Corruption: blo>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x00353836'80538' Value size = 34430)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_v' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_pn' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:     -1> 2020-01-29 23:32:07.666 7fb83c8da700 -1 /build/ceph/src/ceph-14.2.6/src/mon/MonitorDB>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h: 324: ceph_abort_msg("failed to write to>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba9>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:      0> 2020-01-29 23:32:07.669 7fb83c8da700 -1 *** Caught signal (Aborted) **
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  in thread 7fb83c8da700 thread_name:ms_dispatch
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (()+0x14930) [0x7fb845024930]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (gsignal()+0x145) [0x7fb844b35f25]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (abort()+0x12b) [0x7fb844b1f897]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  15: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  16: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  -9999> 2020-01-29 23:32:07.653 7fb83c8da700 -1 rocksdb: submit_common error: Corruption: blo>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x00353836'80538' Value size = 34430)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_v' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_pn' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  -9998> 2020-01-29 23:32:07.666 7fb83c8da700 -1 /build/ceph/src/ceph-14.2.6/src/mon/MonitorDB>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h: 324: ceph_abort_msg("failed to write to>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba9>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  -9997> 2020-01-29 23:32:07.669 7fb83c8da700 -1 *** Caught signal (Aborted) **
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  in thread 7fb83c8da700 thread_name:ms_dispatch
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  1: (()+0x14930) [0x7fb845024930]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  2: (gsignal()+0x145) [0x7fb844b35f25]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  3: (abort()+0x12b) [0x7fb844b1f897]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  5: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  6: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  7: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  9: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  10: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  11: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  13: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  15: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  16: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 29 23:32:08 langhus-1 systemd[1]: ceph-mon@langhus-1.service: Main process exited, code=killed, status=6/ABRT
Jan 29 23:32:08 langhus-1 systemd[1]: ceph-mon@langhus-1.service: Failed with result 'signal'.
Jan 29 23:32:18 langhus-1 systemd[1]: ceph-mon@langhus-1.service: Scheduled restart job, restart counter is at 1.
Jan 29 23:32:18 langhus-1 systemd[1]: Stopped Ceph cluster monitor daemon.

#30 Updated by Igor Fedotov 8 months ago

Just in case: have you checked for H/W errors via dmesg?

#31 Updated by Igor Fedotov 8 months ago

And are the DB devices for the OSDs and the MON different?

#32 Updated by Jamin Collins 8 months ago

Host details:

$ grep model /proc/cpuinfo | tail -n 1
model name    : AMD Ryzen 7 3700X 8-Core Processor
$ sudo nvme list | tail -n1
/dev/nvme0n1     S41GNX0M435108       SAMSUNG MZVLB256HAHQ-000L7               1         255.05  GB / 256.06  GB    512   B +  0 B   1L2QEXD7
$ ls -l /var/lib/ceph/osd/ceph-*/| grep db
lrwxrwxrwx 1 ceph ceph  20 Jan 29 13:20 block.db -> /dev/ceph-db/osd0.db
lrwxrwxrwx 1 ceph ceph 21 Jan 27 13:57 block.db -> /dev/ceph-db/osd10.db
lrwxrwxrwx 1 ceph ceph  20 Jan 29 07:52 block.db -> /dev/ceph-db/osd5.db

$ sudo pvs | grep ceph-db
  /dev/sdc       ceph-db lvm2 a--   931.51g  551.51g

$ sudo hdparm -i /dev/sdc

/dev/sdc:

 Model=Samsung SSD 850 EVO mSATA 1TB, FwRev=32101030, SerialNo=S33FNX0J100209D
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=yes: disabled (255) WriteCache=disabled
 Drive conforms to: Unspecified:  ATA/ATAPI-4,5,6,7
$ sudo hdparm -i /dev/sda

/dev/sda:

 Model=ST4000VX007-2DT166, FwRev=CV11, SerialNo=WDH1FSZY
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=7814037168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode

$ sudo hdparm -i /dev/sdb

/dev/sdb:

 Model=HGST HDN724040ALE640, FwRev=MJAOA5E0, SerialNo=PK1334PCJ7ZNRS
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=56
 BuffType=DualPortCache, BuffSize=unknown, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=7814037168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode
$ sudo dmidecode -t memory
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x000C, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: None
    Maximum Capacity: 128 GB
    Error Information Handle: 0x000B
    Number Of Devices: 4

Handle 0x0014, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x0013
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Handle 0x0016, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x0015
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 16384 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL A
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3000C16-16GSXFB
    Rank: 2
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V

Handle 0x0019, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x0018
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL B
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Handle 0x001B, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x001A
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 16384 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL B
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3000C16-16GSXFB
    Rank: 2
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
$ sudo dmidecode -t baseboard
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
    Manufacturer: ASRock
    Product Name: B450M Pro4-F
    Version:                       
    Serial Number: M80-C9009201693
    Asset Tag:                       
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis:                       
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

The OSDs in question (0 and 5) are fronted by the Samsung 850 EVO. The monitor that failed is stored on the Samsung NVMe drive. The OSD drives themselves are made by different manufacturers.

#33 Updated by Jamin Collins 8 months ago

Igor Fedotov wrote:

Just in case: have you checked for H/W errors via dmesg?

No hardware messages in dmesg.

Igor Fedotov wrote:

And are the DB devices for the OSDs and the MON different?

Yes, the DB devices and the MON are on different storage devices.

#34 Updated by Jamin Collins 8 months ago

Kernel:

$ uname -r
5.4.11-arch1-1

#35 Updated by Igor Fedotov 8 months ago

I definitely have no clue what's happened, so I can only suggest some basic/obvious checks:
1) Run smartctl -a for the devices in question.
2) Check/share the OSD logs from before the first crash occurrences. Are there any errors or odd behavior there?
3) Only a single host is currently behaving badly, isn't it?

Having checksum failures at both the OSD and the MON, with different drives behind them, makes me think of H/W or OS issues...

And a side note unrelated to the ticket: consumer-grade SSDs (like the Samsung 850 EVO) are a terribly bad choice as a backend for the BlueStore DB.
The reason is the lack of power loss protection, which results in very inefficient sync write performance. There have been plenty of discussions on the ceph-users mailing list and in some blogs, and I have faced this in my lab too. Hence I suggest considering a replacement sooner rather than later.

#36 Updated by Igor Fedotov 8 months ago

Another side note: AFAIR, having write caching enabled has been reported as bad practice too.

#37 Updated by Jamin Collins 8 months ago

Igor Fedotov wrote:

I definitely have no clue what's happened, so I can only suggest some basic/obvious checks:
1) Run smartctl -a for the devices in question.

The OSD devices are a bit older, but both pass the 'smartctl -a' check.

2) Check/share the OSD logs from before the first crash occurrences. Are there any errors or odd behavior there?

Checked; I didn't see anything that jumped out at me. osd.0's log is attached. I had to trim some content from the beginning but left a full day before the crash. The same goes for osd.5's log, but there I had to remove content from both the beginning and the end to get the file size down (even compressed).

3) Only a single host is currently behaving badly, isn't it?

Yes, it's a single new host with the drives migrated to it.

Having checksum failures at both the OSD and the MON, with different drives behind them, makes me think of H/W or OS issues...

I would agree, but apart from the host hardware, the OS load is the same as on the other 4 nodes in the cluster.

And a side note unrelated to the ticket: consumer-grade SSDs (like the Samsung 850 EVO) are a terribly bad choice as a backend for the BlueStore DB.
The reason is the lack of power loss protection, which results in very inefficient sync write performance. There have been plenty of discussions on the ceph-users mailing list and in some blogs, and I have faced this in my lab too. Hence I suggest considering a replacement sooner rather than later.

The other four nodes in the cluster all have some form of consumer-grade SSD in them, most from less respected manufacturers. The Samsung 850 EVO and the NVMe drive were both new to the cluster as part of the hardware upgrade on this host. The move to an AMD CPU is also new.

#38 Updated by Igor Fedotov 8 months ago

These logs (osd-5 specifically) are very interesting!

Let's start with OSD-5, searching for the 'checksum' keyword.
- First occurrence:
2020-01-26 04:22:01.641 7f970ff79700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x6d000, got 0xdaf7cdb7, expected 0xe26480a9, device location [0x1ab27a2d000~1000], logical extent 0x6d000~1000, object #1:b05ad75a:::rbd_data.2b9d9c6b8b4567.000000000007e0bb:head#

This is the main device, not the DB! It's user data. And it's just a single checksum failure, so the read retry likely returned valid data!

- Next occurrence:
2020-01-26 10:38:44.527 7f971077a700 -1 /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f971077a700 time 2020-01-26 10:38:44.518539
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 2257455429, got 1374367646 in db/000651.sst offset 51208102 size 3912")

It's not clear which device caused the issue, but the DB is involved. In case of BlueFS spillover, the read could have gone to the main device as well. Note the SST file name: db/000651.sst
The OSD managed to restart after that crash.

- Skipping some irrelevant/repeated 'checksum' occurrences, we find:
2020-01-26 23:09:11.356 7fb34e244700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x20000, got 0xcfeb644a, expected 0xdb041ee1, device location [0x202dc0f0000~1000], logical extent 0x1a0000~1000, object #2:66d6f9d9:::rbd_data.5a6c74b0dc51.0000000000049d3f:head#

Again the main device, but a different object and device location.

- The next one is related to the DB again:
2020-01-26 23:39:46.214 7fb360268700 3 rocksdb: [db/db_impl_compaction_flush.cc:2659] Compaction error: Corruption: block checksum mismatch: expected 2705794548, got 186875627 in db/000956.sst offset 684976 size 53335

Note the different SST file name: db/000956.sst

- And then in a postmortem log one can get more info on the previous main device CRC failure:

-4097> 2020-01-26 23:09:11.356 7fb34e244700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x20000, got 0xcfeb644a, expected 0xdb041ee1, device location [0x202dc0f0000~1000], logical extent 0x1a0000~1000, object #2:66d6f9d9:::rbd_data.5a6c74b0dc51.0000000000049d3f:head#
-4096> 2020-01-26 23:09:11.356 7fb34e244700 5 bluestore(/var/lib/ceph/osd/ceph-5) _do_read read at 0x18b000~32000 failed 1 times before succeeding

which says that the read failed only once before succeeding, i.e. the retry was indeed successful!

After that the OSD is unable to start up and fails at db/000956.sst every time.
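The manual walk through the log above can be roughly automated. The following is a minimal sketch, assuming the log lines look like the ones quoted in this comment (other Ceph versions may word them differently). The function and pattern names are hypothetical and are not part of any Ceph tooling.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this thread; they are
# assumptions about the message format, not an official specification.
VERIFY_CSUM = re.compile(
    r"_verify_csum bad \S+ checksum at blob offset \S+, "
    r"got (?P<got>0x[0-9a-f]+), expected (?P<expected>0x[0-9a-f]+), "
    r"device location \[(?P<location>[^\]]+)\]"
)
ROCKSDB_CSUM = re.compile(
    r"block checksum mismatch: expected (?P<expected>\d+), got (?P<got>\d+)\s+"
    r"in (?P<sst>\S+\.sst) offset (?P<offset>\d+) size (?P<size>\d+)"
)
READ_RETRY = re.compile(
    r"_do_read read at \S+ failed (?P<count>\d+) times before succeeding"
)


def classify_checksum_errors(lines):
    """Split checksum-related log lines into main-device and RocksDB groups."""
    main_device = []       # device locations of BlueStore user-data failures
    sst_files = Counter()  # SST file name -> number of mismatch reports
    retried_reads = 0      # reads that succeeded after at least one retry
    for line in lines:
        match = VERIFY_CSUM.search(line)
        if match:
            main_device.append(match.group("location"))
            continue
        match = ROCKSDB_CSUM.search(line)
        if match:
            sst_files[match.group("sst")] += 1
            continue
        if READ_RETRY.search(line):
            retried_reads += 1
    return {
        "main_device": main_device,
        "sst_files": dict(sst_files),
        "retried_reads": retried_reads,
    }
```

Feeding it the lines of an OSD log (for example, an open file object) gives a quick view of whether the failures hit user data, SST files, or both, and whether any user-data reads recovered on retry.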

The above (intermittent main device read failures!) makes me think that you're finally observing another reincarnation of
https://tracker.ceph.com/issues/22464

https://github.com/ceph/ceph/pull/24649

It's about exactly the same main device failures, presumably caused by high memory pressure (I'm afraid nobody knows for sure). The mentioned patch works around such cases by reattempting failed reads, and it has shown pretty good results so far, including for the main device failures on your OSD-5.
But this patch fixes user data reads ONLY! It doesn't apply to DB data on either the main or the DB device.
And I've been waiting for this issue to reappear for DB data for a while...
Now I presume this has happened. And chances are that RocksDB failed to withstand such a read failure at some point and finally got damaged.
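To illustrate the general idea of reattempting a read whose checksum does not verify, here is a minimal sketch. It is not the code from the mentioned patch: read_with_retries, read_block and max_retries are hypothetical names, and zlib.crc32 (plain CRC-32) only stands in for the crc32c checksum that BlueStore reports in the logs.

```python
import zlib


def read_with_retries(read_block, expected_crc, max_retries=3):
    """Call read_block() until its data matches expected_crc.

    Returns (data, failed_attempts). Raises IOError once more than
    max_retries reads have failed verification.
    """
    failed_attempts = 0
    while True:
        data = read_block()
        # zlib.crc32 is plain CRC-32, used here only as a stand-in checksum.
        if zlib.crc32(data) == expected_crc:
            return data, failed_attempts
        failed_attempts += 1
        if failed_attempts > max_retries:
            raise IOError(
                f"checksum mismatch persisted after {failed_attempts} reads"
            )
```

A transient failure shows up as a non-zero failed_attempts with valid data returned, which corresponds to the "failed 1 times before succeeding" message above. A read path without such a retry loop has no second chance when the first read returns bad data.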

You may want to check the "bluestore_reads_with_retries" performance counter for the other OSDs at this host, if any. A non-zero value would support the above analysis.
Also, could you please set debug_bluefs to 20, try to restart the OSD, and collect the fresh log? I'd like to check where the broken SST files lie (i.e. whether there was any spillover to the main device). I'm just curious whether flash drive access might suffer from the same read issue.

#39 Updated by Igor Fedotov 8 months ago

I was about to suggest monitoring memory utilization on this host, including swapping. But then I realized that the current state might be completely different, as 2 OSDs are dead. Nevertheless, please keep that in mind.
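One simple way to keep an eye on memory and swap usage is to sample /proc/meminfo from time to time. The following is a minimal sketch, assuming a Linux kernel that exposes MemTotal, MemAvailable, SwapTotal and SwapFree there. The helper names are hypothetical, and what counts as "high" pressure is left to the operator.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            # Most values carry a "kB" suffix; only the number is kept.
            values[key.strip()] = int(fields[0])
    return values


def memory_pressure(meminfo):
    """Summarize swap usage and the share of memory still available."""
    swap_used = meminfo["SwapTotal"] - meminfo["SwapFree"]
    available_pct = 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]
    return {"swap_used_kib": swap_used, "mem_available_pct": available_pct}


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling memory_pressure(read_meminfo()) periodically, for example from cron, and logging the result with a timestamp would make it possible to correlate checksum failures with memory pressure afterwards.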

#40 Updated by Igor Fedotov 8 months ago

As for OSD-0, the provided log shows a single permanent checksum failure all the way through. But it makes sense to check the earlier logs for checksum failures similar to those on OSD-5.

And OSD-0 did finally start, so I'm curious: what happened for this to succeed?

#41 Updated by Jamin Collins 8 months ago

Once I got the cluster back to a healthy state, I removed and recreated both osd.0 and osd.5 to fully recover the cluster.

How do I check "bluestore_reads_with_retries" values for the OSDs within the cluster?

#42 Updated by Igor Fedotov 8 months ago

Run: ceph daemon osd.N perf dump
and look for the keyword in the output

#43 Updated by Jamin Collins 8 months ago

Does the "bluestore_reads_with_retries" counter reset on an OSD restart? I'm asking because all three OSDs on the host currently report 0, but all have also been restarted recently.

$ sudo ceph daemon osd.0 perf dump | jq .bluestore.bluestore_reads_with_retries
0

$ sudo ceph daemon osd.5 perf dump | jq .bluestore.bluestore_reads_with_retries
0

$ sudo ceph daemon osd.10 perf dump | jq .bluestore.bluestore_reads_with_retries
0

Also, I presume you want the debug log from a failing OSD, right? If so, I'll gather it when one of these fails again.

#44 Updated by Igor Fedotov 8 months ago

Yes, they reset on restart.
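Because the counters reset on restart, it can help to record them periodically so that a non-zero value is not lost when an OSD crashes. Below is a minimal sketch that wraps the "ceph daemon osd.N perf dump" command and the JSON path used with jq earlier in this thread. The function names are hypothetical, and the script has to run on the OSD host with access to the admin sockets.

```python
import json
import subprocess


def reads_with_retries(perf_dump_json):
    """Extract bluestore_reads_with_retries from perf dump JSON text.

    Returns None if the section or counter is missing (e.g. a filestore OSD).
    """
    perf = json.loads(perf_dump_json)
    return perf.get("bluestore", {}).get("bluestore_reads_with_retries")


def query_osd(osd_id):
    """Run `ceph daemon osd.<id> perf dump` locally and return the counter."""
    result = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"],
        check=True,
        capture_output=True,
        text=True,
    )
    return reads_with_retries(result.stdout)


def snapshot(osd_ids, query=query_osd):
    """Collect the counter for several OSDs as {osd_id: value}."""
    return {osd_id: query(osd_id) for osd_id in osd_ids}
```

For the host discussed here, snapshot([0, 5, 10]) would cover its three OSDs. Writing the result to a file with a timestamp at regular intervals keeps a history that survives OSD restarts.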

#45 Updated by Sage Weil 7 months ago

  • Priority changed from Urgent to High

#46 Updated by Neha Ojha about 2 months ago

  • Priority changed from High to Normal

#47 Updated by Aleksandr Rudenko about 1 month ago

I'm seeing this on 12.2.12

Part of the OSD log:

   -18> 2020-08-12 19:23:44.329010 7f3ca01d4d40  0 filestore(/var/lib/ceph/osd/ceph-323) start omap initiation
   -17> 2020-08-12 19:23:44.329090 7f3ca01d4d40  0  set rocksdb option base_background_compactions = 2
   -16> 2020-08-12 19:23:44.329105 7f3ca01d4d40  0  set rocksdb option compaction_readahead_size = 2097152
   -15> 2020-08-12 19:23:44.329120 7f3ca01d4d40  0  set rocksdb option compression = kNoCompression
   -14> 2020-08-12 19:23:44.329131 7f3ca01d4d40  0  set rocksdb option max_background_compactions = 16
   -13> 2020-08-12 19:23:44.329138 7f3ca01d4d40  0  set rocksdb option max_write_buffer_number = 4
   -12> 2020-08-12 19:23:44.329144 7f3ca01d4d40  0  set rocksdb option min_write_buffer_number_to_merge = 2
   -11> 2020-08-12 19:23:44.329190 7f3ca01d4d40  0  set rocksdb option base_background_compactions = 2
   -10> 2020-08-12 19:23:44.329199 7f3ca01d4d40  0  set rocksdb option compaction_readahead_size = 2097152
    -9> 2020-08-12 19:23:44.329205 7f3ca01d4d40  0  set rocksdb option compression = kNoCompression
    -8> 2020-08-12 19:23:44.329210 7f3ca01d4d40  0  set rocksdb option max_background_compactions = 16
    -7> 2020-08-12 19:23:44.329215 7f3ca01d4d40  0  set rocksdb option max_write_buffer_number = 4
    -6> 2020-08-12 19:23:44.329220 7f3ca01d4d40  0  set rocksdb option min_write_buffer_number_to_merge = 2
    -5> 2020-08-12 19:23:47.998200 7f3ca01d4d40  0 filestore(/var/lib/ceph/osd/ceph-323) mount(1759): enabling WRITEAHEAD journal mode: checkpoint is not enabled
    -4> 2020-08-12 19:23:48.005244 7f3ca01d4d40 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x00'_')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.acl')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.content_type')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.etag')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.idtag')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.manifest')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.pg_ver')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.source_zone')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.tail_tag')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.x-amz-content-sha256')
    -3> 2020-08-12 19:23:48.005265 7f3ca01d4d40 -1 filestore(/var/lib/ceph/osd/ceph-323)  error (1) Operation not permitted not handled on operation 0x7f3ccdea5042 (22733464.0.1, or op 1, counting from 0)
    -2> 2020-08-12 19:23:48.005278 7f3ca01d4d40  0 filestore(/var/lib/ceph/osd/ceph-323) EPERM suggests file(s) in osd data dir not owned by ceph user, or leveldb corruption
    -1> 2020-08-12 19:23:48.005282 7f3ca01d4d40  0 filestore(/var/lib/ceph/osd/ceph-323)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "touch",
            "collection": "10.3e50_head",
            "oid": "#10:0a7fab97:::default.38952138.358_Veeam%2fArchive%2ftest12%2f12020b78-734e-442c-97a1-e6627ad504c7%2f82f94cc1-8b50-413d-3c35-001c99f3f69d%2fblocks%2fa1e5c4330b80543b875f50ee439ef697%2f13

It's a filestore OSD.
The OSD disk is healthy.
The OSD's journal is on an SSD, which is healthy.
