Bug #37282
rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2
Description
I have an OSD that will not start. It keeps crashing, and I am not sure where to go from here. Unfortunately, it happened right after 2 other drives died. This means I have PGs down and cannot access the files in cephfs.
# /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph
starting osd.8 at - osd_data /var/lib/ceph/osd/ceph-8 /var/lib/ceph/osd/ceph-8/journal
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f4c4aea3700 time 2018-11-15 17:28:00.093400
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0)
2018-11-15 17:28:00.091 7f4c4aea3700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2133069443, got 3635521166 in db/002194.sst offset 30843842 size 4614 code = 2
Rocksdb transaction:
Put( Prefix = P key = 0x00000000005543dd'.can_rollback_to' Value size = 12)
Put( Prefix = P key = 0x00000000005543dd'.rollback_info_trimmed_to' Value size = 12)
Put( Prefix = O key = 0x858000000000000015f000000021213dfffffffffffffffeffffffffffffffff'o' Value size = 31)
Put( Prefix = S key = 'nid_max' Value size = 8)
Put( Prefix = S key = 'blobid_max' Value size = 8)
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f4c615c65c2]
 2: (()+0x26c787) [0x7f4c615c6787]
 3: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 5: (()+0x76db) [0x7f4c5fcc06db]
 6: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2018-11-15 17:28:00.091 7f4c4aea3700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f4c4aea3700 time 2018-11-15 17:28:00.093400
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0)
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f4c615c65c2]
 2: (()+0x26c787) [0x7f4c615c6787]
 3: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 5: (()+0x76db) [0x7f4c5fcc06db]
 6: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -1> 2018-11-15 17:28:00.091 7f4c4aea3700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2133069443, got 3635521166 in db/002194.sst offset 30843842 size 4614 code = 2
Rocksdb transaction:
Put( Prefix = P key = 0x00000000005543dd'.can_rollback_to' Value size = 12)
Put( Prefix = P key = 0x00000000005543dd'.rollback_info_trimmed_to' Value size = 12)
Put( Prefix = O key = 0x858000000000000015f000000021213dfffffffffffffffeffffffffffffffff'o' Value size = 31)
Put( Prefix = S key = 'nid_max' Value size = 8)
Put( Prefix = S key = 'blobid_max' Value size = 8)
     0> 2018-11-15 17:28:00.091 7f4c4aea3700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f4c4aea3700 time 2018-11-15 17:28:00.093400
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0)
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f4c615c65c2]
 2: (()+0x26c787) [0x7f4c615c6787]
 3: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 5: (()+0x76db) [0x7f4c5fcc06db]
 6: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7f4c4aea3700 thread_name:bstore_kv_sync
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (()+0x91a780) [0x55c37e0f8780]
 2: (()+0x12890) [0x7f4c5fccb890]
 3: (gsignal()+0xc7) [0x7f4c5eba6e97]
 4: (abort()+0x141) [0x7f4c5eba8801]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f4c615c6710]
 6: (()+0x26c787) [0x7f4c615c6787]
 7: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 9: (()+0x76db) [0x7f4c5fcc06db]
 10: (clone()+0x3f) [0x7f4c5ec8988f]
2018-11-15 17:28:00.095 7f4c4aea3700 -1 *** Caught signal (Aborted) **
 in thread 7f4c4aea3700 thread_name:bstore_kv_sync
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (()+0x91a780) [0x55c37e0f8780]
 2: (()+0x12890) [0x7f4c5fccb890]
 3: (gsignal()+0xc7) [0x7f4c5eba6e97]
 4: (abort()+0x141) [0x7f4c5eba8801]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f4c615c6710]
 6: (()+0x26c787) [0x7f4c615c6787]
 7: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 9: (()+0x76db) [0x7f4c5fcc06db]
 10: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2018-11-15 17:28:00.095 7f4c4aea3700 -1 *** Caught signal (Aborted) **
 in thread 7f4c4aea3700 thread_name:bstore_kv_sync
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (()+0x91a780) [0x55c37e0f8780]
 2: (()+0x12890) [0x7f4c5fccb890]
 3: (gsignal()+0xc7) [0x7f4c5eba6e97]
 4: (abort()+0x141) [0x7f4c5eba8801]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f4c615c6710]
 6: (()+0x26c787) [0x7f4c615c6787]
 7: (BlueStore::_kv_sync_thread()+0x13e6) [0x55c37dfe1ce6]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55c37e02664d]
 9: (()+0x76db) [0x7f4c5fcc06db]
 10: (clone()+0x3f) [0x7f4c5ec8988f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted (core dumped)
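For context on what this error means: RocksDB stores a checksum alongside every block it writes to an .sst file and recomputes it on read; the "block checksum mismatch" above is that recomputation disagreeing with the stored value, i.e. the bytes read back from disk differ from what was written. A minimal sketch of the idea, using POSIX `cksum`'s CRC as a stand-in for the CRC32C that RocksDB actually uses (file names are illustrative):

```shell
# Write a payload and record its CRC, loosely analogous to the per-block
# checksum RocksDB stores in an .sst file.
printf 'key=nid_max value=42' > block.bin
cksum block.bin | awk '{print $1}' > block.crc

# Verification passes while the data is intact.
test "$(cksum block.bin | awk '{print $1}')" = "$(cat block.crc)" && echo OK

# Corrupt a single byte in place, as failing media might; verification now
# fails, which is what RocksDB surfaces as "Corruption: block checksum mismatch".
printf 'X' | dd of=block.bin bs=1 count=1 conv=notrunc 2>/dev/null
test "$(cksum block.bin | awk '{print $1}')" = "$(cat block.crc)" || echo CORRUPTED
```
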
Updated by Igor Fedotov over 5 years ago
First, I suggest verifying the disk drive behind the DB volume for physical errors.
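For anyone following along, the kind of check being suggested might look like the following (device names are illustrative; substitute the drive actually backing the DB volume):

```shell
# Find which device backs the OSD's DB volume (the block.db symlink).
ls -l /var/lib/ceph/osd/ceph-8/block.db

# Look for I/O errors the kernel has already logged for that device.
dmesg -T | grep -iE 'sdX|medium error|i/o error'

# SMART health summary and the drive's error log.
smartctl -H /dev/sdX
smartctl -l error /dev/sdX

# Optionally read the device end-to-end to force latent read errors to surface.
dd if=/dev/sdX of=/dev/null bs=1M status=progress
```
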
Updated by Jeff Smith over 5 years ago
I have checked the kernel log and smartctl and do not see any errors.
Updated by Igor Fedotov over 5 years ago
Somewhat similar issue, may be useful as recovery guidance:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031595.html
Updated by Josh Durgin over 5 years ago
- Status changed from New to Need More Info
Updated by David Sieger about 5 years ago
- File ceph-osd.25.log ceph-osd.25.log added
I might have been bitten by the same issue. The OSD in question has its main data on a spinning drive and its database on a partition of an SSD. A hardware issue has not been completely ruled out; it just looks unlikely, as far as I was able to investigate.
I ran ceph-bluestore-tool fsck on the OSD, which resulted in this output:
$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-25
2019-02-01 14:00:16.482736 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x300000
2019-02-01 14:00:16.482753 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: 0x7f8000000000000002df7e2d92217262'.0.5a6ca9.238e1f29.0000000082fc!='0xfffffffffffffffeffffffffffffffff6f00300000'x' is unexpected
2019-02-01 14:00:16.482773 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x380000
2019-02-01 14:00:16.482774 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: 0x7f8000000000000002df7e2d92217262'.0.5a6ca9.238e1f29.0000000082fc!='0xfffffffffffffffeffffffffffffffff6f00380000'x' is unexpected
2019-02-01 14:00:44.644396 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: actual store_statfs(0x49108d0000/0xe8e0c00000, stored 0x9dfe74e1aa/0x9f90320000, compress 0x0/0x0/0x0) != expected store_statfs(0x49108d0000/0xe8e0c00000, stored 0x9dfe34e1aa/0x9f8ff20000, compress 0x0/0x0/0x0)
2019-02-01 14:00:46.974661 7f39b18cbec0 -1 bluestore(/var/lib/ceph/osd/ceph-25) fsck error: leaked extent 0xb29b0a0000~400000
fsck success
It did not make any difference, though. Also, I cannot tell whether the errors noted by fsck are related to this issue or not.
The crash itself looks like this:
    -1> 2019-02-01 12:22:46.111821 7fe079d53700 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = O key = 0x7f80000000000000021600000021213dfffffffffffffffeffffffffffffffff'o' Value size = 30)
     0> 2019-02-01 12:22:46.117761 7fe079d53700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fe079d53700 time 2019-02-01 12:22:46.111884
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 8717: FAILED assert(r == 0)
 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x562af51e5e90]
 2: (BlueStore::_kv_sync_thread()+0x3482) [0x562af5090162]
 3: (BlueStore::KVSyncThread::entry()+0xd) [0x562af50d701d]
 4: (()+0x7e25) [0x7fe089e12e25]
 5: (clone()+0x6d) [0x7fe088f03bad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I attached a log file of the full startup-and-crash cycle.
Updated by Sage Weil about 5 years ago
We're not sure how to proceed without being able to reproduce the crash, and we have never seen this ourselves.
1. Would it be possible to provide a copy of the rocksdb portion of your osd? I'm hoping that will expose the rocksdb issue and would then let us hit the same error locally with something like ceph-kvstore-tool. You'd do this with
ceph-bluestore-tool bluefs-export ...
2. Or, could you provide a full image of the osd? This is bigger, and obviously isn't possible if the data is sensitive, but if 1 doesn't work, hopefully 2 would let us see the problem.
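For reference, the bluefs-export invocation in option 1 might look like the following (the OSD id and output directory are illustrative); the resulting directory contains the RocksDB files that BlueFS was hosting:

```shell
# Stop the OSD first so the store is quiescent.
systemctl stop ceph-osd@8

# Export the BlueFS contents (i.e. the RocksDB files) to a plain directory.
ceph-bluestore-tool bluefs-export \
    --path /var/lib/ceph/osd/ceph-8 \
    --out-dir /tmp/osd.8-bluefs

# The exported db/ directory can then be examined offline, e.g.:
# ceph-kvstore-tool rocksdb /tmp/osd.8-bluefs/db list
```
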
Thanks!
Updated by Radoslaw Zarzynski about 5 years ago
Keeping "needs more info" state.
Updated by Dan van der Ster about 5 years ago
We just saw this on an OSD (block.db on SSD, data on HDD). The OSD is from a CephFS cluster running 12.2.11.
We're actively converting this cluster from filestore to bluestore; this osd had just been created as bluestore around 08:50 on 2019-04-08 and was still backfilling in its PGs.
The osd started crashing like this:
2019-04-08 14:57:16.223895 7f1df1264700  2 rocksdb: [/builddir/build/BUILD/ceph-12.2.11/src/rocksdb/db/db_impl_compaction_flush.cc:1275] Waiting after background compaction error: Corruption: block checksum mismatch, Accumulated background error counts: 1
2019-04-08 14:57:16.304853 7f1df2a67700 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = O key = 0x7f8000000000000001d840000021213dfffffffffffffffeffffffffffffffff'o' Value size = 29)
2019-04-08 14:57:16.307051 7f1df2a67700 -1 /builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f1df2a67700 time 2019-04-08 14:57:16.304885
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueStore.cc: 8795: FAILED assert(r == 0)
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55aa5f5d8b20]
 2: (BlueStore::_kv_sync_thread()+0x3482) [0x55aa5f4811c2]
 3: (BlueStore::KVSyncThread::entry()+0xd) [0x55aa5f4c86dd]
 4: (()+0x7dd5) [0x7f1e02b2cdd5]
 5: (clone()+0x6d) [0x7f1e01c1cead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
We zapped the hdd and ssd partition, recreated with the same osd-id, and then during backfilling got this around 12 hours later:
    -2> 2019-04-09 03:21:33.861491 7fa811497700  0 osd.121 pg_epoch: 60035 pg[1.308( v 60035'87922443 lc 59819'87922140 (59734'87920547,60035'87922443] local-lis/les=60034/60035 n=147011 ec=369/369 lis/c 60034/59631 les/c/f 60035/59632/0 60034/60034/60034) [121,142,61] r=0 lpr=60034 pi=[59631,60034)/1 crt=60033'87922442 mlcod 59819'87922140 active+recovering+degraded m=208 mbc={255={(2+0)=36}}] _update_calc_stats ml 36 upset size 3 up 2
    -1> 2019-04-09 03:21:33.887078 7fa80cc8e700 -1 abort: Corruption: Bad table magic number
     0> 2019-04-09 03:21:33.891522 7fa80cc8e700 -1 *** Caught signal (Aborted) **
 in thread 7fa80cc8e700 thread_name:tp_osd_tp
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)
 1: (()+0xa63b61) [0x55e9ea9a3b61]
 2: (()+0xf5d0) [0x7fa828d735d0]
 3: (gsignal()+0x37) [0x7fa827d94207]
 4: (abort()+0x148) [0x7fa827d958f8]
 5: (RocksDBStore::get(std::string const&, char const*, unsigned long, ceph::buffer::list*)+0x1ce) [0x55e9ea8f3b6e]
 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x548) [0x55e9ea89e3d8]
 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xd9e) [0x55e9ea8b085e]
 8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3a0) [0x55e9ea8b1f90]
 9: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x171) [0x55e9ea495a31]
 10: (PrimaryLogPG::remove_missing_object(hobject_t const&, eversion_t, Context*)+0x70b) [0x55e9ea5a2a5b]
 11: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x9e1) [0x55e9ea5c2461]
 12: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0xfe4) [0x55e9ea5ff834]
 13: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x490) [0x55e9ea607ab0]
Now we tried fsck and got this output:
2019-04-09 11:24:29.051127 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x80000
2019-04-09 11:24:29.051138 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00080000'x' is unexpected
2019-04-09 11:24:29.051149 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x100000
2019-04-09 11:24:29.051151 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00100000'x' is unexpected
2019-04-09 11:24:29.051155 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x180000
2019-04-09 11:24:29.051156 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00180000'x' is unexpected
2019-04-09 11:24:29.051160 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x200000
2019-04-09 11:24:29.051160 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00200000'x' is unexpected
2019-04-09 11:24:29.051164 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x280000
2019-04-09 11:24:29.051165 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00280000'x' is unexpected
2019-04-09 11:24:29.051168 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x300000
2019-04-09 11:24:29.051169 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00300000'x' is unexpected
2019-04-09 11:24:29.051172 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: stray shard 0x380000
2019-04-09 11:24:29.051172 7fdb4be78ec0 -1 bluestore(/var/lib/ceph/osd/ceph-121) fsck error: 0x7f8000000000000001254031'A!40006456a32.00000000!='0xfffffffffffffffeffffffffffffffff6f00380000'x' is unexpected
2019-04-09 11:24:34.789904 7fdb4be78ec0 -1 abort: Corruption: Bad table magic number
*** Caught signal (Aborted) **
 in thread 7fdb4be78ec0 thread_name:ceph-bluestore-
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)
 1: (()+0x3fd311) [0x5592da925311]
 2: (()+0xf5d0) [0x7fdb40eb85d0]
 3: (gsignal()+0x37) [0x7fdb3f8a1207]
 4: (abort()+0x148) [0x7fdb3f8a28f8]
 5: (RocksDBStore::get(std::string const&, std::string const&, ceph::buffer::list*)+0x1c7) [0x5592da7dc4a7]
 6: (()+0x1fb244) [0x5592da723244]
 7: (()+0x1fa00f) [0x5592da72200f]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x3a3) [0x5592da77e8f3]
 9: (BlueStore::_fsck(bool, bool)+0x1d79) [0x5592da7a2089]
 10: (main()+0x154f) [0x5592da655ddf]
 11: (__libc_start_main()+0xf5) [0x7fdb3f88d3d5]
 12: (()+0x1c4f8f) [0x5592da6ecf8f]
2019-04-09 11:24:34.791061 7fdb4be78ec0 -1 *** Caught signal (Aborted) **
 in thread 7fdb4be78ec0 thread_name:ceph-bluestore-
There are no kernel messages about medium errors. The SMART counters look fine, and a long SMART test is still ongoing.
I posted the bluefs-export to a5597af2-08c6-47a8-a3e9-029b1ac2e7bf
Updated by Sage Weil almost 5 years ago
- Priority changed from High to Urgent
Taking a look at this. It's interesting that it happened twice on the same device(s)... did it occur again after that or did you just skip that device?
Updated by Dan van der Ster almost 5 years ago
Same devices (HDD for data and SSD partition for block.db) for both failures.
We have left the OSD down since the second failure, so we can do whatever is needed now to help debug.
Updated by Sage Weil almost 5 years ago
Dan, that dump appears to have multiple errors:
2019-04-25 09:15:18.766 7f8c0a4b0140  1 rocksdb: do_open column families: [default]
2019-04-25 09:15:18.772 7f8c027fc700  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/table/block_based_table_reader.cc:1159] Encountered error while reading data from properties block Corruption: block checksum mismatch: expected 1627042428, got 2680008382 in 121/db/000307.sst offset 67524299 size 82
2019-04-25 09:15:18.772 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 293 --- Corruption: bad block contents
2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 294 --- Corruption: bad block contents
2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 295 --- NotFound:
2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 296 --- Corruption: bad block contents
2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 301 --- Corruption: bad block contents
2019-04-25 09:15:18.773 7f8c0a4b0140  2 rocksdb: [/home/sage/src/ceph/src/rocksdb/db/version_set.cc:1315] Unable to load table properties for file 302 --- NotFound:
ceph-kvstore-tool: /home/sage/src/ceph/src/rocksdb/table/block.cc:731: uint32_t rocksdb::Block::NumRestarts() const: Assertion `size_ >= 2*sizeof(uint32_t)' failed.
would it be possible to try that OSD one more time, but with debug_bluefs=20? That may give us some clue where the corruption is coming from.
Updated by Dan van der Ster almost 5 years ago
would it be possible to try that OSD one more time
do you mean to zap/recreate it and try again or just start it as-is with debug_bluefs=20 ?
Updated by Sage Weil almost 5 years ago
Dan van der Ster wrote:
would it be possible to try that OSD one more time
do you mean to zap/recreate it and try again or just start it as-is with debug_bluefs=20 ?
zap/recreate, but with debug turned up
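A sketch of that zap/recreate-with-debug cycle (device paths and flags are illustrative; the exact ceph-volume invocation depends on how the OSD was originally created, and on Luminous the debug setting goes in ceph.conf rather than the centralized config store):

```shell
# On the OSD host, raise BlueFS debug logging before recreating the OSD.
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd.121]
debug bluefs = 20
EOF

# Destroy the old data disk and DB partition, keeping the osd id for reuse.
ceph-volume lvm zap --destroy /dev/sdX
ceph-volume lvm zap --destroy /dev/sdY2

# Recreate with the same id: HDD for data, SSD partition for block.db.
ceph-volume lvm create --osd-id 121 --data /dev/sdX --block.db /dev/sdY2
```
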
Updated by Dan van der Ster almost 5 years ago
Sage Weil wrote:
Dan van der Ster wrote:
would it be possible to try that OSD one more time
do you mean to zap/recreate it and try again or just start it as-is with debug_bluefs=20 ?
zap/recreate, but with debug turned up
OK, that's done and it's running now. Last time it took a few hours to crash; I'll update when we have a crash (or when our log dir fills up ;) ).
Updated by Igor Fedotov almost 5 years ago
Just to record a similar case and the discovered root cause:
Our customer, running Ceph v12.2.11, complained about the same errors, which prevented the OSD from starting:
7fc4e66eb700 -1 rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
...
Additional investigation uncovered several earlier OSD crashes caused by unexpected failures reported during BlueFS flush (triggered by RocksDB compaction). The first one had occurred 3 days before the initially reported failure.
Corresponding log output:
-1 bdev(.../block) _aio_thread got r=-61 ((61) No data available)
-1 .../KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f4cc3be9700 time ... KernelDevice.cc: 392: FAILED assert(0 == "got unexpected error from aio_t::get_return_value. " "This may suggest HW issue. Please check your dmesg!")
dmesg output analysis showed relevant disk write failures:
kernel: sd 1:1:0:7: [sdi] Unaligned partial completion (resid=32, sector_sz=512)
kernel: sd 1:1:0:7: [sdi] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 1:1:0:7: [sdi] tag#9 Sense Key : Medium Error [current]
kernel: sd 1:1:0:7: [sdi] tag#9 Add. Sense: Unrecovered read error
kernel: sd 1:1:0:7: [sdi] tag#9 CDB: Write(10) 2a 00 6c 03 7d 20 00 02 00 00
kernel: blk_update_request: critical medium error, dev sdi, sector 1812167968
smartctl report is clean though.
A notable detail is the RAID controller in between, which probably hides the proper SMART information.
Hence we've treated this as a HW failure.
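As a side note on the RAID observation: smartctl can often query the physical drives behind a RAID controller directly via its -d option; the controller types and drive indices below are illustrative:

```shell
# LSI/Broadcom MegaRAID: address physical drive 0 behind the controller.
smartctl -a -d megaraid,0 /dev/sdi

# Other controllers use analogous selectors, e.g.:
# smartctl -a -d 3ware,0 /dev/twa0
# smartctl -a -d cciss,0 /dev/sg0
```
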
Updated by Paul Emmerich over 4 years ago
I'm seeing this on 14.2.2. Disk seems healthy.
The OSD in question suffered from https://tracker.ceph.com/issues/40080 before this crash happened.
Updated by Sage Weil over 4 years ago
- Related to Bug #40080: Bitmap allocator return duplicate entries which cause interval_set assert added
Updated by Sage Weil over 4 years ago
- Subject changed from /build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 9073: FAILED assert(r == 0) to rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2
Updated by Igor Fedotov over 4 years ago
One more occurrence:
https://tracker.ceph.com/issues/41367
Updated by Neha Ojha over 4 years ago
- Related to Bug #41367: rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 added
Updated by Sage Weil about 4 years ago
- Target version changed from v13.2.2 to v15.0.0
Updated by Jamin Collins about 4 years ago
It appears that I'm seeing the same problem with (AFAIK) the most recent version of Ceph:
The OSDs in question are rotational devices fronted by an SSD-backed LVM volume.
Jan 27 07:35:23 langhus-1 systemd[1]: Starting Ceph object storage daemon osd.0... Jan 27 07:35:23 langhus-1 systemd[1]: Started Ceph object storage daemon osd.0. Jan 27 07:35:24 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:24.191 7fc46a8eac00 -1 Falling back to public interface Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409 Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:26.087 7fc45de45700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2932418700, got 2818836186 in db/001491.sst offset 18135301 size 3865 code = 2 Rocksdb transaction: Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Delete( Prefix = O key = 0x7f7ffffffffffffffcdd000000217363'rub_2.bb!='0xfffffffffffffffeffffffffffffffff'o') Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'nid_max' Value size = 8) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'blobid_max' Value size = 8) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (()+0x501a2f) [0x563c6d22ea2f] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x94cf) [0x7fc46ae774cf] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (clone()+0x43) [0x7fc46aa2f2d3] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:26.091 
7fc45de45700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409 Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (()+0x501a2f) [0x563c6d22ea2f] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x94cf) [0x7fc46ae774cf] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (clone()+0x43) [0x7fc46aa2f2d3] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: *** Caught signal (Aborted) ** Jan 27 07:35:26 langhus-1 ceph-osd[170413]: in thread 7fc45de45700 thread_name:bstore_kv_sync Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (()+0x14930) [0x7fc46ae82930] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (gsignal()+0x145) [0x7fc46a96bf25] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (abort()+0x12b) [0x7fc46a955897] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x501a2f) [0x563c6d22ea2f] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd] Jan 27 07:35:26 langhus-1 
ceph-osd[170413]: 8: (()+0x94cf) [0x7fc46ae774cf] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 9: (clone()+0x43) [0x7fc46aa2f2d3] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2020-01-27 07:35:26.091 7fc45de45700 -1 *** Caught signal (Aborted) ** Jan 27 07:35:26 langhus-1 ceph-osd[170413]: in thread 7fc45de45700 thread_name:bstore_kv_sync Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (()+0x14930) [0x7fc46ae82930] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (gsignal()+0x145) [0x7fc46a96bf25] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (abort()+0x12b) [0x7fc46a955897] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x501a2f) [0x563c6d22ea2f] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 8: (()+0x94cf) [0x7fc46ae774cf] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 9: (clone()+0x43) [0x7fc46aa2f2d3] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: -481> 2020-01-27 07:35:24.191 7fc46a8eac00 -1 Falling back to public interface Jan 27 07:35:26 langhus-1 ceph-osd[170413]: -2> 2020-01-27 07:35:26.087 7fc45de45700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2932418700, got 2818836186 in db/001491.sst offset 18135301 size 3865 code = 2 Rocksdb transaction: Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Delete( Prefix = O key = 0x7f7ffffffffffffffcdd000000217363'rub_2.bb!='0xfffffffffffffffeffffffffffffffff'o') Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'nid_max' Value size = 8) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'blobid_max' Value size = 8) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: -1> 2020-01-27 07:35:26.091 7fc45de45700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409 Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (()+0x501a2f) [0x563c6d22ea2f] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x94cf) [0x7fc46ae774cf] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (clone()+0x43) [0x7fc46aa2f2d3] Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 0> 2020-01-27 07:35:26.091 7fc45de45700 -1 *** Caught signal (Aborted) ** Jan 27 07:35:26 langhus-1 ceph-osd[170413]: in thread 
7fc45de45700 thread_name:bstore_kv_sync
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (()+0x14930) [0x7fc46ae82930]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (gsignal()+0x145) [0x7fc46a96bf25]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (abort()+0x12b) [0x7fc46a955897]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 8: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 9: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: -716> 2020-01-27 07:35:24.191 7fc46a8eac00 -1 Falling back to public interface
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: -715> 2020-01-27 07:35:26.087 7fc45de45700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2932418700, got 2818836186 in db/001491.sst offset 18135301 size 3865 code = 2 Rocksdb transaction:
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Delete( Prefix = O key = 0x7f7ffffffffffffffcdd000000217363'rub_2.bb!='0xfffffffffffffffeffffffffffffffff'o')
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: -714> 2020-01-27 07:35:26.091 7fc45de45700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc45de45700 time 2020-01-27 07:35:26.092409
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x563c6d22e845]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: -713> 2020-01-27 07:35:26.091 7fc45de45700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: in thread 7fc45de45700 thread_name:bstore_kv_sync
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 1: (()+0x14930) [0x7fc46ae82930]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 2: (gsignal()+0x145) [0x7fc46a96bf25]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 3: (abort()+0x12b) [0x7fc46a955897]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x563c6d22e8a0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 5: (()+0x501a2f) [0x563c6d22ea2f]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 6: (BlueStore::_kv_sync_thread()+0x1160) [0x563c6d881ac0]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 7: (BlueStore::KVSyncThread::entry()+0xd) [0x563c6d8a82bd]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 8: (()+0x94cf) [0x7fc46ae774cf]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: 9: (clone()+0x43) [0x7fc46aa2f2d3]
Jan 27 07:35:26 langhus-1 ceph-osd[170413]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Main process exited, code=killed, status=6/ABRT
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 3.
Jan 27 07:35:26 langhus-1 systemd[1]: Stopped Ceph object storage daemon osd.0.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
Jan 27 07:35:26 langhus-1 systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
Jan 27 07:35:26 langhus-1 systemd[1]: Failed to start Ceph object storage daemon osd.0.
$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0/ /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f3612cd5d80 time 2020-01-27 08:12:34.731236 /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 2932418700, got 2818836186 in db/001491.sst offset 18135301 size 3865") ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f36139cca34] 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c] 3: (()+0x269a3b) [0x5609d1cdea3b] 4: (()+0x2574d1) [0x5609d1ccc4d1] 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc] 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a] 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x5609d1d5b25d] 8: 
(BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1] 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86] 10: (main()+0x1274) [0x5609d1c62314] 11: (__libc_start_main()+0xf3) [0x7f3612f7c153] 12: (_start()+0x2e) [0x5609d1c869ce] *** Caught signal (Aborted) ** in thread 7f3612cd5d80 thread_name:ceph-bluestore- ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) 1: (()+0x14930) [0x7f3613481930] 2: (gsignal()+0x145) [0x7f3612f90f25] 3: (abort()+0x12b) [0x7f3612f7a897] 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f36139ccb0d] 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x5609d1f2376c] 6: (()+0x269a3b) [0x5609d1cdea3b] 7: (()+0x2574d1) [0x5609d1ccc4d1] 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x5609d1d1dbdc] 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x5609d1d2a67a] 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) 
[0x5609d1d5b25d] 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x5609d1d5f5b1] 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x5609d1d6aa86] 13: (main()+0x1274) [0x5609d1c62314] 14: (__libc_start_main()+0xf3) [0x7f3612f7c153] 15: (_start()+0x2e) [0x5609d1c869ce] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted
Jan 27 07:35:25 langhus-1 systemd[1]: Starting Ceph object storage daemon osd.5...
Jan 27 07:35:25 langhus-1 systemd[1]: Started Ceph object storage daemon osd.5.
Jan 27 07:35:25 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:25.937 7ff1dc81ec00 -1 Falling back to public interface
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:27.494 7ff1cfd79700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2705794548, got 186875627 in db/000956.sst offset 684976 size 53335 code = 2 Rocksdb transaction:
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = O key = 0x7f80000000000000028f00000021213dfffffffffffffffeffffffffffffffff'o' Value size = 29)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'nid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: Put( Prefix = S key = 'blobid_max' Value size = 8)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7ff1cfd79700 time 2020-01-27 07:35:27.496963
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x55d416bfd845]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 3: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 5: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 6: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:27.494 7ff1cfd79700 -1 /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7ff1cfd79700 time 2020-01-27 07:35:27.496963
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10954: FAILED ceph_assert(r == 0)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x55d416bfd845]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 3: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 4: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 5: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 6: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: *** Caught signal (Aborted) **
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: in thread 7ff1cfd79700 thread_name:bstore_kv_sync
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 1: (()+0x14930) [0x7ff1dcdb6930]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2: (gsignal()+0x145) [0x7ff1dc89ff25]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 3: (abort()+0x12b) [0x7ff1dc889897]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x55d416bfd8a0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 5: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 6: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 7: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 8: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 9: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2020-01-27 07:35:27.497 7ff1cfd79700 -1 *** Caught signal (Aborted) **
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: in thread 7ff1cfd79700 thread_name:bstore_kv_sync
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 1: (()+0x14930) [0x7ff1dcdb6930]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 2: (gsignal()+0x145) [0x7ff1dc89ff25]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 3: (abort()+0x12b) [0x7ff1dc889897]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x55d416bfd8a0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 5: (()+0x501a2f) [0x55d416bfda2f]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 6: (BlueStore::_kv_sync_thread()+0x1160) [0x55d417250ac0]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 7: (BlueStore::KVSyncThread::entry()+0xd) [0x55d4172772bd]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 8: (()+0x94cf) [0x7ff1dcdab4cf]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: 9: (clone()+0x43) [0x7ff1dc9632d3]
Jan 27 07:35:27 langhus-1 ceph-osd[170517]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Main process exited, code=killed, status=6/ABRT
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Failed with result 'signal'.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Scheduled restart job, restart counter is at 3.
Jan 27 07:35:28 langhus-1 systemd[1]: Stopped Ceph object storage daemon osd.5.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Start request repeated too quickly.
Jan 27 07:35:28 langhus-1 systemd[1]: ceph-osd@5.service: Failed with result 'signal'.
Jan 27 07:35:28 langhus-1 systemd[1]: Failed to start Ceph object storage daemon osd.5.
$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5/
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f5fa5aead80 time 2020-01-27 08:13:04.023081
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 1754497987, got 317490254 in db/000957.sst offset 705635 size 3911")
 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7f5fa67e1a34]
 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 3: (()+0x269a3b) [0x563c50805a3b]
 4: (()+0x2574d1) [0x563c507f34d1]
 5: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 6: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 7: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 10: (main()+0x1274) [0x563c50789314]
 11: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 12: (_start()+0x2e) [0x563c507ad9ce]
*** Caught signal (Aborted) **
 in thread 7f5fa5aead80 thread_name:ceph-bluestore-
 ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
 1: (()+0x14930) [0x7f5fa6296930]
 2: (gsignal()+0x145) [0x7f5fa5da5f25]
 3: (abort()+0x12b) [0x7f5fa5d8f897]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b3) [0x7f5fa67e1b0d]
 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x39c) [0x563c50a4a76c]
 6: (()+0x269a3b) [0x563c50805a3b]
 7: (()+0x2574d1) [0x563c507f34d1]
 8: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x23c) [0x563c50844bdc]
 9: (BlueStore::fsck_check_objects_shallow(BlueStore::FSCKDepth, long, boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mempool::pool_allocator<(mempool::pool_index_t)5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::map<boost::intrusive_ptr<BlueStore::Blob>, unsigned short, std::less<boost::intrusive_ptr<BlueStore::Blob> >, std::allocator<std::pair<boost::intrusive_ptr<BlueStore::Blob> const, unsigned short> > >*, BlueStore::FSCK_ObjectCtx const&)+0x22a) [0x563c5085167a]
 10: (BlueStore::_fsck_check_objects(BlueStore::FSCKDepth, BlueStore::FSCK_ObjectCtx&)+0x1a3d) [0x563c5088225d]
 11: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x1341) [0x563c508865b1]
 12: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x326) [0x563c50891a86]
 13: (main()+0x1274) [0x563c50789314]
 14: (__libc_start_main()+0xf3) [0x7f5fa5d91153]
 15: (_start()+0x2e) [0x563c507ad9ce]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted
Updated by Jamin Collins about 4 years ago
I have a bluefs-export of each OSD. They are a few GB in size; how would you like me to provide them?
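For reference, a BlueFS export like the one mentioned above is typically produced with something along these lines (a sketch; the OSD path and output directory here are placeholders, not the reporter's actual paths):

```shell
# Dump the BlueFS (RocksDB) contents of a stopped OSD to a directory
# for offline inspection. Paths/ids are hypothetical examples.
sudo ceph-bluestore-tool bluefs-export \
    --path /var/lib/ceph/osd/ceph-5 \
    --out-dir /mnt/backup/osd-5-bluefs
```

The OSD must not be running while the tool accesses its data store.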
Updated by Igor Fedotov about 4 years ago
Hi Jamin,
wondering if you ever had v14.2.3 or v14.2.4 installed on these OSDs?
Updated by Jamin Collins about 4 years ago
Igor Fedotov wrote:
Hi Jamin,
wondering if you ever had v14.2.3 or v14.2.4 installed on these OSDs?
I'm sure at one point I did, but they were very recently completely rebuilt under 14.2.6.
I have been completely removing each OSD from the cluster to rebuild it with an adequately sized SSD volume housed in an LVM volume group.
The process I've been using to remove each OSD is:
- stop the OSD and disable the systemd OSD process
- remove the OSD from the crush map (ceph osd crush remove osd.X)
- wait for the cluster to fully recover
- remove the OSD auth (ceph auth del osd.X)
- remove the OSD (ceph osd rm X)
- remove the LVM volumes and groups
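The steps above can be sketched as a command sequence (an illustration, not the reporter's exact script; `X` stands in for the OSD id and the LVM volume/group names are hypothetical placeholders):

```shell
# Sketch of the OSD removal procedure described above.
# OSD id and LVM names are placeholders.
OSD=X

# Stop the OSD and disable its systemd unit.
sudo systemctl stop ceph-osd@${OSD}
sudo systemctl disable ceph-osd@${OSD}

# Remove the OSD from the CRUSH map, then wait until
# `ceph -s` reports the cluster has fully recovered.
ceph osd crush remove osd.${OSD}

# Remove the OSD's auth key and the OSD itself.
ceph auth del osd.${OSD}
ceph osd rm ${OSD}

# Finally, remove the backing LVM logical volume and volume group
# (example names).
sudo lvremove /dev/ceph-vg/osd-block-${OSD}
sudo vgremove ceph-vg
```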
Updated by Jamin Collins about 4 years ago
At the time of failure, both OSDs had been fully removed and rebuilt under 14.2.6. They had been running this way for several days, roughly five judging by the systemd logs:
Jan 22 09:03:04 langhus-1 ceph-osd[1025]: 2020-01-22 09:03:04.853 7f78cb024700 -1 osd.0 45928 set_numa_affinity unable to identify publ>
Jan 27 02:05:22 langhus-1 ceph-osd[1025]: /build/ceph/src/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_s>
Updated by Igor Fedotov about 4 years ago
Thanks, Jamin.
I was thinking this could have been caused by earlier DB corruption that only became visible later (see https://tracker.ceph.com/issues/42223). But this hypothesis doesn't hold if you redeployed the OSDs (and hence cleaned up all persistent data for those specific instances) rather than doing a pure software upgrade.
Updated by Jamin Collins about 4 years ago
This same host has now experienced what looks like similar corruption of its ceph-mon store:
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 2020-01-29 23:32:07.653 7fb83c8da700 -1 rocksdb: submit_common error: Corruption: block check>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x00353836'80538' Value size = 34430)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_v' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: Put( Prefix = p key = 'xos'0x0070656e'ding_pn' Value size = 8)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: /build/ceph/src/ceph-14.2.6/src/mon/MonitorDBStore.h: 324: ceph_abort_msg("failed to write to>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 3: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 4: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 6: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 7: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba9>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 10: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 12: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 13: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: *** Caught signal (Aborted) **
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: in thread 7fb83c8da700 thread_name:ms_dispatch
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 1: (()+0x14930) [0x7fb845024930]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 2: (gsignal()+0x145) [0x7fb844b35f25]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 3: (abort()+0x12b) [0x7fb844b1f897]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 5: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x1226) >
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 6: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x439) [0x556c9c64d419]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 7: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x29b) [0x556c9c65377b]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1113) [0x556c9c565293]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 9: (Monitor::_ms_dispatch(Message*)+0x921) [0x556c9c565e61]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 10: (Monitor::ms_dispatch(Message*)+0x27) [0x556c9c59c7f7]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 11: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x27) [0x556c9c597427]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d8) [0x7fb845ba>
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 13: (DispatchQueue::entry()+0x8f2) [0x7fb845ba7132]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb845c73b4d]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 15: (()+0x94cf) [0x7fb8450194cf]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: 16: (clone()+0x43) [0x7fb844bf92d3]
Jan 29 23:32:07 langhus-1 ceph-mon[1058]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 29 23:32:08 langhus-1 systemd[1]: ceph-mon@langhus-1.service: Main process exited, code=killed, status=6/ABRT
Jan 29 23:32:08 langhus-1 systemd[1]: ceph-mon@langhus-1.service: Failed with result 'signal'.
Jan 29 23:32:18 langhus-1 systemd[1]: ceph-mon@langhus-1.service: Scheduled restart job, restart counter is at 1.
Jan 29 23:32:18 langhus-1 systemd[1]: Stopped Ceph cluster monitor daemon.
Updated by Igor Fedotov about 4 years ago
Just in case - have you checked H/w errors via dmesg?
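For reference, a quick dmesg sweep for storage-related errors might look like this (the pattern list is illustrative, not exhaustive; adjust for your controller/driver):

```shell
# Scan the kernel ring buffer for common storage/hardware error patterns.
dmesg -T | grep -iE 'I/O error|ata[0-9]+\.[0-9]+: (error|failed)|blk_update_request|critical (medium|target) error|uncorrect|mce' \
    || echo "no matching errors found"
```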
Updated by Igor Fedotov about 4 years ago
And are DB devices for OSD and MON different?
Updated by Jamin Collins about 4 years ago
Host details:
$ grep model /proc/cpuinfo | tail -n 1
model name : AMD Ryzen 7 3700X 8-Core Processor
$ sudo nvme list | tail -n1
/dev/nvme0n1  S41GNX0M435108  SAMSUNG MZVLB256HAHQ-000L7  1  255.05 GB / 256.06 GB  512 B + 0 B  1L2QEXD7
$ ls -l /var/lib/ceph/osd/ceph-*/| grep db
lrwxrwxrwx 1 ceph ceph 20 Jan 29 13:20 block.db -> /dev/ceph-db/osd0.db
lrwxrwxrwx 1 ceph ceph 21 Jan 27 13:57 block.db -> /dev/ceph-db/osd10.db
lrwxrwxrwx 1 ceph ceph 20 Jan 29 07:52 block.db -> /dev/ceph-db/osd5.db
$ sudo pvs | grep ceph-db
  /dev/sdc   ceph-db lvm2 a--  931.51g 551.51g
$ sudo hdparm -i /dev/sdc

/dev/sdc:

 Model=Samsung SSD 850 EVO mSATA 1TB, FwRev=32101030, SerialNo=S33FNX0J100209D
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=yes: disabled (255) WriteCache=disabled
 Drive conforms to: Unspecified: ATA/ATAPI-4,5,6,7

$ sudo hdparm -i /dev/sda

/dev/sda:

 Model=ST4000VX007-2DT166, FwRev=CV11, SerialNo=WDH1FSZY
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=7814037168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown: ATA/ATAPI-4,5,6,7

 * signifies the current active mode

$ sudo hdparm -i /dev/sdb

/dev/sdb:

 Model=HGST HDN724040ALE640, FwRev=MJAOA5E0, SerialNo=PK1334PCJ7ZNRS
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=56
 BuffType=DualPortCache, BuffSize=unknown, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=7814037168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode
$ sudo dmidecode -t memory
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x000C, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: None
    Maximum Capacity: 128 GB
    Error Information Handle: 0x000B
    Number Of Devices: 4

Handle 0x0014, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x0013
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Handle 0x0016, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x0015
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 16384 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL A
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3000C16-16GSXFB
    Rank: 2
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V

Handle 0x0019, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x0018
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL B
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Handle 0x001B, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x000C
    Error Information Handle: 0x001A
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 16384 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL B
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3000C16-16GSXFB
    Rank: 2
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
$ sudo dmidecode -t baseboard
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
    Manufacturer: ASRock
    Product Name: B450M Pro4-F
    Version:
    Serial Number: M80-C9009201693
    Asset Tag:
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis:
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0
The OSDs in question (0 and 5) are fronted by the Samsung 850 EVO. The monitor that failed is stored on the Samsung NVME. The OSDs are made by different manufacturers.
Updated by Jamin Collins about 4 years ago
Igor Fedotov wrote:
Just in case - have you checked H/w errors via dmesg?
No hardware error messages in dmesg.
Igor Fedotov wrote:
And are DB devices for OSD and MON different?
Yes, the OSD DB devices and the MON are on different storage devices.
Updated by Igor Fedotov about 4 years ago
I honestly have no clue what happened, so I can only suggest some basic/obvious checks:
1) Run smartctl -a for devices in question
2) Check/share the OSD logs from before the first crash occurrence. Any errors or odd behavior there?
3) Only a single host is currently behaving badly, correct?
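The smartctl check from item (1) can be scripted across all suspect devices; a minimal sketch (the device list is an assumption based on this host's inventory):

```shell
# Dump full SMART data for each device in this ticket and flag any that
# does not self-report as healthy. Adjust the device list for your host.
check_smart() {
    for dev in /dev/sda /dev/sdb /dev/sdc /dev/nvme0n1; do
        echo "=== $dev ==="
        sudo smartctl -a "$dev" || true
        sudo smartctl -H "$dev" 2>/dev/null | grep -qiE 'PASSED|OK' \
            || echo "WARNING: $dev did not report healthy"
    done
}
```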
Having checksum failures at both the OSD and the MON, with different drives behind them, makes me think of hardware or OS issues...
And a side note unrelated to the ticket - consumer-grade SSD drives (like Samsung 850 EVO) are terribly bad for using as a backend for BlueStore DB.
The rationale is the lack of power-loss protection, which causes very inefficient sync write performance. There have been plenty of discussions on the ceph-users mailing list and in some blogs, and I ran into this in my lab too. Hence I suggest considering a replacement sooner rather than later.
Updated by Igor Fedotov about 4 years ago
Another side note: AFAIR, leaving drive write caching enabled has also been reported as bad practice.
Updated by Jamin Collins about 4 years ago
- File ceph-osd.0.log.gz ceph-osd.0.log.gz added
- File ceph-osd.5.log.gz ceph-osd.5.log.gz added
Igor Fedotov wrote:
I definitely have no clue what's happened so can suggest some basic/obvious checks only:
1) Run smartctl -a for devices in question
The OSD devices are a bit older, but both pass the 'smartctl -a' check.
2) Check/share OSD logs prior to the first crash occurrences. Some errors/odd behavior there?
Checked, didn't see anything that jumped out at me, osd.0's log is attached. I had to trim some stuff from the beginning but left a full day before the crash. Similar with osd.5's log, but I had to remove some from the beginning and end to get the file size down (even compressed).
3) Just a single host is currently behave badly, isn't it?
Yes, it's a single new host with the drives migrated to it.
Having checksum failures at both OSD and MON with different drives behind makes me think about H/W or OS issues...
I would agree, but other than the host hardware, the OS is the same load as on the other 4 nodes in the cluster.
And a side note unrelated to the ticket - consumer-grade SSD drives (like Samsung 850 EVO) are terribly bad for using as a backend for BlueStore DB.
The rationale is the lack of power loss protection which causes very inefficient sync write performance.There were plenty of discussions at ceph-users mailing list and in some blogs. I faced that in my lab too. Hence suggest to consider replacement sooner rather than later.
The other four nodes in the cluster all have some form of consumer grade SSD in them, most from less respected manufacturers. The move to the Samsung EVO 850 and NVME drive were both new to the cluster as part of the hardware upgrade on this host. The move to an AMD CPU is also new.
Updated by Igor Fedotov about 4 years ago
These logs (osd-5 specifically) are very interesting!
Let's start with OSD-5. Looking for 'checksum' keyword.
- First occurrence:
2020-01-26 04:22:01.641 7f970ff79700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x6d000, got 0xdaf7cdb7, expected 0xe26480a9, device location [0x1ab27a2d000~1000], logical extent 0x6d000~1000, object #1:b05ad75a:::rbd_data.2b9d9c6b8b4567.000000000007e0bb:head#
It's the main device, not the DB! User data. And just a single checksum failure; the read retry likely returned valid data!
- Next occurrence:
2020-01-26 10:38:44.527 7f971077a700 -1 /build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7f971077a700 time 2020-01-26 10:38:44.518539
/build/ceph/src/ceph-14.2.6/src/kv/RocksDBStore.cc: 1211: ceph_abort_msg("block checksum mismatch: expected 2257455429, got 1374367646 in db/000651.sst offset 51208102 size 3912")
It's not clear which device caused the issue, but the DB is involved. In case of BlueFS spillover the read could go to the main device as well. Note the SST file name: db/000651.sst
OSD managed to restart after that crash.
- Skipping some irrelevant/repeated 'checksum' occurrences, we find:
2020-01-26 23:09:11.356 7fb34e244700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x20000, got 0xcfeb644a, expected 0xdb041ee1, device location [0x202dc0f0000~1000], logical extent 0x1a0000~1000, object #2:66d6f9d9:::rbd_data.5a6c74b0dc51.0000000000049d3f:head#
Again main device, different object and device location.
- the next one is related to DB again:
2020-01-26 23:39:46.214 7fb360268700 3 rocksdb: [db/db_impl_compaction_flush.cc:2659] Compaction error: Corruption: block checksum mismatch: expected 2705794548, got 186875627 in db/000956.sst offset 684976 size 53335
Note different SST file name: db/000956.sst
- And then in the postmortem log one can get more info on the earlier main-device CRC failure:
-4097> 2020-01-26 23:09:11.356 7fb34e244700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x20000, got 0xcfeb644a, expected 0xdb041ee1, device location [0x202dc0f0000~1000], logical extent 0x1a0000~1000, object #2:66d6f9d9:::rbd_data.5a6c74b0dc51.0000000000049d3f:head#
-4096> 2020-01-26 23:09:11.356 7fb34e244700 5 bluestore(/var/lib/ceph/osd/ceph-5) _do_read read at 0x18b000~32000 failed 1 times before succeeding
which says that the read failed only once before succeeding, i.e. the retry was indeed successful!
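Purely as an illustration of the retry idea in that workaround (BlueStore implements this internally in C++, inside _do_read, against raw devices; the sketch below just models the read-verify-retry loop on an ordinary file, with made-up paths and offsets):

```shell
# Re-read a 4 KiB block when its checksum does not match, up to a small
# retry budget. Hypothetical helper; cksum stands in for crc32c here.
verify_read() {
    dev="$1"; offset_blocks="$2"; expected="$3"; tries=3
    for i in $(seq 1 "$tries"); do
        got=$(dd if="$dev" bs=4096 skip="$offset_blocks" count=1 2>/dev/null | cksum | awk '{print $1}')
        if [ "$got" = "$expected" ]; then
            [ "$i" -gt 1 ] && echo "read succeeded after $((i-1)) retries" >&2
            return 0
        fi
    done
    echo "checksum mismatch persisted after $tries reads" >&2
    return 1
}
```

In the log above, the read of 0x18b000~32000 failed once and the retry returned matching data, which is exactly the path this sketch models.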
After that the OSD is unable to start up, failing at db/000956.sst every time.
The above (transient main device read failures!) makes me think that you're finally observing another reincarnation of
https://tracker.ceph.com/issues/22464
https://github.com/ceph/ceph/pull/24649
It's about exactly the same main device failures, presumably caused by high memory pressure (I'm afraid nobody knows for sure). The mentioned patch works around such cases by reattempting failed reads, and it has shown pretty good results so far, including for the main device failures on your OSD-5.
But this patch fixes user data reads ONLY! It doesn't apply to DB data on either the main or the DB device.
And I've been waiting for this issue to reappear for DB data for a while...
Now I presume this has happened. And chances are that RocksDB failed to withstand such a read failure at some point and finally got damaged.
You may want to check the "bluestore_reads_with_retries" performance counter for the other OSDs on this host, if any. A non-zero value would confirm the above analysis.
Also, could you please set debug_bluefs to 20, try to restart the OSD, and collect the fresh log? I'd like to check where the broken SST files lie (i.e. whether there was any spillover to the main device); I'm just curious whether flash drive access might suffer from the same reading issue.
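For an OSD that crashes on startup, one way to capture such a log is a one-off foreground run with the debug level passed on the command line (a sketch; the OSD id and log path are assumptions based on this ticket):

```shell
# Run the failing OSD in the foreground with BlueFS debugging at 20,
# writing to a dedicated log file for later inspection.
debug_restart_osd() {
    id="$1"
    sudo /usr/bin/ceph-osd -f --cluster ceph --id "$id" \
        --setuser ceph --setgroup ceph \
        --debug-bluefs 20 --log-file "/var/log/ceph/ceph-osd.$id.debug.log"
}
# For an OSD that is still running, the level can be injected instead:
#   ceph tell osd.5 injectargs '--debug_bluefs 20'
```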
Updated by Igor Fedotov about 4 years ago
I was about to suggest memory utilization monitoring (including swapping) for this host, but then realized the current state might be completely different since 2 OSDs are dead. Nevertheless, please keep that in mind.
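A minimal sketch of such monitoring (the log path is an assumption; vmstat's si/so columns show swap-in/swap-out traffic and should stay near zero on a healthy OSD host):

```shell
# Append a memory/swap snapshot to a log file once a minute.
monitor_mem() {
    while true; do
        date
        grep -E 'MemAvailable|SwapFree' /proc/meminfo
        vmstat 1 2 | tail -n 1   # second sample reflects current activity
        sleep 60
    done >> /var/log/mem-pressure.log
}
```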
Updated by Igor Fedotov about 4 years ago
As for OSD-0, the provided log shows a single permanent checksum failure all the way through. But it makes sense to check earlier logs for similar checksum failures, as with OSD-5.
And OSD-0 did finally start, so I'm curious what happened for it to succeed?
Updated by Jamin Collins about 4 years ago
Once I got the cluster back to a healthy state, I removed and recreated both osd.0 and osd.5 to fully recover the cluster.
How do I check "bluestore_reads_with_retries" values for the OSDs within the cluster?
Updated by Igor Fedotov about 4 years ago
Run: ceph daemon osd.N perf dump
and look for the keyword in the output
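A sweep over every OSD with a local admin socket might look like this (socket path and counter name as on a default Nautilus install; treat it as a sketch):

```shell
# Report bluestore_reads_with_retries for each OSD admin socket on the host.
reads_with_retries() {
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        [ -e "$sock" ] || continue
        id=$(basename "$sock" | sed -E 's/^ceph-osd\.([0-9]+)\.asok$/\1/')
        val=$(sudo ceph daemon "osd.$id" perf dump 2>/dev/null |
              grep -o '"bluestore_reads_with_retries": *[0-9]*' |
              grep -o '[0-9]*$')
        echo "osd.$id bluestore_reads_with_retries=${val:-unknown}"
    done
}
```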
Updated by Jamin Collins about 4 years ago
Does the "bluestore_reads_with_retries" counter reset with an OSD restart? I'm asking because all three OSDs on the host currently report 0, but all have also been restarted recently.
$ sudo ceph daemon osd.0 perf dump | jq .bluestore.bluestore_reads_with_retries
0
$ sudo ceph daemon osd.5 perf dump | jq .bluestore.bluestore_reads_with_retries
0
$ sudo ceph daemon osd.10 perf dump | jq .bluestore.bluestore_reads_with_retries
0
Also, I presume you want the debug log from a failing OSD, right? If so, I'll gather when one of these fail again.
Updated by Aleksandr Rudenko over 3 years ago
I'm seeing this on 12.2.12
part of OSD log:
-18> 2020-08-12 19:23:44.329010 7f3ca01d4d40 0 filestore(/var/lib/ceph/osd/ceph-323) start omap initiation
-17> 2020-08-12 19:23:44.329090 7f3ca01d4d40 0 set rocksdb option base_background_compactions = 2
-16> 2020-08-12 19:23:44.329105 7f3ca01d4d40 0 set rocksdb option compaction_readahead_size = 2097152
-15> 2020-08-12 19:23:44.329120 7f3ca01d4d40 0 set rocksdb option compression = kNoCompression
-14> 2020-08-12 19:23:44.329131 7f3ca01d4d40 0 set rocksdb option max_background_compactions = 16
-13> 2020-08-12 19:23:44.329138 7f3ca01d4d40 0 set rocksdb option max_write_buffer_number = 4
-12> 2020-08-12 19:23:44.329144 7f3ca01d4d40 0 set rocksdb option min_write_buffer_number_to_merge = 2
-11> 2020-08-12 19:23:44.329190 7f3ca01d4d40 0 set rocksdb option base_background_compactions = 2
-10> 2020-08-12 19:23:44.329199 7f3ca01d4d40 0 set rocksdb option compaction_readahead_size = 2097152
-9> 2020-08-12 19:23:44.329205 7f3ca01d4d40 0 set rocksdb option compression = kNoCompression
-8> 2020-08-12 19:23:44.329210 7f3ca01d4d40 0 set rocksdb option max_background_compactions = 16
-7> 2020-08-12 19:23:44.329215 7f3ca01d4d40 0 set rocksdb option max_write_buffer_number = 4
-6> 2020-08-12 19:23:44.329220 7f3ca01d4d40 0 set rocksdb option min_write_buffer_number_to_merge = 2
-5> 2020-08-12 19:23:47.998200 7f3ca01d4d40 0 filestore(/var/lib/ceph/osd/ceph-323) mount(1759): enabling WRITEAHEAD journal mode: checkpoint is not enabled
-4> 2020-08-12 19:23:48.005244 7f3ca01d4d40 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x00'_')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.acl')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.content_type')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.etag')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.idtag')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.manifest')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.pg_ver')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.source_zone')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.tail_tag')
Delete( Prefix = _ key = 'SER_0000000004298992_AXATTR_'0x005f7573'er.rgw.x-amz-content-sha256')
-3> 2020-08-12 19:23:48.005265 7f3ca01d4d40 -1 filestore(/var/lib/ceph/osd/ceph-323) error (1) Operation not permitted not handled on operation 0x7f3ccdea5042 (22733464.0.1, or op 1, counting from 0)
-2> 2020-08-12 19:23:48.005278 7f3ca01d4d40 0 filestore(/var/lib/ceph/osd/ceph-323) EPERM suggests file(s) in osd data dir not owned by ceph user, or leveldb corruption
-1> 2020-08-12 19:23:48.005282 7f3ca01d4d40 0 filestore(/var/lib/ceph/osd/ceph-323) transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "touch",
"collection": "10.3e50_head",
"oid": "#10:0a7fab97:::default.38952138.358_Veeam%2fArchive%2ftest12%2f12020b78-734e-442c-97a1-e6627ad504c7%2f82f94cc1-8b50-413d-3c35-001c99f3f69d%2fblocks%2fa1e5c4330b80543b875f50ee439ef697%2f13
It's a FileStore OSD.
The OSD's data disk is healthy.
The OSD's journal is on an SSD, which is also healthy.