Bug #37090
closedBlueStore.cc: 3099: FAILED assert(0 == "uh oh, missing shared_blob")
0%
Description
Possibly a duplicate of #36303
What is slightly interesting, after setting the osd out and migrating off of it, it still claims to have one pg.
{ "cluster_fsid": "8385cdd7-977d-41eb-9810-c1720580fe1b", "osd_fsid": "722f4440-13d2-4b18-93e7-b644e01e4ca8", "whoami": 901, "state": "active", "oldest_map": 1607919, "newest_map": 1608517, "num_pgs": 1 } <pre> The clustermap on the other hand seems to think it does not hold any PGs (ceph pg ls-by-osd 901 or for that matter ceph pg map 34.958). Going by dump_pgstate_history this is the (shard? of) the PG it failed the assert on. <pre> { "pg": "34.958s2", "history": [ { "epoch": "1608511", "state": "Initial", "enter": "2018-11-13 11:22:43.986455", "exit": "2018-11-13 11:22:47.022097", "state": "Reset", "enter": "2018-11-13 11:22:47.022098", "exit": "2018-11-13 11:22:47.028721" }, { "epoch": "1608512", "state": "Start", "enter": "2018-11-13 11:22:47.028729", "exit": "2018-11-13 11:22:47.028748", "state": "Started/Stray", "enter": "2018-11-13 11:22:47.028748", "exit": "2018-11-13 11:22:48.195070", "state": "Started", "enter": "2018-11-13 11:22:47.028722", "exit": "2018-11-13 11:22:48.195079", "state": "Reset", "enter": "2018-11-13 11:22:48.195080", "exit": "2018-11-13 11:22:48.195466" }, { "epoch": "1608514", "state": "Start", "enter": "2018-11-13 11:22:48.195474", "exit": "2018-11-13 11:22:48.195494" } ] } <pre> <pre> 2018-11-12 13:17:54.768375 7fb31a937700 -1 bluestore(/var/lib/ceph/osd/ceph-901).collection(34.958s2_head 0x5569008c3000) load_shared_blob sbid 0x5579b888ea80 not found at key 0x00005579b888ea80 2018-11-12 13:17:54.773521 7fb31a937700 -1 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::Collection::load_shared_blob(BlueStore::SharedBlobRef)' thread 7fb31a937700 time 2018-11-12 13:17:54.768383 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 3099: FAILED assert(0 == "uh oh, missing shared_blob") ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5568ebda53b2] 2: (BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0x3fb) [0x5568ebbfc1eb] 3: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x10a3) [0x5568ebc2d0f3] 4: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x2e2) [0x5568ebc47682] 5: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xc5) [0x5568ebc47e75] 6: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x7b) [0x5568ebc4995b] 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1bbc) [0x5568ebc56e2c] 8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x52e) [0x5568ebc57f2e] 9: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x17c) [0x5568eb7eaadc] 10: (PrimaryLogPG::queue_transaction(ObjectStore::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x58) [0x5568eb986b38] 11: (ECBackend::dispatch_recovery_messages(RecoveryMessages&, int)+0xbb7) [0x5568ebac8607] 12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x25d) [0x5568ebad9efd] 13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x5568eb9b5f00] 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x5568eb918ec3] 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x5568eb789999] 16: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x5568eba3b577] 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1047) [0x5568eb7b7db7] 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5568ebdaa1a4] 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5568ebdad1e0] 20: (()+0x76ba) [0x7fb3362846ba] 21: (clone()+0x6d) [0x7fb3352f741d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. </pre>
Files
Updated by Kjetil Joergensen over 5 years ago
- File ceph-osd.901.log.stub.xz ceph-osd.901.log.stub.xz added
Part of the osd log, should incude the first crash and maybe a couple of the subsequent ones, to make it fit within the 1MB attachment size.
Updated by Igor Fedotov over 5 years ago
Could you please run fsck on this OSD with "debug bluestore" set to 20 and share the log?
Updated by Kjetil Joergensen over 5 years ago
Log posted with ceph-upload-file: fbc90b08-887d-40b9-99b9-0a843465a313
Console output below
# ceph-bluestore-tool fsck --log-level 20 --log-file /root/ceph-osd-901-fsck.log --path /var/lib/ceph/osd/ceph-901/ 2018-11-14 09:54:57.263277 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: 2#34:1a925498:::rbd_data.2.b2587c6b8b4567.0000000000000836:head# blob blob([!~8000,0x1da87124000~4000] csum+has_unused+shared crc32c/0x1000 unused=0x3ff) sbid 93981275384448 > blobid_max 668824910 2018-11-14 09:55:10.543883 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: found stray shared blob data for sbid 0x27d97139 2018-11-14 09:55:10.544820 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: shared_blob 0x93981275384448 key is missing (SharedBlob(0x5617ae1c3a30 sbid 0x5579b888ea80)) 2018-11-14 09:55:10.544853 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: actual store_statfs(0x37d2a038000/0x37e37d51000, stored 0x8943b506/0x8c114000, compress 0x0/0x0/0x0) != expected store_statfs(0x37d2a038000/0x37e37d51000, stored 0x8943b506/0x8c110000, compress 0x0/0x0/0x0) 2018-11-14 09:55:31.050676 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: leaked extent 0x1da87124000~4000 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_flush_cache()' thread 7fdfcc69cf80 time 2018-11-14 09:55:31.199417 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 11879: FAILED assert(p.second->shared_blob_set.empty()) ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fdfc3132632] 2: (BlueStore::_flush_cache()+0x717) [0x56174b383967] 3: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5] 4: (main()+0xec8) [0x56174b28d278] 5: (__libc_start_main()+0xf0) [0x7fdfc18de830] 6: (_start()+0x29) [0x56174b3117f9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2018-11-14 09:55:31.199406 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) _flush_cache stray shared blobs on 34.958s2_head 2018-11-14 09:55:31.200077 7fdfcc69cf80 -1 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_flush_cache()' thread 7fdfcc69cf80 time 2018-11-14 09:55:31.199417 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 11879: FAILED assert(p.second->shared_blob_set.empty()) ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fdfc3132632] 2: (BlueStore::_flush_cache()+0x717) [0x56174b383967] 3: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5] 4: (main()+0xec8) [0x56174b28d278] 5: (__libc_start_main()+0xf0) [0x7fdfc18de830] 6: (_start()+0x29) [0x56174b3117f9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. -2> 2018-11-14 09:55:31.199406 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) _flush_cache stray shared blobs on 34.958s2_head 0> 2018-11-14 09:55:31.200077 7fdfcc69cf80 -1 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_flush_cache()' thread 7fdfcc69cf80 time 2018-11-14 09:55:31.199417 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 11879: FAILED assert(p.second->shared_blob_set.empty()) ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fdfc3132632] 2: (BlueStore::_flush_cache()+0x717) [0x56174b383967] 3: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5] 4: (main()+0xec8) [0x56174b28d278] 5: (__libc_start_main()+0xf0) [0x7fdfc18de830] 6: (_start()+0x29) [0x56174b3117f9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. *** Caught signal (Aborted) ** in thread 7fdfcc69cf80 thread_name:ceph-bluestore- ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable) 1: (()+0x42e0e4) [0x56174b57d0e4] 2: (()+0x11390) [0x7fdfc294a390] 3: (gsignal()+0x38) [0x7fdfc18f3428] 4: (abort()+0x16a) [0x7fdfc18f502a] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7fdfc31327be] 6: (BlueStore::_flush_cache()+0x717) [0x56174b383967] 7: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5] 8: (main()+0xec8) [0x56174b28d278] 9: (__libc_start_main()+0xf0) [0x7fdfc18de830] 10: (_start()+0x29) [0x56174b3117f9] 2018-11-14 09:55:31.204200 7fdfcc69cf80 -1 *** Caught signal (Aborted) ** in thread 7fdfcc69cf80 thread_name:ceph-bluestore- ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable) 1: (()+0x42e0e4) [0x56174b57d0e4] 2: (()+0x11390) [0x7fdfc294a390] 3: (gsignal()+0x38) [0x7fdfc18f3428] 4: (abort()+0x16a) [0x7fdfc18f502a] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7fdfc31327be] 6: (BlueStore::_flush_cache()+0x717) [0x56174b383967] 7: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5] 8: (main()+0xec8) [0x56174b28d278] 9: (__libc_start_main()+0xf0) [0x7fdfc18de830] 10: (_start()+0x29) [0x56174b3117f9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 0> 2018-11-14 09:55:31.204200 7fdfcc69cf80 -1 *** Caught signal (Aborted) ** in thread 7fdfcc69cf80 thread_name:ceph-bluestore- ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable) 1: (()+0x42e0e4) [0x56174b57d0e4] 2: (()+0x11390) [0x7fdfc294a390] 3: (gsignal()+0x38) [0x7fdfc18f3428] 4: (abort()+0x16a) [0x7fdfc18f502a] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7fdfc31327be] 6: (BlueStore::_flush_cache()+0x717) [0x56174b383967] 7: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5] 8: (main()+0xec8) [0x56174b28d278] 9: (__libc_start_main()+0xf0) [0x7fdfc18de830] 10: (_start()+0x29) [0x56174b3117f9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Kjetil Joergensen over 5 years ago
Ok - I think you can close this one. This is in all likelihood a hardware error of some sort, on the same machine I have had more identical failures, same assert, just for different PGs/shards.
Updated by Kjetil Joergensen over 5 years ago
Kjetil Joergensen wrote:
Ok - I think you can close this one. This is in all likelihood a hardware error of some sort, on the same machine I have had more identical failures, same assert, just for different PGs/shards.
Actually - I guess there's one way I could make a contrived case for "bug in ceph", there's something with the combined set of hardware on that machine, which makes some interaction with hardware (io, memory access, cpu) slightly slower/faster uncovering some race-condition in ceph.
Updated by Kjetil Joergensen over 5 years ago
Kjetil Joergensen wrote:
Kjetil Joergensen wrote:
Ok - I think you can close this one. This is in all likelihood a hardware error of some sort, on the same machine I have had more identical failures, same assert, just for different PGs/shards.
Actually - I guess there's one way I could make a contrived case for "bug in ceph", there's something with the combined set of hardware on that machine, which makes some interaction with hardware (io, memory access, cpu) slightly slower/faster uncovering some race-condition in ceph.
I may have convinced myself that the latter might be less contrived - while I'll readily admit that it's perfectly possible that we flip bits in such a way that it's undetectable by ECC and transfer CRCs., I'd have expected failures to be less uniform, same assert and same stacktrace, just different PGs.
Updated by Sage Weil over 5 years ago
- Status changed from New to Can't reproduce
I have a feeling this is caused by http://tracker.ceph.com/issues/36526, the fix for which is in 12.2.10.
Updated by Sage Weil over 5 years ago
- Related to Bug #36526: segv in BlueStore::OldExtent::create added