Project

General

Profile

Actions

Bug #37090

closed

BlueStore.cc: 3099: FAILED assert(0 == "uh oh, missing shared_blob")

Added by Kjetil Joergensen over 5 years ago. Updated over 5 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Possibly a duplicate of #36303

What is slightly interesting, after setting the osd out and migrating off of it, it still claims to have one pg.

{
    "cluster_fsid": "8385cdd7-977d-41eb-9810-c1720580fe1b",
    "osd_fsid": "722f4440-13d2-4b18-93e7-b644e01e4ca8",
    "whoami": 901,
    "state": "active",
    "oldest_map": 1607919,
    "newest_map": 1608517,
    "num_pgs": 1
}
<pre>

The clustermap on the other hand seems to think it does not hold any PGs (ceph pg ls-by-osd 901 or for that matter ceph pg map 34.958).

Going by dump_pgstate_history this is the (shard? of) the PG it failed the assert on.
<pre>
{
    "pg": "34.958s2",
    "history": [
        {
            "epoch": "1608511",
            "state": "Initial",
            "enter": "2018-11-13 11:22:43.986455",
            "exit": "2018-11-13 11:22:47.022097",
            "state": "Reset",
            "enter": "2018-11-13 11:22:47.022098",
            "exit": "2018-11-13 11:22:47.028721" 
        },
        {
            "epoch": "1608512",
            "state": "Start",
            "enter": "2018-11-13 11:22:47.028729",
            "exit": "2018-11-13 11:22:47.028748",
            "state": "Started/Stray",
            "enter": "2018-11-13 11:22:47.028748",
            "exit": "2018-11-13 11:22:48.195070",
            "state": "Started",
            "enter": "2018-11-13 11:22:47.028722",
            "exit": "2018-11-13 11:22:48.195079",
            "state": "Reset",
            "enter": "2018-11-13 11:22:48.195080",
            "exit": "2018-11-13 11:22:48.195466" 
        },
        {
            "epoch": "1608514",
            "state": "Start",
            "enter": "2018-11-13 11:22:48.195474",
            "exit": "2018-11-13 11:22:48.195494" 
        }
    ]
}
<pre>

<pre>
2018-11-12 13:17:54.768375 7fb31a937700 -1 bluestore(/var/lib/ceph/osd/ceph-901).collection(34.958s2_head 0x5569008c3000) load_shared_blob sbid 0x5579b888ea80 not found at key 0x00005579b888ea80
2018-11-12 13:17:54.773521 7fb31a937700 -1 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::Collection::load_shared_blob(BlueStore::SharedBlobRef)' thread 7fb31a937700 time 2018-11-12 13:17:54.768383
/build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 3099: FAILED assert(0 == "uh oh, missing shared_blob")

 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5568ebda53b2]
 2: (BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0x3fb) [0x5568ebbfc1eb]
 3: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x10a3) [0x5568ebc2d0f3]
 4: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x2e2) [0x5568ebc47682]
 5: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xc5) [0x5568ebc47e75]
 6: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x7b) [0x5568ebc4995b]
 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1bbc) [0x5568ebc56e2c]
 8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x52e) [0x5568ebc57f2e]
 9: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x17c) [0x5568eb7eaadc]
 10: (PrimaryLogPG::queue_transaction(ObjectStore::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x58) [0x5568eb986b38]
 11: (ECBackend::dispatch_recovery_messages(RecoveryMessages&, int)+0xbb7) [0x5568ebac8607]
 12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x25d) [0x5568ebad9efd]
 13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x5568eb9b5f00]
 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x5568eb918ec3]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x5568eb789999]
 16: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x5568eba3b577]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1047) [0x5568eb7b7db7]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5568ebdaa1a4]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5568ebdad1e0]
 20: (()+0x76ba) [0x7fb3362846ba]
 21: (clone()+0x6d) [0x7fb3352f741d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>

Files

ceph-osd.901.log.stub.xz (887 KB) ceph-osd.901.log.stub.xz Kjetil Joergensen, 11/13/2018 07:48 PM

Related issues 1 (0 open1 closed)

Related to bluestore - Bug #36526: segv in BlueStore::OldExtent::createResolved10/18/2018

Actions
Actions #1

Updated by Kjetil Joergensen over 5 years ago

Part of the osd log, should incude the first crash and maybe a couple of the subsequent ones, to make it fit within the 1MB attachment size.

Actions #2

Updated by Igor Fedotov over 5 years ago

Could you please run fsck on this OSD with "debug bluestore" set to 20 and share the log?

Actions #3

Updated by Kjetil Joergensen over 5 years ago

Log posted with ceph-upload-file: fbc90b08-887d-40b9-99b9-0a843465a313

Console output below

# ceph-bluestore-tool fsck --log-level 20 --log-file /root/ceph-osd-901-fsck.log --path /var/lib/ceph/osd/ceph-901/
2018-11-14 09:54:57.263277 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: 2#34:1a925498:::rbd_data.2.b2587c6b8b4567.0000000000000836:head# blob blob([!~8000,0x1da87124000~4000] csum+has_unused+shared crc32c/0x1000 unused=0x3ff) sbid 93981275384448 > blobid_max 668824910
2018-11-14 09:55:10.543883 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: found stray shared blob data for sbid 0x27d97139
2018-11-14 09:55:10.544820 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: shared_blob 0x93981275384448 key is missing (SharedBlob(0x5617ae1c3a30 sbid 0x5579b888ea80))
2018-11-14 09:55:10.544853 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: actual store_statfs(0x37d2a038000/0x37e37d51000, stored 0x8943b506/0x8c114000, compress 0x0/0x0/0x0) != expected store_statfs(0x37d2a038000/0x37e37d51000, stored 0x8943b506/0x8c110000, compress 0x0/0x0/0x0)
2018-11-14 09:55:31.050676 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) fsck error: leaked extent 0x1da87124000~4000
/build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_flush_cache()' thread 7fdfcc69cf80 time 2018-11-14 09:55:31.199417
/build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 11879: FAILED assert(p.second->shared_blob_set.empty())
 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fdfc3132632]
 2: (BlueStore::_flush_cache()+0x717) [0x56174b383967]
 3: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5]
 4: (main()+0xec8) [0x56174b28d278]
 5: (__libc_start_main()+0xf0) [0x7fdfc18de830]
 6: (_start()+0x29) [0x56174b3117f9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-11-14 09:55:31.199406 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) _flush_cache stray shared blobs on 34.958s2_head
2018-11-14 09:55:31.200077 7fdfcc69cf80 -1 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_flush_cache()' thread 7fdfcc69cf80 time 2018-11-14 09:55:31.199417
/build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 11879: FAILED assert(p.second->shared_blob_set.empty())

 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fdfc3132632]
 2: (BlueStore::_flush_cache()+0x717) [0x56174b383967]
 3: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5]
 4: (main()+0xec8) [0x56174b28d278]
 5: (__libc_start_main()+0xf0) [0x7fdfc18de830]
 6: (_start()+0x29) [0x56174b3117f9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -2> 2018-11-14 09:55:31.199406 7fdfcc69cf80 -1 bluestore(/var/lib/ceph/osd/ceph-901/) _flush_cache stray shared blobs on 34.958s2_head
     0> 2018-11-14 09:55:31.200077 7fdfcc69cf80 -1 /build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_flush_cache()' thread 7fdfcc69cf80 time 2018-11-14 09:55:31.199417
/build/ceph-12.2.9/src/os/bluestore/BlueStore.cc: 11879: FAILED assert(p.second->shared_blob_set.empty())

 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fdfc3132632]
 2: (BlueStore::_flush_cache()+0x717) [0x56174b383967]
 3: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5]
 4: (main()+0xec8) [0x56174b28d278]
 5: (__libc_start_main()+0xf0) [0x7fdfc18de830]
 6: (_start()+0x29) [0x56174b3117f9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7fdfcc69cf80 thread_name:ceph-bluestore-
 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (()+0x42e0e4) [0x56174b57d0e4]
 2: (()+0x11390) [0x7fdfc294a390]
 3: (gsignal()+0x38) [0x7fdfc18f3428]
 4: (abort()+0x16a) [0x7fdfc18f502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7fdfc31327be]
 6: (BlueStore::_flush_cache()+0x717) [0x56174b383967]
 7: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5]
 8: (main()+0xec8) [0x56174b28d278]
 9: (__libc_start_main()+0xf0) [0x7fdfc18de830]
 10: (_start()+0x29) [0x56174b3117f9]
2018-11-14 09:55:31.204200 7fdfcc69cf80 -1 *** Caught signal (Aborted) **
 in thread 7fdfcc69cf80 thread_name:ceph-bluestore-

 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (()+0x42e0e4) [0x56174b57d0e4]
 2: (()+0x11390) [0x7fdfc294a390]
 3: (gsignal()+0x38) [0x7fdfc18f3428]
 4: (abort()+0x16a) [0x7fdfc18f502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7fdfc31327be]
 6: (BlueStore::_flush_cache()+0x717) [0x56174b383967]
 7: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5]
 8: (main()+0xec8) [0x56174b28d278]
 9: (__libc_start_main()+0xf0) [0x7fdfc18de830]
 10: (_start()+0x29) [0x56174b3117f9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2018-11-14 09:55:31.204200 7fdfcc69cf80 -1 *** Caught signal (Aborted) **
 in thread 7fdfcc69cf80 thread_name:ceph-bluestore-

 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)
 1: (()+0x42e0e4) [0x56174b57d0e4]
 2: (()+0x11390) [0x7fdfc294a390]
 3: (gsignal()+0x38) [0x7fdfc18f3428]
 4: (abort()+0x16a) [0x7fdfc18f502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7fdfc31327be]
 6: (BlueStore::_flush_cache()+0x717) [0x56174b383967]
 7: (BlueStore::_fsck(bool, bool)+0x5605) [0x56174b3d0aa5]
 8: (main()+0xec8) [0x56174b28d278]
 9: (__libc_start_main()+0xf0) [0x7fdfc18de830]
 10: (_start()+0x29) [0x56174b3117f9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #4

Updated by Kjetil Joergensen over 5 years ago

Ok - I think you can close this one. This is in all likelihood a hardware error of some sort, on the same machine I have had more identical failures, same assert, just for different PGs/shards.

Actions #5

Updated by Kjetil Joergensen over 5 years ago

Kjetil Joergensen wrote:

Ok - I think you can close this one. This is in all likelihood a hardware error of some sort, on the same machine I have had more identical failures, same assert, just for different PGs/shards.

Actually - I guess there's one way I could make a contrived case for "bug in ceph", there's something with the combined set of hardware on that machine, which makes some interaction with hardware (io, memory access, cpu) slightly slower/faster uncovering some race-condition in ceph.

Actions #6

Updated by Kjetil Joergensen over 5 years ago

Kjetil Joergensen wrote:

Kjetil Joergensen wrote:

Ok - I think you can close this one. This is in all likelihood a hardware error of some sort, on the same machine I have had more identical failures, same assert, just for different PGs/shards.

Actually - I guess there's one way I could make a contrived case for "bug in ceph", there's something with the combined set of hardware on that machine, which makes some interaction with hardware (io, memory access, cpu) slightly slower/faster uncovering some race-condition in ceph.

I may have convinced myself that the latter might be less contrived - while I'll readily admit that it's perfectly possible that we flip bits in such a way that it's undetectable by ECC and transfer CRCs., I'd have expected failures to be less uniform, same assert and same stacktrace, just different PGs.

Actions #7

Updated by Sage Weil over 5 years ago

  • Status changed from New to Can't reproduce

I have a feeling this is caused by http://tracker.ceph.com/issues/36526, the fix for which is in 12.2.10.

Actions #8

Updated by Sage Weil over 5 years ago

  • Related to Bug #36526: segv in BlueStore::OldExtent::create added
Actions

Also available in: Atom PDF