Bug #42166 (Closed): crash when LRU trimming

Added by Jeff Layton over 4 years ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While running xfstests on kcephfs against a vstart cluster, the OSD crashed with the following backtrace:

 ceph version v15.0.0-5742-ge565e31184c0 (e565e31184c0ffd18e269c1ee0b7ee88dc696f56) octopus (dev)
 1: (()+0x12c60) [0x7f5cb69a3c60]
 2: (gsignal()+0x145) [0x7f5cb6442e35]
 3: (abort()+0x127) [0x7f5cb642d895]
 4: (()+0x18aa2) [0x7f5cb6cc6aa2]
 5: (()+0x1a449) [0x7f5cb6cc8449]
 6: (std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x88) [0x56273cda08c8]
 7: (LruOnodeCacheShard::_trim_to(unsigned long)+0x242) [0x56273cda4292]
 8: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x19d) [0x56273ccefdad]
 9: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x62b) [0x56273cd2ee9b]
 10: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1d58) [0x56273cd660a8]
 11: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x275) [0x56273cd67185]
 12: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x56273c8698a3]
 13: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x1f2) [0x56273c81ef72]
 14: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x208) [0x56273c82b698]
 15: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x56273caa6c02]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xe6c) [0x56273c82cc6c]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x56273cec5923]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56273cec8680]
 19: (()+0x84c0) [0x7f5cb69994c0]
 20: (clone()+0x43) [0x7f5cb6507553]

This Ceph build is based on 6bafc61e8d7a75733974db87d2af3203f0a3ceb1, plus a pile of experimental MDS patches (nothing that should affect OSD operation). The OSD log is attached. Unfortunately, I don't have a core dump.


Files

osd.0.log.gz (249 KB) osd.0.log.gz Jeff Layton, 10/02/2019 07:02 PM
ceph.conf (4.45 KB) ceph.conf generated by vstart.sh Jeff Layton, 10/02/2019 07:11 PM
Actions #1

Updated by Jeff Layton over 4 years ago

Build was done on Fedora 30.

Actions #2

Updated by Jeff Layton over 4 years ago

Actions #3

Updated by Josh Durgin over 4 years ago

  • Status changed from New to Need More Info

Jeff do you happen to still have a coredump from this?

Actions #4

Updated by Jeff Layton over 4 years ago

I'm afraid not.

Actions #5

Updated by Igor Fedotov over 4 years ago

Just to note, the OSD log contains multiple odd checksum-verification failures from RocksDB, e.g.:

2019-10-02T11:44:22.035-0400 7f5ca9ae3700 3 rocksdb: [table/block_based_table_reader.cc:1113] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2326703815 in db/000020.sst offset 18446744073709551615 size 18446744073709551615

They don't look critical and don't result in any visible misbehavior, but I'm not sure what they mean.

Actions #6

Updated by Igor Fedotov about 1 year ago

  • Status changed from Need More Info to Closed