Bug #42166
crash when LRU trimming
Status: Closed
Description
Testing xfstests on kcephfs vs. a vstart cluster, the OSD crashed with this:
ceph version v15.0.0-5742-ge565e31184c0 (e565e31184c0ffd18e269c1ee0b7ee88dc696f56) octopus (dev)
 1: (()+0x12c60) [0x7f5cb69a3c60]
 2: (gsignal()+0x145) [0x7f5cb6442e35]
 3: (abort()+0x127) [0x7f5cb642d895]
 4: (()+0x18aa2) [0x7f5cb6cc6aa2]
 5: (()+0x1a449) [0x7f5cb6cc8449]
 6: (std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x88) [0x56273cda08c8]
 7: (LruOnodeCacheShard::_trim_to(unsigned long)+0x242) [0x56273cda4292]
 8: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x19d) [0x56273ccefdad]
 9: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x62b) [0x56273cd2ee9b]
 10: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1d58) [0x56273cd660a8]
 11: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x275) [0x56273cd67185]
 12: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x56273c8698a3]
 13: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x1f2) [0x56273c81ef72]
 14: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x208) [0x56273c82b698]
 15: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x56273caa6c02]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xe6c) [0x56273c82cc6c]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x56273cec5923]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56273cec8680]
 19: (()+0x84c0) [0x7f5cb69994c0]
 20: (clone()+0x43) [0x7f5cb6507553]
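For context, frames 6-8 show the onode LRU shard erasing a hashtable entry while trimming down to capacity from inside an add. A minimal sketch of that trim-on-insert pattern (hypothetical names and types, not the actual BlueStore code):

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

// Minimal LRU cache: add() inserts, then trims the least-recently-used
// entries back down to a fixed capacity -- the same shape as
// OnodeSpace::add() -> LruOnodeCacheShard::_trim_to() in the backtrace.
class LruCache {
public:
  explicit LruCache(size_t capacity) : capacity_(capacity) {}

  void add(const std::string& key, int value) {
    auto it = map_.find(key);
    if (it != map_.end()) {
      // Refresh an existing entry: update it and move it to the hot end.
      it->second.second = value;
      lru_.splice(lru_.begin(), lru_, it->second.first);
      return;
    }
    lru_.push_front(key);
    map_.emplace(key, std::make_pair(lru_.begin(), value));
    trim_to(capacity_);  // corresponds to frame 7 in the backtrace
  }

  bool contains(const std::string& key) const { return map_.count(key) != 0; }
  size_t size() const { return map_.size(); }

private:
  void trim_to(size_t max) {
    // Evict from the cold end until we are back under the cap; the
    // hashtable erase here is where frame 6 (_M_erase) sits.
    while (map_.size() > max) {
      map_.erase(lru_.back());
      lru_.pop_back();
    }
  }

  size_t capacity_;
  std::list<std::string> lru_;  // front = most recently used
  std::unordered_map<std::string,
                     std::pair<std::list<std::string>::iterator, int>> map_;
};
```

In a correct implementation like this the erase itself is safe; a crash inside `_M_erase` generally points to the table having been corrupted earlier (a dangling entry, a double erase, or unsynchronized concurrent modification) rather than to the trim logic itself.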
This Ceph build is based on 6bafc61e8d7a75733974db87d2af3203f0a3ceb1, plus a pile of experimental MDS patches (nothing that should affect OSD operation). The OSD log is attached. Unfortunately, I don't have a core.
Updated by Josh Durgin over 4 years ago
- Status changed from New to Need More Info
Jeff, do you happen to still have a coredump from this?
Updated by Igor Fedotov over 4 years ago
Just to note, the OSD log contains multiple odd checksum-verification failures from RocksDB, e.g.:
2019-10-02T11:44:22.035-0400 7f5ca9ae3700 3 rocksdb: [table/block_based_table_reader.cc:1113] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2326703815 in db/000020.sst offset 18446744073709551615 size 18446744073709551615
They don't look critical and don't result in any visible misbehavior. Not sure what this means, though...
Updated by Igor Fedotov about 1 year ago
- Status changed from Need More Info to Closed
Apparently fixed by https://tracker.ceph.com/issues/56382.