Bug #51454
Simultaneous OSD crashes in tp_osd_tp on rocksdb::MergingIterator::Next()
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): FileStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Ceph v14.2.15.
The main use case is RGW.
Bucket indexes are on SSD OSDs.
Unfortunately, the majority of the SSD OSDs holding bucket indexes are FileStore.
Today 3 OSDs on different hosts and in different DCs crashed simultaneously, within the same second!
Time of crashes:
osd.4 DC2: 11:04:44.409
osd.737 DC2: 11:04:44.408
osd.890 DC1: 11:04:44.411
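To put a number on "simultaneously", here is a minimal Python sketch that computes the gap between the earliest and latest crash time listed above. The dictionary and the helper name `spread_ms` are only illustrative; the timestamps are copied from this report.

```python
from datetime import datetime, timedelta

# Crash times copied from the list above (all on 2021-06-30).
crash_times = {
    "osd.4": "11:04:44.409",
    "osd.737": "11:04:44.408",
    "osd.890": "11:04:44.411",
}

def spread_ms(times):
    """Return the whole-millisecond gap between the earliest and latest time."""
    parsed = [datetime.strptime(t, "%H:%M:%S.%f") for t in times.values()]
    # Dividing one timedelta by another with // yields an integer count.
    return (max(parsed) - min(parsed)) // timedelta(milliseconds=1)

print(f"crash spread: {spread_ms(crash_times)} ms")
```

By this calculation all three aborts were logged within 3 ms of each other.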
OSDs 737 and 890 are in the same acting set for some PGs; osd.4 is not.
This was a big problem for us because the cluster was under heavy load (as it is every day at this time) and many other OSDs started flapping.
There were a lot of slow ops.
Traces from these OSDs:
osd.4
=====
2021-06-30 11:04:44.409 7f790ae37700 -1 *** Caught signal (Aborted) **
in thread 7f790ae37700 thread_name:tp_osd_tp
ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)
1: (()+0xf630) [0x7f7936b3c630]
2: (()+0x16a08b) [0x7f7935a5e08b]
3: (()+0x10e3bb8) [0x55603cdd6bb8]
4: (rocksdb::MergingIterator::Next()+0x348) [0x55603cdb62d8]
5: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x3d5) [0x55603ccbecc5]
6: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x56c) [0x55603ccc055c]
7: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::lower_bound(std::string const&, std::string const&)+0x44) [0x55603cc32d34]
8: (DBObjectMap::DBObjectMapIteratorImpl::lower_bound(std::string const&)+0x5f) [0x55603c79a8bf]
9: (DBObjectMap::scan(std::shared_ptr<DBObjectMap::_Header>, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x1e9) [0x55603c793189]
10: (DBObjectMap::get_values(ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x9e) [0x55603c794fae]
11: (FileStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x2b9) [0x55603c6268d9]
12: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x10b3) [0x55603c454353]
13: (cls_cxx_map_get_val(void*, std::string const&, ceph::buffer::v14_2_0::list*)+0x284) [0x55603c54bf74]
14: (()+0xaeba7) [0x7f7919f9eba7]
15: (ClassHandler::ClassMethod::exec(void*, ceph::buffer::v14_2_0::list&, ceph::buffer::v14_2_0::list&)+0x34) [0x55603c33a874]
16: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x1637) [0x55603c4548d7]
17: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x13f) [0x55603c465f1f]
18: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x34a) [0x55603c46665a]
19: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x373b) [0x55603c46b2fb]
20: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xcae) [0x55603c46cc7e]
21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55603c2ac5d2]
22: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55603c53b3d2]
23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55603c2c7a4f]
24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55603c87fe56]
25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55603c882970]
26: (()+0x7ea5) [0x7f7936b34ea5]
27: (clone()+0x6d) [0x7f79359f28dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
osd.890
=======
2021-06-30 11:04:37.614 7fe03f748700 0 --1- [v2:10.39.0.2:6878/18173,v1:10.39.0.2:6879/18173] >> v1:192.168.144.106:6982/3864062 conn(0x55ea9fd97c00 0x55ea5fadb000 :-1 s=OPENED pgs=270848 cs=24131 l=0).fault initiating reconnect
2021-06-30 11:04:37.615 7fe03f748700 0 --1- [v2:10.39.0.2:6878/18173,v1:10.39.0.2:6879/18173] >> v1:192.168.144.106:6982/3864062 conn(0x55ea9fd97c00 0x55ea5fadb000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=270848 cs=24132 l=0).handle_connect_reply_2 connect got RESETSESSION
2021-06-30 11:04:44.411 7fe012974700 -1 *** Caught signal (Aborted) **
in thread 7fe012974700 thread_name:tp_osd_tp
ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)
1: (()+0xf630) [0x7fe042e82630]
2: (__pthread_mutex_unlock()+0x2e) [0x7fe042e7df0e]
3: (rocksdb_cache::BinnedLRUCacheShard::Lookup(rocksdb::Slice const&, unsigned int)+0x73) [0x55ea46ca6a03]
4: (()+0x10a2816) [0x55ea46df8816]
5: (rocksdb::BlockBasedTable::GetDataBlockFromCache(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, rocksdb::UncompressionDict const&, unsigned long, bool, rocksdb::GetContext*)+0xdd) [0x55ea46df8d4d]
6: (rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache(rocksdb::FilePrefetchBuffer*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool, rocksdb::GetContext*)+0x1a1) [0x55ea46df9361]
7: (rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*, rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x378) [0x55ea46e06258]
8: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0xc5) [0x55ea46e07615]
9: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward()+0x1c0) [0x55ea46e07a20]
10: (()+0x1038b99) [0x55ea46d8eb99]
11: (rocksdb::MergingIterator::Next()+0x42) [0x55ea46e18fd2]
12: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x3d5) [0x55ea46d21cc5]
13: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x56c) [0x55ea46d2355c]
14: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::lower_bound(std::string const&, std::string const&)+0x44) [0x55ea46c95d34]
15: (DBObjectMap::DBObjectMapIteratorImpl::lower_bound(std::string const&)+0x5f) [0x55ea467fd8bf]
16: (DBObjectMap::scan(std::shared_ptr<DBObjectMap::_Header>, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x1e9) [0x55ea467f6189]
17: (DBObjectMap::get_values(ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x9e) [0x55ea467f7fae]
18: (FileStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x2b9) [0x55ea466898d9]
19: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x10b3) [0x55ea464b7353]
20: (cls_cxx_map_get_val(void*, std::string const&, ceph::buffer::v14_2_0::list*)+0x284) [0x55ea465aef74]
21: (()+0xaeba7) [0x7fe0262e4ba7]
22: (ClassHandler::ClassMethod::exec(void*, ceph::buffer::v14_2_0::list&, ceph::buffer::v14_2_0::list&)+0x34) [0x55ea4639d874]
23: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x1637) [0x55ea464b78d7]
24: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x13f) [0x55ea464c8f1f]
25: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x34a) [0x55ea464c965a]
26: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x373b) [0x55ea464ce2fb]
27: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xcae) [0x55ea464cfc7e]
28: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55ea4630f5d2]
29: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55ea4659e3d2]
30: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55ea4632aa4f]
31: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55ea468e2e56]
32: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ea468e5970]
33: (()+0x7ea5) [0x7fe042e7aea5]
34: (clone()+0x6d) [0x7fe041d388dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
osd.737
=======
2021-06-30 11:04:44.408 7fbe97fa0700 -1 *** Caught signal (Aborted) **
in thread 7fbe97fa0700 thread_name:tp_osd_tp
ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)
1: (()+0xf630) [0x7fbec54a8630]
2: (rocksdb::MergingIterator::Next()+0x786) [0x55a6f66b7716]
3: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x966) [0x55a6f65c0256]
4: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x56c) [0x55a6f65c155c]
5: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::lower_bound(std::string const&, std::string const&)+0x44) [0x55a6f6533d34]
6: (DBObjectMap::DBObjectMapIteratorImpl::lower_bound(std::string const&)+0x5f) [0x55a6f609b8bf]
7: (DBObjectMap::scan(std::shared_ptr<DBObjectMap::_Header>, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x1e9) [0x55a6f6094189]
8: (DBObjectMap::get_values(ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x9e) [0x55a6f6095fae]
9: (FileStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >*)+0x2b9) [0x55a6f5f278d9]
10: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x10b3) [0x55a6f5d55353]
11: (cls_cxx_map_get_val(void*, std::string const&, ceph::buffer::v14_2_0::list*)+0x284) [0x55a6f5e4cf74]
12: (()+0xaeba7) [0x7fbea890aba7]
13: (ClassHandler::ClassMethod::exec(void*, ceph::buffer::v14_2_0::list&, ceph::buffer::v14_2_0::list&)+0x34) [0x55a6f5c3b874]
14: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x1637) [0x55a6f5d558d7]
15: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x13f) [0x55a6f5d66f1f]
16: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x34a) [0x55a6f5d6765a]
17: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x373b) [0x55a6f5d6c2fb]
18: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xcae) [0x55a6f5d6dc7e]
19: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55a6f5bad5d2]
20: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55a6f5e3c3d2]
21: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55a6f5bc8a4f]
22: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a6f6180e56]
23: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a6f6183970]
24: (()+0x7ea5) [0x7fbec54a0ea5]
25: (clone()+0x6d) [0x7fbec435e8dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
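As a rough aid for comparing the three traces, the Python sketch below extracts the resolved function names from each backtrace and lists the ones that appear in all of them. The helper names (`frame_symbols`, `common_symbols`) are hypothetical, and the sample input is only a short excerpt of frames copied from the traces above, not the full backtraces.

```python
import re

# Matches frames such as:
#   4: (rocksdb::MergingIterator::Next()+0x348) [0x55603cdb62d8]
FRAME_RE = re.compile(r"^\s*\d+: \((.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]\s*$")

def frame_symbols(trace_text):
    """Return function names (without argument lists) in frame order."""
    symbols = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue
        symbol = match.group(1)
        if symbol == "()":
            continue  # unresolved frame, no symbol name available
        symbols.append(symbol.split("(", 1)[0])
    return symbols

def common_symbols(traces):
    """Return symbols present in every trace, in the order of the first trace."""
    lists = [frame_symbols(t) for t in traces]
    shared = set(lists[0]).intersection(*lists[1:])
    return [s for s in dict.fromkeys(lists[0]) if s in shared]

# Excerpts only: a few frames copied from each trace above.
osd_4 = """
3: (()+0x10e3bb8) [0x55603cdd6bb8]
4: (rocksdb::MergingIterator::Next()+0x348) [0x55603cdb62d8]
5: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x3d5) [0x55603ccbecc5]
6: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x56c) [0x55603ccc055c]
"""
osd_890 = """
2: (__pthread_mutex_unlock()+0x2e) [0x7fe042e7df0e]
11: (rocksdb::MergingIterator::Next()+0x42) [0x55ea46e18fd2]
12: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x3d5) [0x55ea46d21cc5]
13: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x56c) [0x55ea46d2355c]
"""
osd_737 = """
2: (rocksdb::MergingIterator::Next()+0x786) [0x55a6f66b7716]
3: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x966) [0x55a6f65c0256]
4: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x56c) [0x55a6f65c155c]
"""

for name in common_symbols([osd_4, osd_890, osd_737]):
    print(name)
```

Run against the full traces, the same approach should show whether all three OSDs were on the same FileStore omap lookup path (FileStore::omap_get_values through DBObjectMap::scan into the RocksDB iterator) when they aborted.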
There were no problems with the hardware (disks, network); we checked carefully.
What was it? How can 3 OSDs crash at the same moment?