Bug #58274
BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration
Description
We've observed very high latencies in BlueStore::_collection_list() operations during PG removal.
On one cluster, these latencies have gotten so bad that they regularly exceed the OSD suicide timeout (150s) and cause the OSDs to thrash.
Here's part of a stack trace from one of the OSDs that crashed with a suicide timeout:
*** Caught signal (Aborted) ** in thread 7f9317b89700 thread_name:tp_osd_tp
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f935160bce0]
 2: pread64()
 3: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, IOContext*, bool)+0x2dd) [0x5579df8ddbdd]
 4: (BlueFS::_read(BlueFS::FileReader*, unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, char*)+0x79c) [0x5579df47dd>
 5: (BlueRocksRandomAccessFile::Prefetch(unsigned long, unsigned long)+0x1e) [0x5579df4ac29e]
 6: (rocksdb::LegacyRandomAccessFileWrapper::Prefetch(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::IODebug>
 7: (rocksdb::RandomAccessFileReader::Prefetch(unsigned long, unsigned long) const+0x2e) [0x5579dfaa793e]
 8: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0x1a6) [0x5579dfab39f6]
 9: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindBlockForward()+0x1c6) [0x5579dfab4876]
 10: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::Next()+0x58) [0x5579dfab4978]
 11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::NextAndGetResult(rocksdb::IterateResult*)+0x16)>
 12: /usr/bin/ceph-osd(+0x12194b1) [0x5579dfa164b1]
 13: (rocksdb::MergingIterator::NextAndGetResult(rocksdb::IterateResult*)+0x50) [0x5579dfaccc40]
 14: (rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*)+0x8d2) [0x5579df9a20c2]
 15: (rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*)+0xa7) [0x5579df9a25b7]
 16: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x634) [0x5579df9a4bf4]
 17: (ShardMergeIteratorImpl::lower_bound(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>
 18: /usr/bin/ceph-osd(+0xb54e28) [0x5579df351e28]
 19: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, bool, std::vector<ghobject>
 20: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int,>
 21: (PG::do_delete_work(ceph::os::Transaction&, ghobject_t)+0x2e0) [0x5579def3e5e0]
 22: (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x108) [0x5579df0f8eb8]
 23: (boost::statechart::simple_state<PeeringState::Deleting, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_>
 24: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::n>
 25: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x5579def3ad11]
 26: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x5579deeb171c]
 27: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x5579deeb18f8]
 28: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5579deea34c8]
 29: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5579df5202a4]
 30: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5579df523184]
 31: /lib64/libpthread.so.0(+0x81ca) [0x7f93516011ca]
 32: clone()
It looks like the underlying cause is essentially the same as the one identified in: https://tracker.ceph.com/issues/55324
That is, the lower_bound seek in RocksDB often has to iterate over a large number of tombstones when called from PG::do_delete_work -> BlueStore::collection_list, because the PG's onode keys have recently been deleted. Iterating over a large number of contiguous tombstones is time-consuming.
The solution is to bound the RocksDB iteration so that it doesn't unnecessarily iterate over the tombstones left by previously deleted keys.
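A minimal sketch of the general technique, not the actual Ceph patch: give the RocksDB iterator an explicit upper bound so Seek()/Next() stops at the end of the key range of interest instead of skipping every tombstone left behind by deleted keys beyond it. The database path and key prefixes below are hypothetical.

// bounded_iter_demo.cc -- sketch only; key layout and path are made up.
#include <cassert>
#include <iostream>
#include <memory>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/bounded_iter_demo", &db);
  assert(s.ok());

  // Two live keys for one "collection", followed by a large deleted range
  // standing in for the tombstones of a recently removed PG.
  db->Put(rocksdb::WriteOptions(), "C0001.obj1", "x");
  db->Put(rocksdb::WriteOptions(), "C0001.obj2", "x");
  for (int i = 0; i < 10000; ++i) {
    std::string k = "C0002.obj" + std::to_string(i);
    db->Put(rocksdb::WriteOptions(), k, "x");
    db->Delete(rocksdb::WriteOptions(), k);
  }

  // Bound the iteration to [start, end): once the iterator passes 'end' it
  // becomes !Valid() instead of skipping every tombstone under "C0002." while
  // searching for the next live key.
  std::string start = "C0001.";
  std::string end = "C0002.";        // exclusive upper bound
  rocksdb::Slice upper(end);
  rocksdb::ReadOptions ro;
  ro.iterate_upper_bound = &upper;   // the Slice must outlive the iterator

  {
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
    for (it->Seek(start); it->Valid(); it->Next()) {
      std::cout << it->key().ToString() << "\n";
    }
  }  // release the iterator before closing the DB

  delete db;
  return 0;
}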
Related issues
Related to Bug #55682: OSD boot takes more than half an hour
History
#1 Updated by yixing hao 3 months ago
We've also observed this on our HDD BlueStore cluster with tens of billions of objects; the stack trace looks like the one above.
7ffad9e90700 0 bluestore(/var/lib/ceph/osd/ceph-342) log_latency_fn slow operation observed for _collection_list, latency = 73.4702s, lat = 73s cid =24.c73s1_head start GHMAX end GHMAX max 30
Maybe we can remove the check here? PG::do_delete_work already checks with collection_list (with a small max, so performance is fine) that there are no objects left in the DB for the collection before it calls remove_collection, while _collection_list in BlueStore::_remove_collection sets max to the number of objects in the PG, which can be quite large.
#2 Updated by yixing hao 3 months ago
yixing hao wrote:
We've also observed this on our HDD BlueStore cluster with tens of billions of objects; the stack trace looks like the one above.
7ffad9e90700 0 bluestore(/var/lib/ceph/osd/ceph-342) log_latency_fn slow operation observed for _collection_list, latency = 73.4702s, lat = 73s cid =24.c73s1_head start GHMAX end GHMAX max 30
Maybe we can remove the check here? PG::do_delete_work already checks with collection_list (with a small max, so performance is fine) that there are no objects left in the DB for the collection before it calls remove_collection, while _collection_list in BlueStore::_remove_collection sets max to the number of objects in the PG, which can be quite large.
And if that check fails in BlueStore::_remove_collection, its caller _txc_add_transaction will simply let the OSD abort. (A simplified sketch of the difference between the two listing calls follows after the PR link below.)
https://github.com/ceph/ceph/pull/49617
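A hypothetical, self-contained sketch of the point above, not the actual Ceph code: the stand-in key/value store, the collection_list helper, and the key names are all made up. An emptiness check only needs max=1, while a listing whose max equals the PG's object count has to walk the whole (possibly huge) range.

// emptiness_check_demo.cc -- illustrative only.
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Stand-in for the onode keyspace (hypothetical, not Ceph's schema).
using KV = std::map<std::string, std::string>;

// Simplified stand-in for _collection_list(): return up to 'max' keys
// carrying the given collection prefix.
std::vector<std::string> collection_list(const KV& kv, const std::string& prefix,
                                         size_t max) {
  std::vector<std::string> out;
  for (auto it = kv.lower_bound(prefix);
       it != kv.end() && it->first.compare(0, prefix.size(), prefix) == 0 &&
       out.size() < max;
       ++it) {
    out.push_back(it->first);
  }
  return out;
}

int main() {
  KV kv = {{"24.c73.obj1", "x"}, {"24.c73.obj2", "x"}, {"25.000.obj1", "x"}};

  // Cheap emptiness check, like the one in PG::do_delete_work: max is small and fixed.
  bool empty = collection_list(kv, "24.c73.", /*max=*/1).empty();
  std::cout << "collection empty? " << std::boolalpha << empty << "\n";

  // The check in _remove_collection effectively lists with max = the PG's
  // object count, which is what makes it expensive on very large PGs.
  auto all = collection_list(kv, "24.c73.", /*max=*/kv.size());
  std::cout << "listed " << all.size() << " objects\n";
  return 0;
}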
#3 Updated by Igor Fedotov about 2 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 49438
#4 Updated by Igor Fedotov 5 days ago
- Backport changed from quincy,pacific to reef,quincy,pacific
#5 Updated by Igor Fedotov 5 days ago
- Related to Bug #55682: OSD boot takes more than half an hour added