Bug #58274
BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration
Description
We've observed some very high latencies on BlueStore::_collection_list() operations during PG removal.
On one cluster, these latencies have gotten so bad that they regularly exceed the OSD suicide timeout (150s) and cause the OSDs to thrash.
Here's part of a stacktrace from one of the OSDs that crashes with a suicide timeout:
*** Caught signal (Aborted) **
 in thread 7f9317b89700 thread_name:tp_osd_tp
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f935160bce0]
 2: pread64()
 3: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, IOContext*, bool)+0x2dd) [0x5579df8ddbdd]
 4: (BlueFS::_read(BlueFS::FileReader*, unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, char*)+0x79c) [0x5579df47dd>
 5: (BlueRocksRandomAccessFile::Prefetch(unsigned long, unsigned long)+0x1e) [0x5579df4ac29e]
 6: (rocksdb::LegacyRandomAccessFileWrapper::Prefetch(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::IODebug>
 7: (rocksdb::RandomAccessFileReader::Prefetch(unsigned long, unsigned long) const+0x2e) [0x5579dfaa793e]
 8: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0x1a6) [0x5579dfab39f6]
 9: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindBlockForward()+0x1c6) [0x5579dfab4876]
 10: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::Next()+0x58) [0x5579dfab4978]
 11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::NextAndGetResult(rocksdb::IterateResult*)+0x16)>
 12: /usr/bin/ceph-osd(+0x12194b1) [0x5579dfa164b1]
 13: (rocksdb::MergingIterator::NextAndGetResult(rocksdb::IterateResult*)+0x50) [0x5579dfaccc40]
 14: (rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*)+0x8d2) [0x5579df9a20c2]
 15: (rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*)+0xa7) [0x5579df9a25b7]
 16: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x634) [0x5579df9a4bf4]
 17: (ShardMergeIteratorImpl::lower_bound(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>
 18: /usr/bin/ceph-osd(+0xb54e28) [0x5579df351e28]
 19: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, bool, std::vector<ghobject>
 20: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int,>
 21: (PG::do_delete_work(ceph::os::Transaction&, ghobject_t)+0x2e0) [0x5579def3e5e0]
 22: (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x108) [0x5579df0f8eb8]
 23: (boost::statechart::simple_state<PeeringState::Deleting, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_>
 24: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::n>
 25: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x5579def3ad11]
 26: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x5579deeb171c]
 27: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x5579deeb18f8]
 28: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5579deea34c8]
 29: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5579df5202a4]
 30: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5579df523184]
 31: /lib64/libpthread.so.0(+0x81ca) [0x7f93516011ca]
 32: clone()
It looks like the underlying cause is essentially the same as the one identified in: https://tracker.ceph.com/issues/55324
That is, when called from PG::do_delete_work -> BlueStore::collection_list, the lower_bound seek operation in rocksdb often has to iterate over a large number of tombstones left behind by the recent deletion of the PG's onode keys. Iterating over a large run of contiguous tombstones is time-consuming.
The solution is to bound the rocksdb iteration so that it doesn't unnecessarily iterate over the tombstones left by previously deleted keys.
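To illustrate the idea (this is a toy model, not the actual Ceph patch): an LSM-tree seek must step over every tombstone between the seek position and the next live key, so an unbounded seek at the start of a freshly deleted collection can scan far past the collection's own key range before finding anything live. If the iterator is given an upper bound at the end of the collection's range (RocksDB supports this via ReadOptions::iterate_upper_bound), it can stop as soon as it leaves that range. The sketch below simulates a keyspace with std::map, using std::nullopt as a tombstone; all PG names, key formats, and counts are made up for illustration.

```cpp
#include <map>
#include <optional>
#include <string>

// Toy model of an LSM-style keyspace: live keys hold a value, deleted
// keys leave a tombstone (std::nullopt) that a seek must still step
// over until compaction eventually removes it.
using KeySpace = std::map<std::string, std::optional<std::string>>;

struct SeekResult {
  std::optional<std::string> key;  // first live key found, if any
  std::size_t tombstones_scanned = 0;  // work spent skipping tombstones
};

// Unbounded lower_bound-style seek: skips tombstones until it finds a
// live key, no matter how far past the caller's range that key lies.
SeekResult seek_unbounded(const KeySpace& ks, const std::string& start) {
  SeekResult r;
  for (auto it = ks.lower_bound(start); it != ks.end(); ++it) {
    if (it->second) { r.key = it->first; return r; }
    ++r.tombstones_scanned;
  }
  return r;
}

// Bounded seek: identical scan, but stops once the key reaches
// `upper` -- analogous to setting an iteration upper bound so the
// iterator never wanders beyond the collection being listed.
SeekResult seek_bounded(const KeySpace& ks, const std::string& start,
                        const std::string& upper) {
  SeekResult r;
  for (auto it = ks.lower_bound(start);
       it != ks.end() && it->first < upper; ++it) {
    if (it->second) { r.key = it->first; return r; }
    ++r.tombstones_scanned;
  }
  return r;
}
```

With a keyspace where one PG's keys have just been tombstoned and the next live key belongs to a distant PG, the unbounded seek pays for every tombstone in between, while the bounded seek stops at the end of the collection's range and returns "no more objects" after scanning only that collection's own tombstones.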