Bug #58274 (open)

BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration

Added by Cory Snyder over 1 year ago. Updated 7 days ago.

Status: Pending Backport
Priority: High
Assignee: -
Target version: -
% Done: 0%
Source: Development
Tags: backport_processed
Backport: quincy, reef, squid
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

We've observed very high latencies on BlueStore::_collection_list() operations during PG removal.

On one cluster, these latencies have gotten so bad that they regularly exceed the OSD suicide timeout (150s) and cause the OSDs to thrash.

Here's part of a stacktrace from one of the OSDs that crashes with a suicide timeout:

*** Caught signal (Aborted) **
 in thread 7f9317b89700 thread_name:tp_osd_tp
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f935160bce0]
 2: pread64()
 3: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, IOContext*, bool)+0x2dd) [0x5579df8ddbdd]
 4: (BlueFS::_read(BlueFS::FileReader*, unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, char*)+0x79c) [0x5579df47dd>
 5: (BlueRocksRandomAccessFile::Prefetch(unsigned long, unsigned long)+0x1e) [0x5579df4ac29e]
 6: (rocksdb::LegacyRandomAccessFileWrapper::Prefetch(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::IODebug>
 7: (rocksdb::RandomAccessFileReader::Prefetch(unsigned long, unsigned long) const+0x2e) [0x5579dfaa793e]
 8: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0x1a6) [0x5579dfab39f6]
 9: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindBlockForward()+0x1c6) [0x5579dfab4876]
 10: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::Next()+0x58) [0x5579dfab4978]
 11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::NextAndGetResult(rocksdb::IterateResult*)+0x16)>
 12: /usr/bin/ceph-osd(+0x12194b1) [0x5579dfa164b1]
 13: (rocksdb::MergingIterator::NextAndGetResult(rocksdb::IterateResult*)+0x50) [0x5579dfaccc40]
 14: (rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*)+0x8d2) [0x5579df9a20c2]
 15: (rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*)+0xa7) [0x5579df9a25b7]
 16: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x634) [0x5579df9a4bf4]
 17: (ShardMergeIteratorImpl::lower_bound(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>
 18: /usr/bin/ceph-osd(+0xb54e28) [0x5579df351e28]
 19: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, bool, std::vector<ghobject>
 20: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int,>
 21: (PG::do_delete_work(ceph::os::Transaction&, ghobject_t)+0x2e0) [0x5579def3e5e0]
 22: (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x108) [0x5579df0f8eb8]
 23: (boost::statechart::simple_state<PeeringState::Deleting, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_>
 24: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::n>
 25: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x5579def3ad11]
 26: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x5579deeb171c]
 27: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x5579deeb18f8]
 28: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5579deea34c8]
 29: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5579df5202a4]
 30: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5579df523184]
 31: /lib64/libpthread.so.0(+0x81ca) [0x7f93516011ca]
 32: clone()

It looks like the underlying cause is essentially the same as the one identified in: https://tracker.ceph.com/issues/55324

That is, when called from PG::do_delete_work -> BlueStore::collection_list, the lower_bound seek operation in rocksdb often has to iterate over a large number of tombstones left behind by the recent deletion of the PG's onode keys, and iterating over a long run of contiguous tombstones is time-consuming.

The solution is to bound the rocksdb iteration so that it does not iterate over the tombstones of previously deleted keys.
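As a rough illustration of the general technique (this is only a sketch, not the actual Ceph patch; the function and key names below are hypothetical), RocksDB lets a caller attach an upper bound to an iterator via ReadOptions::iterate_upper_bound, so a Seek()/Next() sequence stops at the end of the key range of interest instead of scanning on through tombstone runs beyond it:

#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>

// Hypothetical helper: iterate one collection's onode key range
// [range_start, range_end) without letting RocksDB scan past range_end.
void list_range_bounded(rocksdb::DB* db,
                        const std::string& range_start,
                        const std::string& range_end)
{
  rocksdb::ReadOptions ro;
  rocksdb::Slice upper(range_end);
  // The bound must outlive the iterator; it tells RocksDB to stop as soon
  // as keys reach range_end rather than skipping tombstones beyond it.
  ro.iterate_upper_bound = &upper;

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(range_start); it->Valid(); it->Next()) {
    // it->key() / it->value(): live entries within the bounded range.
  }
}

With a bound like this in place, the iteration ends at the edge of the collection's key range instead of continuing through tombstones left by previously deleted keys, which is what was driving _collection_list() past the suicide timeout.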


Related issues (7 open, 0 closed)

Related to bluestore - Bug #55682: OSD boot takes more than half an hour (Triaged)

Copied to bluestore - Backport #65936: quincy: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65937: squid: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65938: reef: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65939: quincy: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65940: squid: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65941: reef: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
