Bug #58274 (open)

BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration

Added by Cory Snyder over 1 year ago. Updated 7 days ago.

Status: Pending Backport
Priority: High
Assignee: -
Target version: -
% Done: 0%
Source: Development
Tags: backport_processed
Backport: quincy, reef, squid
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

We've observed very high latencies on BlueStore::_collection_list() operations during PG removal.

On one cluster, these latencies have gotten so bad that they regularly exceed the OSD suicide timeout (150s) and cause the OSDs to thrash.

Here's part of a stacktrace from one of the OSDs that crashes with a suicide timeout:

*** Caught signal (Aborted) **
 in thread 7f9317b89700 thread_name:tp_osd_tp
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f935160bce0]
 2: pread64()
 3: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, IOContext*, bool)+0x2dd) [0x5579df8ddbdd]
 4: (BlueFS::_read(BlueFS::FileReader*, unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, char*)+0x79c) [0x5579df47dd>
 5: (BlueRocksRandomAccessFile::Prefetch(unsigned long, unsigned long)+0x1e) [0x5579df4ac29e]
 6: (rocksdb::LegacyRandomAccessFileWrapper::Prefetch(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::IODebug>
 7: (rocksdb::RandomAccessFileReader::Prefetch(unsigned long, unsigned long) const+0x2e) [0x5579dfaa793e]
 8: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0x1a6) [0x5579dfab39f6]
 9: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindBlockForward()+0x1c6) [0x5579dfab4876]
 10: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::Next()+0x58) [0x5579dfab4978]
 11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::NextAndGetResult(rocksdb::IterateResult*)+0x16)>
 12: /usr/bin/ceph-osd(+0x12194b1) [0x5579dfa164b1]
 13: (rocksdb::MergingIterator::NextAndGetResult(rocksdb::IterateResult*)+0x50) [0x5579dfaccc40]
 14: (rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*)+0x8d2) [0x5579df9a20c2]
 15: (rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*)+0xa7) [0x5579df9a25b7]
 16: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x634) [0x5579df9a4bf4]
 17: (ShardMergeIteratorImpl::lower_bound(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>
 18: /usr/bin/ceph-osd(+0xb54e28) [0x5579df351e28]
 19: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, bool, std::vector<ghobject>
 20: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int,>
 21: (PG::do_delete_work(ceph::os::Transaction&, ghobject_t)+0x2e0) [0x5579def3e5e0]
 22: (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x108) [0x5579df0f8eb8]
 23: (boost::statechart::simple_state<PeeringState::Deleting, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_>
 24: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::n>
 25: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x5579def3ad11]
 26: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x5579deeb171c]
 27: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x5579deeb18f8]
 28: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5579deea34c8]
 29: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5579df5202a4]
 30: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5579df523184]
 31: /lib64/libpthread.so.0(+0x81ca) [0x7f93516011ca]
 32: clone()

It looks like the underlying cause is essentially the same as the one identified in: https://tracker.ceph.com/issues/55324

That is, when called from PG::do_delete_work -> BlueStore::collection_list, the lower_bound seek operation in rocksdb often has to iterate over a large number of tombstones left behind by the recent deletion of the PG's onode keys, and iterating over a long run of contiguous tombstones is time-consuming.

The solution is to bound the rocksdb iteration so that it does not iterate over the tombstones of previously deleted keys.
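As a rough illustration of the general technique (this is only a sketch, not the actual Ceph patch; the function and key names below are hypothetical), RocksDB lets a caller attach an upper bound to an iterator via ReadOptions::iterate_upper_bound, so a Seek()/Next() sequence stops at the end of the key range of interest instead of scanning on through tombstone runs beyond it:

#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>

// Hypothetical helper: iterate one collection's onode key range
// [range_start, range_end) without letting RocksDB scan past range_end.
void list_range_bounded(rocksdb::DB* db,
                        const std::string& range_start,
                        const std::string& range_end)
{
  rocksdb::ReadOptions ro;
  rocksdb::Slice upper(range_end);
  // The bound must outlive the iterator; it tells RocksDB to stop as soon
  // as keys reach range_end rather than skipping tombstones beyond it.
  ro.iterate_upper_bound = &upper;

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(range_start); it->Valid(); it->Next()) {
    // it->key() / it->value(): live entries within the bounded range.
  }
}

With a bound like this in place, the iteration ends at the edge of the collection's key range instead of continuing through tombstones left by previously deleted keys, which is what was driving _collection_list() past the suicide timeout.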


Related issues (7 open, 0 closed)

Related to bluestore - Bug #55682: OSD boot takes more than half an hour (Triaged)

Copied to bluestore - Backport #65936: quincy: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65937: squid: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65938: reef: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65939: quincy: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65940: squid: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
Copied to bluestore - Backport #65941: reef: BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration (New, Cory Snyder)
