Bug #58274

BlueStore::collection_list becomes extremely slow due to unbounded rocksdb iteration

Added by Cory Snyder over 1 year ago. Updated about 1 year ago.

Status:
Fix Under Review
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
reef, quincy, pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've observed some very high latencies on BlueStore::_collection_list() operations during PG removal.

On one cluster, these latencies have gotten so bad that they regularly exceed the OSD suicide timeout (150s) and cause the OSDs to thrash.

Here's part of a stacktrace from one of the OSDs that crashes with a suicide timeout:

*** Caught signal (Aborted) **
 in thread 7f9317b89700 thread_name:tp_osd_tp
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f935160bce0]
 2: pread64()
 3: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, IOContext*, bool)+0x2dd) [0x5579df8ddbdd]
 4: (BlueFS::_read(BlueFS::FileReader*, unsigned long, unsigned long, ceph::buffer::v15_2_0::list*, char*)+0x79c) [0x5579df47dd>
 5: (BlueRocksRandomAccessFile::Prefetch(unsigned long, unsigned long)+0x1e) [0x5579df4ac29e]
 6: (rocksdb::LegacyRandomAccessFileWrapper::Prefetch(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::IODebug>
 7: (rocksdb::RandomAccessFileReader::Prefetch(unsigned long, unsigned long) const+0x2e) [0x5579dfaa793e]
 8: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0x1a6) [0x5579dfab39f6]
 9: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindBlockForward()+0x1c6) [0x5579dfab4876]
 10: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::Next()+0x58) [0x5579dfab4978]
 11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::NextAndGetResult(rocksdb::IterateResult*)+0x16)>
 12: /usr/bin/ceph-osd(+0x12194b1) [0x5579dfa164b1]
 13: (rocksdb::MergingIterator::NextAndGetResult(rocksdb::IterateResult*)+0x50) [0x5579dfaccc40]
 14: (rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*)+0x8d2) [0x5579df9a20c2]
 15: (rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*)+0xa7) [0x5579df9a25b7]
 16: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x634) [0x5579df9a4bf4]
 17: (ShardMergeIteratorImpl::lower_bound(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>
 18: /usr/bin/ceph-osd(+0xb54e28) [0x5579df351e28]
 19: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, bool, std::vector<ghobject>
 20: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int,>
 21: (PG::do_delete_work(ceph::os::Transaction&, ghobject_t)+0x2e0) [0x5579def3e5e0]
 22: (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x108) [0x5579df0f8eb8]
 23: (boost::statechart::simple_state<PeeringState::Deleting, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_>
 24: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::n>
 25: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x5579def3ad11]
 26: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x5579deeb171c]
 27: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x5579deeb18f8]
 28: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5579deea34c8]
 29: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5579df5202a4]
 30: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5579df523184]
 31: /lib64/libpthread.so.0(+0x81ca) [0x7f93516011ca]
 32: clone()

It looks like the underlying cause is essentially the same as the one identified in: https://tracker.ceph.com/issues/55324

That is, the lower_bound seek operation in rocksdb often has to iterate over a large number of tombstones when called from PG::do_delete_work -> BlueStore::collection_list, due to the recent deletion of the PG's onode keys. Iterating over a large number of contiguous tombstones is time-consuming.

The solution is to bound the rocksdb iteration so that it doesn't unnecessarily iterate over the tombstones left by previously deleted keys.
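
For illustration, here is a minimal standalone sketch of the general technique (not the change from the pull request referenced in the history below): RocksDB's ReadOptions::iterate_upper_bound lets the caller bound an iterator so that Seek()/Next() stops at the end of the wanted key range instead of skipping over every tombstone beyond it. The database path and key prefix below are made up for the example.

// Minimal sketch of bounding a RocksDB iteration; an illustration of the
// general technique only, not the actual fix. Compiles against librocksdb.
#include <cassert>
#include <iostream>
#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bounded-iter-demo", &db);
  assert(s.ok());

  // Keys of one logical "collection" share a common prefix, similar in
  // spirit to how BlueStore groups a PG's onode keys.
  const std::string prefix_start = "coll0001.";
  const std::string prefix_end   = "coll0001/";  // '/' sorts right after '.'

  rocksdb::ReadOptions ropts;
  rocksdb::Slice upper(prefix_end);
  // Without this bound, an iterator positioned at the end of the live keys
  // may keep skipping tombstones left by keys deleted just past the range.
  ropts.iterate_upper_bound = &upper;

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ropts));
  for (it->Seek(prefix_start); it->Valid(); it->Next()) {
    std::cout << it->key().ToString() << " -> " << it->value().ToString() << "\n";
  }
  assert(it->status().ok());

  delete db;
  return 0;
}

In BlueStore the bound would come from the collection's own key range rather than a hand-built prefix; see the pull request referenced below for the actual change.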


Related issues

Related to bluestore - Bug #55682: OSD boot takes more than half an hour Triaged

History

#1 Updated by yixing hao about 1 year ago

We also observed this on our HDD BlueStore cluster with tens of billions of objects; the stack trace looks like the one above.

7ffad9e90700 0 bluestore(/var/lib/ceph/osd/ceph-342) log_latency_fn slow operation observed for _collection_list, latency = 73.4702s, lat = 73s cid =24.c73s1_head start GHMAX end GHMAX max 30

Maybe we can remove the check here? PG::do_delete_work already checks with collection_list (which limits max, so performance is okay) to ensure there are no objects left in the DB for the collection before remove_collection, whereas _collection_list in BlueStore::_remove_collection sets max to the number of objects in the PG, which can be quite large.
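
As a hypothetical sketch of that point (not actual Ceph code; the helper name, the header path, and the exact collection_list signature are assumptions based on the ObjectStore interface visible in the stack trace above), an emptiness check only needs to list a single object, so its cost stays bounded no matter how many onode keys the PG used to have:

// Hypothetical helper, assuming the ObjectStore::collection_list interface
// seen in the stack trace; not code from this tracker or the PR.
#include <vector>
#include "os/ObjectStore.h"  // header path inside the Ceph source tree (assumption)

bool collection_looks_empty(ObjectStore *store,
                            ObjectStore::CollectionHandle &ch)
{
  std::vector<ghobject_t> objects;
  ghobject_t next;
  // max = 1: a single surviving onode is enough to prove non-emptiness,
  // so there is no need to list up to the PG's full object count.
  store->collection_list(ch, ghobject_t(), ghobject_t::get_max(), 1,
                         &objects, &next);
  return objects.empty();
}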

#2 Updated by yixing hao about 1 year ago

yixing hao wrote:

We also observed this on our HDD BlueStore cluster with tens of billions of objects; the stack trace looks like the one above.

7ffad9e90700 0 bluestore(/var/lib/ceph/osd/ceph-342) log_latency_fn slow operation observed for _collection_list, latency = 73.4702s, lat = 73s cid =24.c73s1_head start GHMAX end GHMAX max 30

Maybe we can remove the check here? PG::do_delete_work already checks with collection_list (which limits max, so performance is okay) to ensure there are no objects left in the DB for the collection before remove_collection, whereas _collection_list in BlueStore::_remove_collection sets max to the number of objects in the PG, which can be quite large.

And if the check does not pass in BlueStore::_remove_collection, its caller _txc_add_transaction will simply let the OSD abort.
https://github.com/ceph/ceph/pull/49617

#3 Updated by Igor Fedotov about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49438

#4 Updated by Igor Fedotov about 1 year ago

  • Backport changed from quincy,pacific to reef, quincy,pacific

#5 Updated by Igor Fedotov about 1 year ago

  • Related to Bug #55682: OSD boot takes more than half an hour added
