Bug #57039

OSD crash

Added by Torkil Svensgaard over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Busy cluster with OSDs going up/down and in/out, and eventually some of them stop for good. Here's the top of a stack trace from an actual crash; I'll be happy to provide whatever other info you would want:

2022-08-03T10:34:50.179+0000 7fdedd02f700 -1 *** Caught signal (Aborted) **
in thread 7fdedd02f700 thread_name:tp_osd_tp

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
1: /lib64/libpthread.so.0(+0x12c20) [0x7fdf00a78c20]
2: pread64()
3: (KernelDevice::read_random(unsigned long, unsigned long, char*, bool)+0x40d) [0x55701b4c0f0d]
4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x60d) [0x55701b05ee7d]
5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x24) [0x55701b08e6d4]
6: (rocksdb::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const+0x26) [0x55701b529396]
7: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const+0xdc7) [0x55701b745267]
8: (rocksdb::BlockFetcher::ReadBlockContents()+0x4b5) [0x55701b69fa45]
9: (rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const+0x919) [0x55701b691ba9]
10: (rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const+0x286) [0x55701b691f86]
11: (rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)+0xf1) [0x55701b760891]
12: (rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const+0xfe) [0x55701b760b5e]
13: (rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const+0x43) [0x55701b69aae3]
14: (rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::ReadOptions const&, rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransform const*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const+0x9e) [0x55701b68118e]
15: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bool)+0x180) [0x55701b6829a0]
16: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, rocksdb::HistogramImpl*, bool, int)+0x170) [0x55701b5e2370]
17: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool)+0x3ad) [0x55701b5ffbbd]
18: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions)+0x537) [0x55701b50fbe7]
19: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)+0x4d) [0x55701b5104fd]
20: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x35b) [0x55701b4d617b]
21: (BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0xf7) [0x55701af50b57]
22: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >)+0xeaf) [0x55701af908cf]
23: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >)+0x3a8) [0x55701af91f68]
24: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xce) [0x55701af9289e]
25: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x22c) [0x55701af9441c]
26: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1fcf) [0x55701afb8dbf]
27: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x55701afba1a6]
28: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x55701aad7765]
29: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0xf3) [0x55701aa6c6d3]
30: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8) [0x55701aa9ee98]
31: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x55701aa9f038]
32: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x55701aa90d48]
33: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55701b1025b4]
34: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55701b105254]
35: /lib64/libpthread.so.0(+0x817a) [0x7fdf00a6e17a]
36: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Best regards,

Torkil
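
In case it helps with triage, the crash module on a 16.2.7 cluster should also have captured this; a minimal sketch of pulling the stored metadata and full backtrace (the crash id is a placeholder, not taken from this report):

ceph crash ls                  # list recorded crashes and their ids
ceph crash info <crash-id>     # dump version, host info and backtrace for one crash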

#1

Updated by Torkil Svensgaard over 1 year ago

Torkil Svensgaard wrote:

Busy cluster with OSDs going up/down and in/out, and eventually some of them stop for good. Here's the top of a stack trace from an actual crash; I'll be happy to provide whatever other info you would want:

End of the log from one going down for good:

"
2022-08-04T09:57:31.752+0000 7f3812cb2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.837+0000 7f3813cb4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.838+0000 7f3812cb2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.906+0000 7f3812cb2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.906+0000 7f3813cb4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.913+0000 7f38134b3700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.914+0000 7f3813cb4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.976+0000 7f38134b3700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:31.976+0000 7f3812cb2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:35.520+0000 7f37f5a67700 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:35.520+0000 7f37f5a67700 0 bluestore(/var/lib/ceph/osd/ceph-102) log_latency slow operation observed for submit_transact, latency = 63.815132141s
2022-08-04T09:57:35.531+0000 7f3806c9a700 0 bluestore(/var/lib/ceph/osd/ceph-102) log_latency_fn slow operation observed for _txc_committed_kv, latency = 63.825824738s, txc = 0x55f84d31d500
2022-08-04T09:57:35.554+0000 7f3802c92700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.102 down, but it is still running
2022-08-04T09:57:35.554+0000 7f3802c92700 0 log_channel(cluster) log [DBG] : map e138671 wrongly marked me down at e138668
2022-08-04T09:57:35.554+0000 7f3802c92700 -1 osd.102 138671 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2022-08-04T09:57:35.558+0000 7f3802c92700 0 osd.102 138671 _committed_osd_maps shutdown OSD via async signal
2022-08-04T09:57:35.558+0000 7f38114af700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2022-08-04T09:57:35.558+0000 7f38114af700 -1 osd.102 138671 *** Got signal Interrupt ***
2022-08-04T09:57:35.558+0000 7f38114af700 -1 osd.102 138671 *** Immediate shutdown (osd_fast_shutdown=true) ***
"
