Bug #59141 (closed): osds crash periodically

Added by Tobias Florek about 1 year ago. Updated about 1 year ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am using ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) from the latest container image on an openshift (OKD) cluster on linux 6.1.11-200.fc37.x86_64.

The Ceph cluster is experiencing crashes, mostly on one node: two OSDs on HDD and one on NVMe. They crash periodically.

I attached the log of the last crash.

I can provide additional logs and information if requested.


Files

rook-ceph-osd-4.log (18.5 KB) Tobias Florek, 03/23/2023 08:34 AM
rook-ceph-osd-4.log.gz (302 KB) Tobias Florek, 03/23/2023 10:35 AM
rook-ceph-osd-12.log.gz (339 KB) Tobias Florek, 03/23/2023 10:35 AM
Actions #1

Updated by Tobias Florek about 1 year ago

Missed the log upload in the original report; attached now.

Actions #2

Updated by Igor Fedotov about 1 year ago

Hi Tobias,
could you please share more lines (preferably 20K+ lines before the crash) from the already attached log?
And please share another log from a different OSD so we can make sure it faces the same issue.

Also, I'm curious whether this issue started happening recently or has been with you for a while. Can you locate the first occurrence in the logs?
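For reference, on a Rook-managed cluster a longer excerpt can usually be captured roughly along these lines; the namespace and pod name below are assumptions based on Rook defaults, not verified against this cluster:

```
# Capture ~25k lines from the crashed OSD container; --previous reads the log
# of the container instance that crashed (the pod suffix is a placeholder).
kubectl -n rook-ceph logs rook-ceph-osd-4-<pod-suffix> --previous --tail=25000 \
  | gzip > rook-ceph-osd-4.log.gz
```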

Actions #3

Updated by Igor Fedotov about 1 year ago

  • Project changed from Ceph to bluestore

Backtrace, for matching against related issues if any:

debug 0> 2023-03-23T08:20:22.461+0000 7f111e316700 -1 *** Caught signal (Bus error) **
in thread 7f111e316700 thread_name:tp_osd_tp

ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f113ff5fcf0]
2: (rocksdb::MemTableIterator::NextAndGetResult(rocksdb::IterateResult*)+0x99) [0x560f8cf645c9]
3: (rocksdb::MergingIterator::Next()+0x32) [0x560f8d078e52]
4: (rocksdb::MergingIterator::NextAndGetResult(rocksdb::IterateResult*)+0x13) [0x560f8d077193]
5: (rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*)+0x6be) [0x560f8cf28c2e]
6: (rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*)+0xa0) [0x560f8cf29a40]
7: (rocksdb::DBIter::Next()+0x1ac) [0x560f8cf29c9c]
8: (ShardMergeIteratorImpl::next()+0x65) [0x560f8ce8a125]
9: ceph-osd(+0xc0a482) [0x560f8c824482]
10: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, bool, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x1805) [0x560f8c846ae5]
11: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0xeb) [0x560f8c847adb]
12: (PG::do_delete_work(ceph::os::Transaction&, ghobject_t)+0x25b) [0x560f8c384fcb]
13: (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x189) [0x560f8c5d1a09]
14: (boost::statechart::simple_state<PeeringState::Deleting, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd9) [0x560f8c63afb9]
15: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x73) [0x560f8c3913c3]
16: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x129) [0x560f8c375399]
17: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2e5) [0x560f8c2d4435]
18: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x3f7) [0x560f8c2d4b17]
19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x560f8c2c5dbf]
20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x560f8ca238c5]
21: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x560f8ca25fe4]
22: /lib64/libpthread.so.0(+0x81ca) [0x7f113ff551ca]

Actions #4

Updated by Tobias Florek about 1 year ago

I attached two logs with 25k lines. It started quite a while ago. Ceph is so robust that it hardly was only a nuisance. Sorry for not reporting earlier.

Actions #5

Updated by Tobias Florek about 1 year ago

Sorry, "it was only a nuisance".

Actions #6

Updated by Igor Fedotov about 1 year ago

Thanks for the logs.
Unfortunately there is not much helpful information there. The backtraces aren't exactly the same; the only thing in common is that both crashes occur in the RocksDB call stack.
Nor do I recall "Caught signal (Bus error)" being frequently seen in Ceph issues, which makes me think it's rather a system and/or H/W problem.

Have you checked the system log for errors, particularly ones with timestamps matching the Ceph crashes?
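For example, something along these lines; the time window below is just the crash timestamp taken from the attached log:

```
# Look for machine-check / memory / EDAC errors around the crash time.
journalctl -k --since "2023-03-23 08:15" --until "2023-03-23 08:25" \
  | grep -Ei 'mce|edac|hardware error'

# Or scan the kernel ring buffer for error-level messages only.
dmesg --level=err,crit | grep -Ei 'mce|memory'
```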

Also you might want to run OSD fsck for the affected daemons using ceph-bluestore-tool to see if there are any data consistency issues. Hopefully it will give some insight.

Actions #7

Updated by Tobias Florek about 1 year ago

fsck looked alright for OSDs 4 and 12.

# ceph-bluestore-tool --command fsck --path /var/lib/ceph/osd/ceph-4/
fsck success

But you are right about the hardware:
```
Mar 23 08:20:22 worker03 kernel: MCE: Killing tp_osd_tp:811848 due to hardware memory corruption fault at 560ff8097378
Mar 23 08:20:22 worker03 audit[810923]: ANOM_ABEND auid=4294967295 uid=167 gid=167 ses=4294967295 subj=system_u:system_r:spc_t:s0 pid=810923 comm="tp_osd_tp" exe="/usr/bin/ceph-osd" sig=7 res=1
```
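If the node exposes EDAC counters, they can help narrow down which memory controller/DIMM is failing; this is a generic sketch and the sysfs layout may vary by platform:

```
# Per-memory-controller corrected/uncorrected error counts (if EDAC is loaded).
grep -H . /sys/devices/system/edac/mc/mc*/ce_count \
          /sys/devices/system/edac/mc/mc*/ue_count
```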

Actions #8

Updated by Igor Fedotov about 1 year ago

So this is a H/W issue unrelated to Ceph, right?

Can we close the ticket then?

Actions #9

Updated by Tobias Florek about 1 year ago

I confirmed that after fixing the memory, no more crashes have occurred for more than a day now. Closing. Thank you for diagnosing the issue. I'll have to set up alerts to notify us if this kind of issue arises again.
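A minimal sketch of such an alert, assuming a cron job or systemd timer and a placeholder webhook endpoint:

```
#!/bin/sh
# Periodic check: notify if the kernel logged machine-check or memory-corruption
# events in the last hour. The webhook URL below is a placeholder.
if journalctl -k --since "1 hour ago" \
     | grep -Eiq 'mce|hardware memory corruption|machine check'; then
  curl -s -X POST --data "MCE/memory error detected on $(hostname)" \
    https://alerts.example.com/webhook   # placeholder endpoint
fi
```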

Actions #10

Updated by Tobias Florek about 1 year ago

Sorry, I don't know how to close this issue; I can only add comments.

Actions #11

Updated by Kefu Chai about 1 year ago

  • Status changed from New to Rejected

Marked as Rejected, since it's a hardware issue.
