Support #64154


Regular ceph daemon crashes causing 2-3 minutes downtime

Added by Akash Wark 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

We are running rook-ceph deployed as an operator in Kubernetes, with Rook version 1.10.8 and Ceph 17.2.5, on CentOS 7.9.

The cluster generally works fine, but an OSD daemon crashes roughly every 3-4 days and then restarts normally. We are also seeing flapping OSDs, i.e. OSDs repeatedly marked up and down.
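To quantify the flapping, one way is to count up/down transitions per OSD from the cluster log. This is only a sketch: the log line format in the sample below is an assumption modeled on typical monitor log output, so the regex may need adjusting for real logs.

```python
import re
from collections import Counter

# Matches lines announcing an OSD state change, e.g. "osd.3 marked down".
# This format is an assumption; adapt the pattern to your actual logs.
OSD_STATE_RE = re.compile(r"\b(osd\.\d+)\s+(?:marked\s+)?(up|down)\b")

def count_flaps(log_lines):
    """Return {osd_id: number of up<->down state changes observed}."""
    transitions = Counter()
    last_state = {}
    for line in log_lines:
        m = OSD_STATE_RE.search(line)
        if not m:
            continue
        osd, state = m.groups()
        # Count only actual state changes, not repeated announcements.
        if last_state.get(osd) not in (None, state):
            transitions[osd] += 1
        last_state[osd] = state
    return dict(transitions)

sample = [
    "2023-12-17T14:50:20 mon.a osd.3 marked down",
    "2023-12-17T14:52:41 mon.a osd.3 marked up",
    "2023-12-17T15:10:02 mon.a osd.3 marked down",
    "2023-12-17T15:11:55 mon.a osd.7 marked down",
]
print(count_flaps(sample))  # → {'osd.3': 2}
```

An OSD with a high transition count over a short window is a flapping candidate worth correlating with the crash timestamps.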

Recently a daemon crash happened for 2 OSDs at the same time on different nodes, with the below error in the crash info:

-305> 2023-12-17T14:50:14.413+0000 7f53b5f91700 -1 *** Caught signal (Aborted) **
in thread 7f53b5f91700 thread_name:tp_osd_tp
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f53d93ddcf0]
2: gsignal()
3: abort()
4: /lib64/libc.so.6(+0x21d79) [0x7f53d8025d79]
5: /lib64/libc.so.6(+0x47456) [0x7f53d804b456]
6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x55acc0f81730]
7: (Message::encode(unsigned long, int, bool)+0x2e) [0x55acc140ec2e]
8: (ProtocolV2::send_message(Message*)+0x25e) [0x55acc16a5aae]
9: (AsyncConnection::send_message(Message*)+0x18e) [0x55acc167dc4e]
10: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x2bd) [0x55acc0b4b11d]
11: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x6c8) [0x55acc0f69368]
12: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x5e7) [0x55acc0f6c907]
13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55acc0c92ebd]
14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd25) [0x55acc0cf0295]
15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x288d) [0x55acc0cf78fd]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55acc0b56900]
17: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55acc0e552ad]
18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55acc0b69dbf]
19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55acc12c78c5]
20: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55acc12c9fe4]
21: /lib64/libpthread.so.0(+0x81ca) [0x7f53d93d31ca]
22: clone()
It also logs the below error before the crash:
scrub-queue::remove_from_osd_queue removing pg[2.4f0] failed. State was: unregistering
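The interesting part of the trace is that the abort happens while encoding an MOSDRepOp replication message (frames 6-8). A small parser can pull that function chain out of such a dump for comparison across crashes; the frame layout below is copied from the trace in this report, and other builds may print frames differently.

```python
import re

# A ceph backtrace frame looks like:
#   6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x55acc0f81730]
# Capture everything inside the parentheses up to the "+0x..." offset.
FRAME_RE = re.compile(r"^\s*\d+:\s+\((?P<fn>[^+]+)\+0x[0-9a-f]+\)")

def frame_functions(backtrace_lines):
    """Return the demangled function names, outermost frame first."""
    fns = []
    for line in backtrace_lines:
        m = FRAME_RE.match(line)
        if m:
            fns.append(m.group("fn").split("(")[0])  # drop the argument list
    return fns

dump = [
    "6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x55acc0f81730]",
    "7: (Message::encode(unsigned long, int, bool)+0x2e) [0x55acc140ec2e]",
    "8: (ProtocolV2::send_message(Message*)+0x25e) [0x55acc16a5aae]",
]
print(frame_functions(dump))
# → ['MOSDRepOp::encode_payload', 'Message::encode', 'ProtocolV2::send_message']
```

If multiple crash dumps share the same chain, that points at a single code path rather than random memory corruption.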

We are not able to see any segfault or kernel errors in syslog during that time period.

Please help us troubleshoot and fix the issue.

