Bug #63872
Consistent OSD crashes which recover without any problem
Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Description
We are running rook-ceph deployed as an operator in Kubernetes, with Rook version 1.10.8 and Ceph 17.2.5.
It is working fine overall, but we are seeing a frequent OSD daemon crash every 3-4 days, after which the OSD restarts without any problem. We are also seeing flapping OSDs, i.e. OSDs repeatedly going up and down.
Recently a daemon crash happened for 2 OSDs at the same time, on different nodes, with the below error in the crash info:
-305> 2023-12-17T14:50:14.413+0000 7f53b5f91700 -1 *** Caught signal (Aborted) **
in thread 7f53b5f91700 thread_name:tp_osd_tp
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f53d93ddcf0]
2: gsignal()
3: abort()
4: /lib64/libc.so.6(+0x21d79) [0x7f53d8025d79]
5: /lib64/libc.so.6(+0x47456) [0x7f53d804b456]
6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x55acc0f81730]
7: (Message::encode(unsigned long, int, bool)+0x2e) [0x55acc140ec2e]
8: (ProtocolV2::send_message(Message*)+0x25e) [0x55acc16a5aae]
9: (AsyncConnection::send_message(Message*)+0x18e) [0x55acc167dc4e]
10: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x2bd) [0x55acc0b4b11d]
11: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x6c8) [0x55acc0f69368]
12: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x5e7) [0x55acc0f6c907]
13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55acc0c92ebd]
14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd25) [0x55acc0cf0295]
15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x288d) [0x55acc0cf78fd]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55acc0b56900]
17: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55acc0e552ad]
18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55acc0b69dbf]
19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55acc12c78c5]
20: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55acc12c9fe4]
21: /lib64/libpthread.so.0(+0x81ca) [0x7f53d93d31ca]
22: clone()
The log also shows the below error before the crash:
scrub-queue::*remove_from_osd_queue* removing pg[2.4f0] failed. State was: unregistering
Please help to troubleshoot and fix the issue.
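For anyone triaging a report like this: Ceph records daemon crashes via its crash module, so the full backtrace and metadata above can be pulled directly from the cluster. A minimal sketch of the diagnostic commands (run from the rook-ceph toolbox pod; the guard is an addition here so the script exits cleanly on hosts without the ceph CLI):

```shell
#!/usr/bin/env bash
# Sketch: collect crash and OSD-flapping diagnostics from a Rook/Ceph cluster.
# Assumes it is run inside the rook-ceph toolbox pod against a live cluster.
set -euo pipefail

# Guard (added for safety): do nothing on a host without the ceph CLI.
if ! command -v ceph >/dev/null 2>&1; then
    echo "ceph CLI not found; run this inside the rook-ceph toolbox pod" >&2
    exit 0
fi

# List all recorded daemon crashes with their crash IDs and timestamps.
ceph crash ls

# Show the full metadata and backtrace for one crash.
# <crash-id> is a placeholder; substitute an ID printed by 'ceph crash ls'.
# ceph crash info <crash-id>

# Cluster health and OSD state; flapping OSDs show up as repeated
# up/down transitions in 'ceph -s' and in the OSD map epochs.
ceph -s
ceph osd tree
```

After the crashes have been investigated, `ceph crash archive <crash-id>` (or `ceph crash archive-all`) clears the corresponding `RECENT_CRASH` health warning.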