Bug #63872
Consistent OSD crashes which recover without any problem
Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Description
We are running rook-ceph deployed as an operator in Kubernetes, with Rook version 1.10.8 and Ceph 17.2.5.
It is working fine overall, but we are seeing a frequent OSD daemon crash every 3-4 days, after which the OSD restarts without any problem. We are also seeing flapping OSDs, i.e. OSDs repeatedly going up and down.
Recently a daemon crash happened for 2 OSDs at the same time, on different nodes, with the below error in the crash info:
-305> 2023-12-17T14:50:14.413+0000 7f53b5f91700 -1 *** Caught signal (Aborted) **
in thread 7f53b5f91700 thread_name:tp_osd_tp
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f53d93ddcf0]
2: gsignal()
3: abort()
4: /lib64/libc.so.6(+0x21d79) [0x7f53d8025d79]
5: /lib64/libc.so.6(+0x47456) [0x7f53d804b456]
6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x55acc0f81730]
7: (Message::encode(unsigned long, int, bool)+0x2e) [0x55acc140ec2e]
8: (ProtocolV2::send_message(Message*)+0x25e) [0x55acc16a5aae]
9: (AsyncConnection::send_message(Message*)+0x18e) [0x55acc167dc4e]
10: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x2bd) [0x55acc0b4b11d]
11: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x6c8) [0x55acc0f69368]
12: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x5e7) [0x55acc0f6c907]
13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55acc0c92ebd]
14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd25) [0x55acc0cf0295]
15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x288d) [0x55acc0cf78fd]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55acc0b56900]
17: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55acc0e552ad]
18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55acc0b69dbf]
19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55acc12c78c5]
20: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55acc12c9fe4]
21: /lib64/libpthread.so.0(+0x81ca) [0x7f53d93d31ca]
22: clone()
The log also shows the below error before the crash:
scrub-queue::*remove_from_osd_queue* removing pg[2.4f0] failed. State was: unregistering
Please help to troubleshoot and fix the issue.
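For anyone triaging a report like this: Ceph records daemon crashes via its crash module, so the full backtrace and metadata above can be pulled directly from the cluster. A minimal sketch of the diagnostic commands (run from the rook-ceph toolbox pod; the guard is an addition here so the script exits cleanly on hosts without the ceph CLI):

```shell
#!/usr/bin/env bash
# Sketch: collect crash and OSD-flapping diagnostics from a Rook/Ceph cluster.
# Assumes it is run inside the rook-ceph toolbox pod against a live cluster.
set -euo pipefail

# Guard (added for safety): do nothing on a host without the ceph CLI.
if ! command -v ceph >/dev/null 2>&1; then
    echo "ceph CLI not found; run this inside the rook-ceph toolbox pod" >&2
    exit 0
fi

# List all recorded daemon crashes with their crash IDs and timestamps.
ceph crash ls

# Show the full metadata and backtrace for one crash.
# <crash-id> is a placeholder; substitute an ID printed by 'ceph crash ls'.
# ceph crash info <crash-id>

# Cluster health and OSD state; flapping OSDs show up as repeated
# up/down transitions in 'ceph -s' and in the OSD map epochs.
ceph -s
ceph osd tree
```

After the crashes have been investigated, `ceph crash archive <crash-id>` (or `ceph crash archive-all`) clears the corresponding `RECENT_CRASH` health warning.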