Bug #12523
closedosd suicide timeout during peering - search for missing objects
0%
Description
Peering thread hit suicide timeout, and the logs show that the thread was doing (more than 150 seconds) PG::MissingLoc::add_source_info, which should be able to reset the timeout.
Looking at PG::RecoveryState::start_handle, if there is messages to flush (messages_pending_flush = true), it will create a new RecoveryCtx which lose the original thread handle, as a result, the above procedure will not reset the timeout and triggered the crash.
Updated by huang jun over 8 years ago
hi,guangyang
how did you see what the thread is doing during timeout? by turn on debug_osd level?
do you used lttng to trace it?
Updated by Guang Yang over 8 years ago
Hi Huang Jun,
I have the issue reproduced with debug_osd = 10, what I observed was, the peering thread stuck at PG::MissingLoc::add_source_info with more than 150 seconds, but that should not happen if the ThreadPool::TPHandle* handle is passed properly, the only chance is, the handle passed in is NULL so that reset_tp_timeout never happened..
huang jun wrote:
hi,guangyang
how did you see what the thread is doing during timeout? by turn on debug_osd level?
do you used lttng to trace it?
Updated by huang jun over 8 years ago
Thanks, GuangYang.
We also encounter this kind of problems, it's osd_op_tp threads timeout, which result in osd suicide.
we test 4k random writes, from the lttng log, the thread last record is:
[22:09:54.901842738] (+0.000001067) lab5107 oprequest:mark_flag_point: { cpu_id = 11 }, { pthread_id = 140316928382720 }, { type = 8, num = 601924, tid = 3439120, inc = 1, rmw_flags = 0x4, flag = 0x10, msg = "waiting for subops from 1", old_hit_flag_points = 0xA, new_ hit_flag_points = 0x1A }
the normal log is:
[22:09:48.003170248] (+0.000020173) lab5107 oprequest:mark_flag_point: { cpu_id = 14 }, { pthread_id = 139644379625216 }, { type = 8,num = 601924, tid = 3601389, inc = 1, rmw_flags = 0x4, flag = 0x10, msg = "waiting for subops from 0", old_hit_flag_points = 0xE, new_ hit_flag_points = 0x1E }
[22:09:48.003270001] (+0.000099753) lab5107 osd:opwq_process_finish: { cpu_id = 14 }, { pthread_id = 139644379625216 }, { type = 8, num = 601924, tid = 3601389, inc = 1 }
it seems the thread is stucked after send op to subosd, and never be schduled.
so, what's your opinion of how to debug this?
Updated by Guang Yang over 8 years ago
That looks like a different issue. 'suicide timeout' is ceph's way to make sure the thread workers are responsive, and many many cases could lead to this crash.
Good news is that it is not too hard to debug:
You will need to enable core dump on the host so when it happens you have a core, since the log always have the thread id which caused suicide and then use gdb to switch to that thread, that mostly give you some clues.
Updated by Guang Yang over 8 years ago
Update: after applying this patch, we managed to bring back the cluster.
Although this does not happen frequently, but when it happens, it causes cascading effect, so that I would recommend this one as a backport candidate.
Updated by Kefu Chai over 8 years ago
- Status changed from New to Fix Under Review
- Backport set to hammer
Updated by Sage Weil over 8 years ago
- Priority changed from Normal to Urgent
- Source changed from other to Community (dev)
Updated by Samuel Just over 8 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Loïc Dachary over 8 years ago
- Status changed from Pending Backport to Resolved