Bug #12523: osd suicide timeout during peering - search for missing objects - Ceph - Ceph

Actions

Copy link

Bug #12523

closed

osd suicide timeout during peering - search for missing objects

Added by Guang Yang over 8 years ago. Updated over 8 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Guang Yang

Category:

Target version:

% Done:

Source:

Community (dev)

Tags:

Backport:

hammer

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Peering thread hit suicide timeout, and the logs show that the thread was doing (more than 150 seconds) PG::MissingLoc::add_source_info, which should be able to reset the timeout.

Looking at PG::RecoveryState::start_handle, if there is messages to flush (messages_pending_flush = true), it will create a new RecoveryCtx which lose the original thread handle, as a result, the above procedure will not reset the timeout and triggered the crash.

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Guang Yang over 8 years ago

https://github.com/ceph/ceph/pull/5413

Actions

Copy link

Updated by huang jun over 8 years ago

hi,guangyang

how did you see what the thread is doing during timeout? by turn on debug_osd level?
do you used lttng to trace it?

Actions

Copy link

Updated by Guang Yang over 8 years ago

Hi Huang Jun,
I have the issue reproduced with debug_osd = 10, what I observed was, the peering thread stuck at PG::MissingLoc::add_source_info with more than 150 seconds, but that should not happen if the ThreadPool::TPHandle* handle is passed properly, the only chance is, the handle passed in is NULL so that reset_tp_timeout never happened..
huang jun wrote:

hi,guangyang

how did you see what the thread is doing during timeout? by turn on debug_osd level?
do you used lttng to trace it?

Actions

Copy link

Updated by huang jun over 8 years ago

Thanks, GuangYang.
We also encounter this kind of problems, it's osd_op_tp threads timeout, which result in osd suicide.
we test 4k random writes, from the lttng log, the thread last record is:
[22:09:54.901842738] (+0.000001067) lab5107 oprequest:mark_flag_point: { cpu_id = 11 }, { pthread_id = 140316928382720 }, { type = 8, num = 601924, tid = 3439120, inc = 1, rmw_flags = 0x4, flag = 0x10, msg = "waiting for subops from 1", old_hit_flag_points = 0xA, new_ hit_flag_points = 0x1A }

the normal log is:
[22:09:48.003170248] (+0.000020173) lab5107 oprequest:mark_flag_point: { cpu_id = 14 }, { pthread_id = 139644379625216 }, { type = 8,num = 601924, tid = 3601389, inc = 1, rmw_flags = 0x4, flag = 0x10, msg = "waiting for subops from 0", old_hit_flag_points = 0xE, new_ hit_flag_points = 0x1E }
[22:09:48.003270001] (+0.000099753) lab5107 osd:opwq_process_finish: { cpu_id = 14 }, { pthread_id = 139644379625216 }, { type = 8, num = 601924, tid = 3601389, inc = 1 }

it seems the thread is stucked after send op to subosd, and never be schduled.

so, what's your opinion of how to debug this?

Actions

Copy link

Updated by Guang Yang over 8 years ago

That looks like a different issue. 'suicide timeout' is ceph's way to make sure the thread workers are responsive, and many many cases could lead to this crash.

Good news is that it is not too hard to debug:

You will need to enable core dump on the host so when it happens you have a core, since the log always have the thread id which caused suicide and then use gdb to switch to that thread, that mostly give you some clues.

Actions

Copy link

Updated by Guang Yang over 8 years ago

Update: after applying this patch, we managed to bring back the cluster.

Although this does not happen frequently, but when it happens, it causes cascading effect, so that I would recommend this one as a backport candidate.

Actions

Copy link