Project

General

Profile

Bug #12523

osd suicide timeout during peering - search for missing objects

Added by Guang Yang over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
07/29/2015
Due date:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Peering thread hit suicide timeout, and the logs show that the thread was doing (more than 150 seconds) PG::MissingLoc::add_source_info, which should be able to reset the timeout.

Looking at PG::RecoveryState::start_handle, if there is messages to flush (messages_pending_flush = true), it will create a new RecoveryCtx which lose the original thread handle, as a result, the above procedure will not reset the timeout and triggered the crash.


Related issues

Related to Ceph - Bug #9128: Newly-restarted OSD may suicide itself after hitting suicide time out value because it may need to search huge amount of objects Resolved 08/15/2014
Copied to Ceph - Backport #12844: osd suicide timeout during peering - search for missing objects Resolved 07/29/2015

History

#2 Updated by huang jun over 3 years ago

hi,guangyang

how did you see what the thread is doing during timeout? by turn on debug_osd level?
do you used lttng to trace it?

#3 Updated by Guang Yang over 3 years ago

Hi Huang Jun,
I have the issue reproduced with debug_osd = 10, what I observed was, the peering thread stuck at PG::MissingLoc::add_source_info with more than 150 seconds, but that should not happen if the ThreadPool::TPHandle* handle is passed properly, the only chance is, the handle passed in is NULL so that reset_tp_timeout never happened..
huang jun wrote:

hi,guangyang

how did you see what the thread is doing during timeout? by turn on debug_osd level?
do you used lttng to trace it?

#4 Updated by huang jun over 3 years ago

Thanks, GuangYang.
We also encounter this kind of problems, it's osd_op_tp threads timeout, which result in osd suicide.
we test 4k random writes, from the lttng log, the thread last record is:
[22:09:54.901842738] (+0.000001067) lab5107 oprequest:mark_flag_point: { cpu_id = 11 }, { pthread_id = 140316928382720 }, { type = 8, num = 601924, tid = 3439120, inc = 1, rmw_flags = 0x4, flag = 0x10, msg = "waiting for subops from 1", old_hit_flag_points = 0xA, new_ hit_flag_points = 0x1A }

the normal log is:
[22:09:48.003170248] (+0.000020173) lab5107 oprequest:mark_flag_point: { cpu_id = 14 }, { pthread_id = 139644379625216 }, { type = 8,num = 601924, tid = 3601389, inc = 1, rmw_flags = 0x4, flag = 0x10, msg = "waiting for subops from 0", old_hit_flag_points = 0xE, new_ hit_flag_points = 0x1E }
[22:09:48.003270001] (+0.000099753) lab5107 osd:opwq_process_finish: { cpu_id = 14 }, { pthread_id = 139644379625216 }, { type = 8, num = 601924, tid = 3601389, inc = 1 }

it seems the thread is stucked after send op to subosd, and never be schduled.

so, what's your opinion of how to debug this?

#5 Updated by Guang Yang over 3 years ago

That looks like a different issue. 'suicide timeout' is ceph's way to make sure the thread workers are responsive, and many many cases could lead to this crash.

Good news is that it is not too hard to debug:

You will need to enable core dump on the host so when it happens you have a core, since the log always have the thread id which caused suicide and then use gdb to switch to that thread, that mostly give you some clues.

#6 Updated by Guang Yang over 3 years ago

Update: after applying this patch, we managed to bring back the cluster.

Although this does not happen frequently, but when it happens, it causes cascading effect, so that I would recommend this one as a backport candidate.

#7 Updated by huang jun over 3 years ago

thank you, i will try it.

#8 Updated by Kefu Chai over 3 years ago

  • Status changed from New to Need Review
  • Backport set to hammer

#9 Updated by Sage Weil over 3 years ago

  • Priority changed from Normal to Urgent
  • Source changed from other to Community (dev)

#10 Updated by Samuel Just over 3 years ago

  • Status changed from Need Review to Pending Backport

#11 Updated by Loic Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF