Bug #21847

OSD frequently being marked down and up

Added by dongdong tao over 6 years ago. Updated over 6 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our Ceph version is 10.2.5.
We have encountered an issue where one of our OSDs has been marked down and up about 3 times an hour,
but the OSD is actually alive, and I can see "wrongly marked me down" in that OSD's log.
I have enabled debug osd = 20; below are log excerpts that I think Ceph developers will find very interesting.
First, we can see the following abnormal messages:
2017-10-18 08:21:03.545028 7f632f262700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6323849700' had timed out after 15
2017-10-18 08:21:03.545031 7f632f262700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f632404a700' had timed out after 15
2017-10-18 08:21:03.545046 7f632f262700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f632484b700' had timed out after 15
2017-10-18 08:21:03.545048 7f632f262700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f632604e700' had timed out after 15
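
For context, the 15-second threshold above appears to match the default osd_op_thread_timeout; once the internal heartbeat_map reports op threads as unhealthy, the OSD stops answering peer heartbeat pings, which would explain why its peers report it down. A rough sketch to tally how often each op thread trips the timeout in a log like this (the script and its names are illustrative, not part of Ceph, and assume the line layout shown above):

#!/usr/bin/env python3
# Rough sketch: count heartbeat_map timeout reports per OSD::osd_op_tp thread
# in a ceph-osd log. Assumes the "had timed out after" line layout shown above.
import re
import sys
from collections import Counter

PATTERN = re.compile(
    r"heartbeat_map is_healthy 'OSD::osd_op_tp thread (0x[0-9a-f]+)' had timed out after (\d+)")

counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            counts[m.group(1)] += 1

for thread, n in counts.most_common():
    print(f"{thread}: {n} timeout reports")

Running it as "python3 count_op_tp_timeouts.py ceph-osd.6.log" (hypothetical file names) shows which threads are stalling and how often.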

Then I followed one of the four timed-out threads above: 0x7f6323849700.
Here comes the log message that I really cannot understand:
2017-10-18 08:20:47.734210 7f6323849700 10 osd.6 2027233 dequeue_op 0x561cdadd9200 prio 127 cost 4194304 latency 3.776214 osd_op(client.54725091.1:109812 78.29dc6487 10008728c6e.00000046 [write 0~4194304 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e2027233) v4 pg pg[78.87( v 2027233'11422036 (2021605'11418989,2027233'11422036] local-les=2027228 n=7704 ec=1983768 les/c/f 2027228/2027228/0 2027225/2027225/2027225) [6,9,7] r=0 lpr=2027225 crt=2027233'11422034 lcod 2027233'11422035 mlcod 2027233'11422035 active+clean]

This message shows that the osd_op_tp thread "7f6323849700" has taken an op from the queue; the next step is to handle that op request message.
However, it took about 20 seconds before this thread started handling the message. Please see the next two log lines below:

2017-10-18 08:21:07.079717 7f6323849700 10 osd.6 pg_epoch: 2027233 pg[78.87( v 2027233'11422036 (2021605'11418989,2027233'11422036] local-les=2027228 n=7704 ec=1983768 les/c/f 2027228/2027228/0 2027225/2027225/2027225) [6,9,7] r=0 lpr=2027225 crt=2027233'11422034 lcod 2027233'11422035 mlcod 2027233'11422035 active+clean] handle_message: 0x561cdadd9200

2017-10-18 08:21:07.079845 7f6323849700 20 osd.6 pg_epoch: 2027233 pg[78.87( v 2027233'11422036 (2021605'11418989,2027233'11422036] local-les=2027228 n=7704 ec=1983768 les/c/f 2027228/2027228/0 2027225/2027225/2027225) [6,9,7] r=0 lpr=2027225 crt=2027233'11422034 lcod 2027233'11422035 mlcod 2027233'11422035 active+clean] op_has_sufficient_caps pool=78 (cephfs_data ) owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes

I really cannot find any code that could possibly cause that much time between "dequeue_op" and "handle_message".
BTW, all the timed-out threads have the same issue.
Any input would be much appreciated, and I would be very happy to provide the whole log.
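
To confirm the gap across all the timed-out threads, not just 0x7f6323849700, one option is to pair each dequeue_op line with the next handle_message line logged by the same thread and print the elapsed time. A minimal sketch, assuming the timestamp/thread-id prefix shown in the excerpts above (the script itself is illustrative, not part of Ceph):

#!/usr/bin/env python3
# Rough sketch: measure the time between "dequeue_op" and the next
# "handle_message" logged by the same thread in a ceph-osd log.
# Assumes the "YYYY-MM-DD HH:MM:SS.ffffff <thread-id>" prefix shown above.
import re
import sys
from datetime import datetime

LINE = re.compile(r"^(\S+ \S+) ([0-9a-f]{8,}) .*\b(dequeue_op|handle_message)\b")

pending = {}  # thread id -> timestamp of its last dequeue_op
with open(sys.argv[1]) as log:
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        thread, event = m.group(2), m.group(3)
        if event == "dequeue_op":
            pending[thread] = stamp
        elif thread in pending:
            gap = (stamp - pending.pop(thread)).total_seconds()
            if gap > 1.0:  # only report suspicious gaps
                print(f"thread {thread}: {gap:.3f}s between dequeue_op and handle_message at {stamp}")

On the excerpts above this would report roughly a 19-second gap for thread 7f6323849700.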

History

#1 Updated by Sage Weil over 6 years ago

  • Status changed from New to Need More Info

Is there anything in 'dmesg' output? Maybe a bad disk?

#2 Updated by dongdong tao over 6 years ago

From dmesg, I found lots of "libceph: osd.6 down" messages.
The disks are all good. If you have never seen this kind of weird log,
then it should not be an OSD issue?

#3 Updated by dongdong tao over 6 years ago

Hi Sage, this same issue happened again. Is there any possible reason for it?
How should I proceed?
