Bug #14523

closed

OSD slow request after TCP connection fault

Added by zcc icy about 8 years ago. Updated about 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the following log lines appear, the OSD starts reporting slow requests after 30 seconds:

2016-01-27 13:33:12.321607 7fedb5c66700 0 -- 172.18.129.60:6801/14172 >> 172.18.129.121:6801/49670 pipe(0x7fee07d6f000 sd=158 :6801 s=0 pgs=0 cs=0 l=0 c=0x7fee033418c0).accept connect_seq 87 vs existing 87 state standby
2016-01-27 13:33:12.321918 7fedb5c66700 0 -- 172.18.129.60:6801/14172 >> 172.18.129.121:6801/49670 pipe(0x7fee07d6f000 sd=158 :6801 s=0 pgs=0 cs=0 l=0 c=0x7fee033418c0).accept connect_seq 88 vs existing 87 state standby
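The "accept connect_seq X vs existing Y state standby" exchange above can be sketched roughly as follows. This is an illustrative Python sketch, not Ceph source: the function and return-value names are hypothetical, and it only models the seq comparison that the old SimpleMessenger accept path logs here.

```python
def accept_decision(incoming_cs, existing_cs, existing_state):
    """Sketch of the accept-side connect_seq comparison behind the
    'accept connect_seq X vs existing Y state standby' log line."""
    if existing_state == "standby" and incoming_cs == existing_cs:
        # Peer reconnects with the same seq; the server asks it to retry,
        # which is why the log shows 87 vs 87 followed by 88 vs 87.
        return "retry_session"
    if existing_state == "standby" and incoming_cs == existing_cs + 1:
        # Bumped seq is accepted and the standby pipe is replaced; any
        # messages still queued on the old pipe must be requeued, or the
        # replica op is lost and the primary waits on subops (slow request).
        return "replace"
    return "other"
```

With the values from the log above, `accept_decision(87, 87, "standby")` yields the retry path and `accept_decision(88, 87, "standby")` the replace path.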


Files

1$$NI0U~{)BHT4R__JZJZHD.png (428 KB) — connection exception happened, then OSD has slow requests. zcc icy, 01/27/2016 03:02 PM
VBL@JO7PAW0UA`RMKI8PRWQ.png (32.1 KB) — found that block 3 OSD is inconsistent. zcc icy, 01/27/2016 03:02 PM
Actions #1

Updated by zcc icy about 8 years ago

172.18.129.108 Log (osd 35) :
2016-01-27 13:26:00.172881 7f064aacf700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.112:6804/70932 pipe(0x7f0697abe000 sd=379 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8b600).accept connect_seq 43 vs existing 43 state standby
2016-01-27 13:26:00.173198 7f064aacf700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.112:6804/70932 pipe(0x7f0697abe000 sd=379 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8b600).accept connect_seq 44 vs existing 43 state standby
2016-01-27 13:27:44.533344 7f064496e700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.60:6801/14172 pipe(0x7f06958f1000 sd=182 :6804 s=2 pgs=856 cs=93 l=0 c=0x7f0695750100).fault with nothing to send, going to standby
2016-01-27 13:27:54.258219 7f064f91d700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.113:6804/68627 pipe(0x7f0695411000 sd=101 :26626 s=2 pgs=957 cs=27 l=0 c=0x7f0695340940).fault with nothing to send, going to standby
2016-01-27 13:28:03.837085 7f064d5fa700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.60:6801/14172 pipe(0x7f06a33f8000 sd=160 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8a160).accept connect_seq 93 vs existing 93 state standby
2016-01-27 13:28:03.837490 7f064d5fa700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.60:6801/14172 pipe(0x7f06a33f8000 sd=160 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8a160).accept connect_seq 94 vs existing 93 state standby
2016-01-27 13:28:03.932362 7f0646489700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.71:6801/15373 pipe(0x7f06958f1000 sd=120 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8adc0).accept connect_seq 71 vs existing 71 state standby
2016-01-27 13:28:03.933072 7f0646489700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.71:6801/15373 pipe(0x7f06958f1000 sd=120 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8adc0).accept connect_seq 72 vs existing 71 state standby
2016-01-27 13:33:14.109002 7f064d7fc700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.22:6802/8151 pipe(0x7f06a08a0000 sd=255 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8c680).accept connect_seq 49 vs existing 49 state standby
2016-01-27 13:33:14.109582 7f064d7fc700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.22:6802/8151 pipe(0x7f06a08a0000 sd=255 :6804 s=0 pgs=0 cs=0 l=0 c=0x7f06a0c8c680).accept connect_seq 50 vs existing 49 state standby
2016-01-27 13:36:14.442609 7f063c7ed700 0 -- 172.18.129.108:6804/141884 >> 172.18.129.106:6804/15522 pipe(0x7f06a3528000 sd=227 :6804 s=2 pgs=661 cs=67 l=0 c=0x7f069533e6e0).fault with nothing to send, going to standby

172.18.129.121 Log (osd 3):
2016-01-27 13:33:42.529495 7fdeab3d2700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.232265 secs
2016-01-27 13:33:42.529508 7fdeab3d2700 0 log_channel(cluster) log [WRN] : slow request 30.232265 seconds old, received at 2016-01-27 13:33:12.297189: osd_op(client.6602.0:157984 rbd_data.17285739f6be.000000000000013d [writefull 0~4194304] 6.155c0963 ack+ondisk+write+known_if_redirected e263) currently waiting for subops from 35,47
2016-01-27 13:34:12.533428 7fdeab3d2700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 60.236198 secs
2016-01-27 13:34:12.533437 7fdeab3d2700 0 log_channel(cluster) log [WRN] : slow request 60.236198 seconds old, received at 2016-01-27 13:33:12.297189: osd_op(client.6602.0:157984 rbd_data.17285739f6be.000000000000013d [writefull 0~4194304] 6.155c0963 ack+ondisk+write+known_if_redirected e263) currently waiting for subops from 35,47
2016-01-27 13:35:12.541231 7fdeab3d2700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 120.244001 secs
2016-01-27 13:35:12.541242 7fdeab3d2700 0 log_channel(cluster) log [WRN] : slow request 120.244001 seconds old, received at 2016-01-27 13:33:12.297189: osd_op(client.6602.0:157984 rbd_data.17285739f6be.000000000000013d [writefull 0~4194304] 6.155c0963 ack+ondisk+write+known_if_redirected e263) currently waiting for subops from 35,47
2016-01-27 13:37:12.556494 7fdeab3d2700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 240.259266 secs
2016-01-27 13:37:12.556503 7fdeab3d2700 0 log_channel(cluster) log [WRN] : slow request 240.259266 seconds old, received at 2016-01-27 13:33:12.297189: osd_op(client.6602.0:157984 rbd_data.17285739f6be.000000000000013d [writefull 0~4194304] 6.155c0963 ack+ondisk+write+known_if_redirected e263) currently waiting for subops from 35,47

172.18.129.60 Log (osd 47) :
2016-01-27 13:33:12.321607 7fedb5c66700 0 -- 172.18.129.60:6801/14172 >> 172.18.129.121:6801/49670 pipe(0x7fee07d6f000 sd=158 :6801 s=0 pgs=0 cs=0 l=0 c=0x7fee033418c0).accept connect_seq 87 vs existing 87 state standby
2016-01-27 13:33:12.321918 7fedb5c66700 0 -- 172.18.129.60:6801/14172 >> 172.18.129.121:6801/49670 pipe(0x7fee07d6f000 sd=158 :6801 s=0 pgs=0 cs=0 l=0 c=0x7fee033418c0).accept connect_seq 88 vs existing 87 state standby
2016-01-27 13:37:56.928762 7fedb525c700 0 -- 172.18.129.60:6801/14172 >> 172.18.129.129:6804/146489 pipe(0x7fee0b49f000 sd=270 :6801 s=2 pgs=1048 cs=97 l=0 c=0x7fedfe9a7860).fault with nothing to send, going to standby
2016-01-27 13:38:28.420259 7fedb8a94700 0 -- 172.18.129.60:6801/14172 >> 172.18.129.63:6802/45250 pipe(0x7fedfe67d000 sd=53 :58933 s=2 pgs=909 cs=25 l=0 c=0x7fedfe698580).fault with nothing to send, going to standby
2016-01-27 13:39:08.570648 7feda404b700 0 -- 172.18.129.60:6801/14172 >> 172.18.129.113:6804/68627 pipe(0x7fee048b7000 sd=232 :6801 s=2 pgs=941 cs=51 l=0 c=0x7fedfe9a5e40).fault with nothing to send, going to standby

Actions #2

Updated by zcc icy about 8 years ago

The detailed log is above.

Actions #3

Updated by zcc icy about 8 years ago

After this log line appears:

172.18.129.60:6801/14172 >> 172.18.129.121:6801/49670 pipe(0x7fee07d6f000 sd=158 :6801 s=0 pgs=0 cs=0 l=0 c=0x7fee033418c0).accept connect_seq 87 vs existing 87 state standby

the OSD starts reporting slow requests.

After 900 s the slow requests disappear; we found that this interval (900 s) is governed by `mon osd report timeout`.
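For reference, 900 seconds matches the default of `mon osd report timeout` in Ceph releases of this era, which is how long the monitors wait without reports before acting on an unresponsive OSD. An illustrative ceph.conf fragment showing the knob (shown only to identify the setting, not as a tuning recommendation):

```ini
[mon]
# Seconds the monitors wait without hearing from an OSD before
# declaring it down. Default: 900.
mon osd report timeout = 900
```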

Actions #4

Updated by Samuel Just about 8 years ago

  • Status changed from New to Can't reproduce

It sounds like you had a network hiccup that lasted 900s between those osds. I think the mon_osd_report_timeout part is a coincidence. Feel free to reopen if you can reproduce with

debug osd = 20
debug ms = 20
debug filestore = 20

on both osds.
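If it recurs, the debug levels above can also be raised at runtime without restarting the daemons, e.g. (assuming admin access on the cluster; substitute the OSD IDs involved):

```
# Raise debug levels on the running OSDs named in the slow-request log
ceph tell osd.35 injectargs '--debug-osd 20 --debug-ms 20 --debug-filestore 20'
ceph tell osd.47 injectargs '--debug-osd 20 --debug-ms 20 --debug-filestore 20'
```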
