Bug #4348

closed

OSD slow request leads to RBD clients stalled/delayed

Added by Ivan Kudryavtsev about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi. I have the following problem, and I believe it can be reproduced. The log of one of my OSDs is full of messages like:

2013-03-05 06:55:07.146109 7fd3a7abf700 0 log [WRN] : slow request 6335.099822 seconds old, received at 2013-03-05 05:09:32.046218: osd_op(client.11689.1:97589574 rb.0.2cd1.238e1f29.000000000c7e [write 1454080~4096] 2.28656d03) currently reached pg
2013-03-05 06:55:07.146111 7fd3a7abf700 0 log [WRN] : slow request 6335.099794 seconds old, received at 2013-03-05 05:09:32.046246: osd_op(client.11689.1:97589575 rb.0.2cd1.238e1f29.000000000c7e [write 1507328~4096] 2.28656d03) currently reached pg
2013-03-05 06:55:07.146113 7fd3a7abf700 0 log [WRN] : slow request 6335.099767 seconds old, received at 2013-03-05 05:09:32.046273: osd_op(client.11689.1:97589576 rb.0.2cd1.238e1f29.000000000c7e [write 1609728~4096] 2.28656d03) currently reached pg
2013-03-05 06:55:07.146114 7fd3a7abf700 0 log [WRN] : slow request 6335.099068 seconds old, received at 2013-03-05 05:09:32.046972: osd_op(client.11689.1:97589577 rb.0.2cd1.238e1f29.000000000c7e [write 1675264~4096] 2.28656d03) currently reached pg

However, this OSD is on the same host as several other OSDs and has the same configuration as they do, and no other OSD logs such messages, so I suppose the problem is a bug in the OSD itself.

I use the kernel RBD client.

The client, in turn, logs a series of messages like:
[643232.462554] libceph: tid 97589149 timed out on osd22, will reset osd

As a result, all VMs (presumably those that tried to write to that OSD) stall, because the host is unable to kick the OSD out and keeps trying to work with it. After an OSD restart the problem goes away. Moreover, rbd map/unmap do not work correctly for the stalled /dev/rbd* devices.
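To see which OSD the clients are stuck on, the kernel messages quoted above can be tallied per OSD. This is only a rough sketch: the regular expression is based on the single libceph line shown in this report, and `count_timeouts` is a hypothetical helper name, not part of any Ceph tool.

```python
import re
from collections import Counter

# Pattern derived from the one libceph line quoted in this report;
# other kernel versions may word the message differently.
TIMEOUT_RE = re.compile(r"libceph: tid (\d+) timed out on osd(\d+)")


def count_timeouts(lines):
    """Return a Counter mapping OSD id -> number of timed-out requests."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts
```

Feeding it the client's kernel log (for example the output of dmesg) gives a per-OSD count; an OSD that dominates the count is the likely culprit.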

Admittedly, I see no way to tell that an OSD is in this state other than from its log: it is clearly misbehaving once it produces a lot of such messages.

As a workaround, I plan to pipe the OSD log through an analyzer and restart the OSD when the problem is detected, until the bug is fixed.
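A minimal sketch of such an analyzer, assuming the "slow request" line format quoted above. The function names and the thresholds (`max_age`, `min_hits`) are my own placeholders and would need tuning; the actual restart command is deliberately left out, since it depends on how the OSDs are managed on the host.

```python
import re
from typing import Iterable, Optional

# Matches the "slow request N seconds old" warnings quoted in this report.
SLOW_RE = re.compile(r"slow request (\d+(?:\.\d+)?) seconds old")


def slow_request_age(line: str) -> Optional[float]:
    """Return the age in seconds of a slow request line, or None if no match."""
    match = SLOW_RE.search(line)
    return float(match.group(1)) if match else None


def should_restart(lines: Iterable[str],
                   max_age: float = 600.0,
                   min_hits: int = 3) -> bool:
    """Return True once min_hits lines report a request at least max_age old.

    Both thresholds are assumptions for illustration, not recommended values.
    """
    hits = 0
    for line in lines:
        age = slow_request_age(line)
        if age is not None and age >= max_age:
            hits += 1
            if hits >= min_hits:
                return True
    return False
```

The OSD log can be fed to `should_restart` line by line (for example from a pipe), and the restart triggered by whatever wraps the script when it returns True.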
