Project

General

Profile

Bug #23205

Blocked requests no longer show details

Added by Alex Gorbachev over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
03/02/2018
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
mimic
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:

Description

In Luminous, blocked requests no longer show details, i.e. OSD id or hints as to the operation:

018-03-01 21:52:54.237270 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check failed: 116 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:00.282721 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check update: 66 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:08.534244 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check update: 5 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:10.382510 7f7e419e3700 0 log_channel(cluster) log
[INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
blocked > 32 sec)
2018-03-01 21:53:10.382546 7f7e419e3700 0 log_channel(cluster) log
[INF] : Cluster is now healthy


Related issues

Related to RADOS - Feature #23236: should allow osd to dump slow ops Rejected 03/06/2018

History

#1 Updated by Alex Gorbachev over 1 year ago

Further info: these occur when journaling, object-map and exclusive-lock are set on the rbd image

#2 Updated by Alex Gorbachev over 1 year ago

Further note: the slowdowns and blocked requests start occurring after about 2TB of I/O on the image. Specifically, we are doing VMotions (ESXi) over NFS-delivered images. Anything under 2TB does not generate blocked requests. Anything over 2TB does. Hope this leads to a better path identifying the issues.

#3 Updated by John Spray over 1 year ago

The details are there, but they're no longer dumped to the cluster log by default. The reason is that if you have 1000 OSDs and many of them have multiple slow requests, it results in a very large number of repetitive messages to the log.

You can go to a server running osds and do "ceph daemon osd.<id> dump_blocked_ops" to see what's stuck on a particular OSD.

#4 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to mgr

#5 Updated by Alex Gorbachev over 1 year ago

John, thank you for answering. Was it not the reason for "last X slow requests" in case this helps at least lead in the right direction troubleshooting? With 1000s of OSDs is the troubleshooter expected to query all of them on all servers?

#6 Updated by John Spray over 1 year ago

The "last X slow requests" was emitted separately by each OSD, so even though it was a summary, you'd still get a flood of cluster log messages from a cluster with lots of OSDs.

If you do have 1000s of OSDs with slow requests, and want to see a list of 10000s of requests, then you do need to go through and query them yourself, yes. However, in practice I suspect that most people are either seeing slow requests on particular OSDs (in which case they investigate those OSDs in detail), or widespread slow requests (in which case the details are probably less important than whatever system-wide issue is going on).

None of which means that we can't add in some extra tooling in the future, it just won't look like sending all the detail to the cluster log like we used to.

#7 Updated by Alex Gorbachev over 1 year ago

Got it, thanks John. I think you are exactly right that most people either see problems coming from one or a couple OSDs or are having a network issue and slow requests are all over the place.

We in the past had problems with a) XFS bug that crashed filesystems on OSDs, b) Hardware problems on OSDs themselves, or c) network issues where it was clear the problem was not specific to an OSD.

So the challenge it seems is: when blocked/slow requests show up, what is the best tool to quickly identify the number and ID of the OSDs involved. Maybe even the OSD host where the problems are occurring, so there is not a wild goose chase over all of them.

I see Brad is taking a look at this and realy appreciate all your work.

#8 Updated by Kefu Chai over 1 year ago

#9 Updated by Brad Hubbard over 1 year ago

https://tracker.ceph.com/issues/23236 was rejected.

"ceph health" should list the osds with slow ops and dump_blocked_ops and/or dump_ops_in_flight run against those osds should give details.

#10 Updated by Alex Gorbachev over 1 year ago

the one big advantage of having the few slowest OSDs listed in ceph.log or MON log was the ability to go back to troubleshoot a previous slow condition, which is no longer present at the time of troubleshooting.

#11 Updated by Alex Gorbachev over 1 year ago

in other words, what is the best way to troubleshoot the following situation:

2018-03-11 22:00:00.000132 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK
2018-03-11 22:44:46.173825 mon.roc-vm-sc3c234 [WRN] Health check failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:52.245738 mon.roc-vm-sc3c234 [WRN] Health check update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:57.925686 mon.roc-vm-sc3c234 [WRN] Health check update: 10 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:02.926031 mon.roc-vm-sc3c234 [WRN] Health check update: 14 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:06.413741 mon.roc-vm-sc3c234 [INF] Health check cleared: REQUEST_SLOW (was: 14 slow requests are blocked > 32 sec)
2018-03-11 22:45:06.413814 mon.roc-vm-sc3c234 [INF] Cluster is now healthy
2018-03-11 23:00:00.000136 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK

What happened at 22:44 and where to look? Previously, there was a reference to OSD IDs, which would lead to some hints at the source of the problem. Thanks.

#12 Updated by Alex Gorbachev over 1 year ago

We continue to get these with no clues as to where to go next:

2018-03-25 06:58:39.693404 7fa0e9ee2700 0 log_channel(cluster) log [WRN] : Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-25 06:58:45.035274 7fa0e9ee2700 0 log_channel(cluster) log [WRN] : Health check update: 126 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-25 06:58:50.035626 7fa0e9ee2700 0 log_channel(cluster) log [WRN] : Health check update: 49 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-25 06:58:55.035968 7fa0e9ee2700 0 log_channel(cluster) log [WRN] : Health check update: 15 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-25 06:58:56.807108 7fa0e76dd700 0 mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix": "df", "format": "json"} v 0) v1
2018-03-25 06:58:56.807167 7fa0e76dd700 0 log_channel(audit) log [DBG] : from='client.? 10.80.3.234:0/2414473661' entity='client.admin' cmd=[{"prefix": "df", "format": "json"}]: dispatch
2018-03-25 06:58:57.071196 7fa0e76dd700 0 mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix": "osd dump", "format": "json"} v 0) v1
2018-03-25 06:58:57.071253 7fa0e76dd700 0 log_channel(audit) log [DBG] : from='client.? 10.80.3.234:0/2147735519' entity='client.admin' cmd=[{"prefix": "osd dump", "format": "json"}]: dispatch
2018-03-25 06:58:57.926624 7fa0e9ee2700 0 log_channel(cluster) log [INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are blocked > 32 sec)
2018-03-25 06:58:57.926668 7fa0e9ee2700 0 log_channel(cluster) log [INF] : Cluster is now healthy
/blocked

#13 Updated by Tobias Rehn over 1 year ago

I have the same issue. We are constantly having some slow requests every day and there is no way to debug. The old way for jewel does not work anymore for luminous.

How can we debug slow requests?

#14 Updated by Sage Weil over 1 year ago

  • Priority changed from Normal to High

Bumping priority on this; I think it's something we need to fix. The health warning detail would have this information, but you can only see the details on currently active health events, not past events.

#15 Updated by Josh Durgin over 1 year ago

  • Assignee set to Brad Hubbard

Brad is working on this for luminous - the osds on which slow ops appear are present again already in master.

#16 Updated by Brad Hubbard over 1 year ago

https://github.com/ceph/ceph/pull/21750 is the fix for master.

I am trying to chase down a race in the fix for luminous and hope to get that posted very soon.

#17 Updated by Brad Hubbard over 1 year ago

  • Status changed from New to In Progress

#18 Updated by John Spray over 1 year ago

  • Status changed from In Progress to Pending Backport
  • Backport set to mimic

Since we branched mimic and the PR was to master, we probably need a backport of https://github.com/ceph/ceph/pull/21750

#19 Updated by Yuri Weinstein over 1 year ago

Brad Hubbard wrote:

Luminous https://github.com/ceph/ceph/pull/21760

merged

#20 Updated by Nathan Cutler over 1 year ago

John Spray wrote:

Since we branched mimic and the PR was to master, we probably need a backport of https://github.com/ceph/ceph/pull/21750

It's already been cherry-picked to mimic, but I couldn't find a corresponding PR:

smithfarm@wilbur:~/src/ceph/smithfarm/ceph> git checkout mimic
Switched to branch 'mimic'
Your branch is up-to-date with 'ceph/mimic'.
smithfarm@wilbur:~/src/ceph/smithfarm/ceph> git log --grep b5263176
commit 303e71fe074bb051421119d44f9559ceb929efe3
Author: Brad Hubbard <bhubbard@redhat.com>
Date:   Tue May 1 11:15:28 2018 +1000

    mgr: Include daemon details in SLOW_OPS output

    Currently there is no way to see which daemons were involved in a slow
    op after the op has cleared. This change allows us to record which
    daemons were implicated in the logs.

    Partially fixes: http://tracker.ceph.com/issues/23205

    Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
    (cherry picked from commit b5263176dea6d204c2caa52043cefc94dd5e475f)

#21 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF