Bug #56577

mds: client request may complete without queueing next replay request

Added by Patrick Donnelly 5 months ago. Updated 5 months ago.

Status: In Progress
Priority: Normal
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We received a report of a cluster with a single active MDS stuck in up:clientreplay. The status was:

> ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
> {
>     "cluster_fsid": "XXX",
>     "whoami": 0,
>     "id": 19987341,
>     "want_state": "up:clientreplay",
>     "state": "up:clientreplay",
>     "fs_name": "ocs-storagecluster-cephfilesystem",
>     "clientreplay_status": {
>         "clientreplay_queue": 125048,
>         "active_replay": 0
>     },
>     "rank_uptime": 191060.81145907301,
>     "mdsmap_epoch": 8735,
>     "osdmap_epoch": 4421,
>     "osdmap_epoch_barrier": 3296,
>     "uptime": 191061.807527136
> }

The MDS had no outstanding ops or objecter requests. Raising the debug level did not show any client request activity.

It's not clear how this could happen unless the MDS failed to call MDSRank::queue_one_replay during error handling of some request. I believe the most likely place for this is here:

https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2253-L2262

If (mdr->has_completed || reply->get_result() < 0) is false, then the request is cleaned up without queueing the next replay request. I don't know of a scenario in which that condition would be false in this code path.
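The suspected stall can be illustrated with a toy model. This is only a sketch under assumptions, not Ceph code: ToyRequest, ToyReplayMds, and stuck_after_replay are hypothetical names, and replay is modeled as a synchronous chain in which finishing one request is the only thing that starts the next.

```cpp
#include <cstddef>
#include <deque>

// Toy model only: hypothetical types, not Ceph's classes.
struct ToyRequest {
    bool has_completed;  // stands in for mdr->has_completed
    int result;          // stands in for reply->get_result()
};

struct ToyReplayMds {
    std::deque<ToyRequest> replay_queue;  // requests still waiting to be replayed
    std::size_t processed = 0;            // requests that were started

    // Start the next queued request, if any. Returns false when the queue is empty.
    bool queue_one_replay() {
        if (replay_queue.empty()) {
            return false;
        }
        ToyRequest req = replay_queue.front();
        replay_queue.pop_front();
        reply(req);
        return true;
    }

    // Models the suspected reply path: the next request is queued only
    // when the condition holds.
    void reply(const ToyRequest& req) {
        ++processed;
        if (req.has_completed || req.result < 0) {
            queue_one_replay();
        }
        // Otherwise the request is cleaned up and nothing drives the queue forward.
    }
};

// Replays n requests where the request at bad_index fails the condition.
// Returns how many requests are left stuck in the queue.
std::size_t stuck_after_replay(std::size_t n, std::size_t bad_index) {
    ToyReplayMds mds;
    for (std::size_t i = 0; i < n; ++i) {
        mds.replay_queue.push_back(ToyRequest{i != bad_index, 0});
    }
    mds.queue_one_replay();
    return mds.replay_queue.size();
}
```

In this model a single request that fails the condition leaves every later request in the queue with nothing active, which resembles the reported status (a large clientreplay_queue with active_replay at 0).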

I think a reasonable fix for now is to move this call to MDCache::request_cleanup, which is generally called for every client request during cleanup of any kind. We do need to preserve the existing behavior that Server::journal_and_reply may queue the next op even if the current request is not yet safe.
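A minimal sketch of the proposed direction, again as a toy model rather than Ceph code. All names here (ToyRequest, ToyFixedMds, queue_next_once, drain_with_cleanup_fix) are hypothetical, and the per-request queued_next flag is an assumption added to show one way the early queueing in the journal path and the unconditional queueing in cleanup could coexist without queueing twice.

```cpp
#include <cstddef>
#include <deque>

// Toy model only: hypothetical types, not Ceph's classes.
struct ToyRequest {
    bool early_reply;  // journal path wants to queue the next op before this one is safe
    bool queued_next;  // assumed guard: has this request already queued its successor?
};

struct ToyFixedMds {
    std::deque<ToyRequest> replay_queue;
    std::size_t queue_calls = 0;  // how many times "queue the next op" actually fired

    // Queue the successor at most once per request.
    void queue_next_once(ToyRequest& req) {
        if (req.queued_next) {
            return;
        }
        req.queued_next = true;
        ++queue_calls;
    }

    // Stands in for Server::journal_and_reply: may queue early.
    void journal_and_reply(ToyRequest& req) {
        if (req.early_reply) {
            queue_next_once(req);
        }
    }

    // Stands in for MDCache::request_cleanup: always ensures the successor is queued.
    void request_cleanup(ToyRequest& req) {
        queue_next_once(req);
    }

    // Drive the queue: the next request starts only if the current one queued it.
    void run() {
        while (!replay_queue.empty()) {
            ToyRequest req = replay_queue.front();
            replay_queue.pop_front();
            journal_and_reply(req);
            request_cleanup(req);
            if (!req.queued_next) {
                break;  // would be a stall; unreachable because cleanup always queues
            }
        }
    }
};

struct DrainResult {
    std::size_t remaining;    // requests left in the queue
    std::size_t queue_calls;  // successor-queue events that fired
};

// Replays n requests, alternating between early-reply and normal requests.
DrainResult drain_with_cleanup_fix(std::size_t n) {
    ToyFixedMds mds;
    for (std::size_t i = 0; i < n; ++i) {
        mds.replay_queue.push_back(ToyRequest{i % 2 == 0, false});
    }
    mds.run();
    return DrainResult{mds.replay_queue.size(), mds.queue_calls};
}
```

With cleanup as the single guaranteed queueing point, the toy queue always drains, and the guard keeps requests that already queued their successor early from doing so a second time.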

History

#1 Updated by Patrick Donnelly 5 months ago

  • Pull request ID set to 47121
