Bug #56577

open

mds: client request may complete without queueing next replay request

Added by Patrick Donnelly almost 2 years ago. Updated 6 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done:
0%

Source:
Development
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We received a report of a cluster with a single active MDS stuck in up:clientreplay. The status was:

> ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
> {
>     "cluster_fsid": "XXX",
>     "whoami": 0,
>     "id": 19987341,
>     "want_state": "up:clientreplay",
>     "state": "up:clientreplay",
>     "fs_name": "ocs-storagecluster-cephfilesystem",
>     "clientreplay_status": {
>         "clientreplay_queue": 125048,
>         "active_replay": 0
>     },
>     "rank_uptime": 191060.81145907301,
>     "mdsmap_epoch": 8735,
>     "osdmap_epoch": 4421,
>     "osdmap_epoch_barrier": 3296,
>     "uptime": 191061.807527136
> }

The MDS had no outstanding ops or objecter requests. Raising the debug level did not reveal any client request activity.

It's not clear how this could happen unless the MDS failed to call MDSRank::queue_one_replay during error handling of some request. I believe the most likely place for this is here:

https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2253-L2262

If !(mdr->has_completed || reply->get_result() < 0), then the request is cleaned up without queueing the next request. I don't know of a scenario in which that condition may be false in this code path.

I think a reasonable fix for now is to move this call to MDCache::request_cleanup, which is generally called for every client request during cleanup of any kind. We do need to preserve the existing behavior in which Server::journal_and_reply may queue the next op even if the current request is not yet safe.
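
The stall and the proposed fix can be illustrated with a toy model. This is only a sketch, not the MDS code: ToyRequest, ToyRank, run_replay and the queue_in_cleanup flag are hypothetical names, the queue is drained by direct recursion rather than by queued contexts, and the second request is an assumed case where has_completed is false and the result is non-negative (no real scenario producing that case has been identified).

```cpp
#include <cstddef>
#include <deque>

// Toy stand-in for a replayed client request (hypothetical, not MDRequestImpl).
struct ToyRequest {
  bool has_completed;  // loosely mirrors mdr->has_completed
  int result;          // loosely mirrors reply->get_result()
};

// Toy stand-in for the rank's client replay queue (hypothetical, not MDSRank).
struct ToyRank {
  std::deque<ToyRequest> replay_queue;
  bool queue_in_cleanup = false;  // false: queue in reply path; true: queue in cleanup

  // Start replaying the next queued request, if there is one.
  void queue_one_replay() {
    if (replay_queue.empty()) return;
    ToyRequest req = replay_queue.front();
    replay_queue.pop_front();
    reply(req);
  }

  // Reply path: only queues the next replay when the condition holds.
  void reply(const ToyRequest& req) {
    if (!queue_in_cleanup && (req.has_completed || req.result < 0)) {
      queue_one_replay();
    }
    cleanup(req);
  }

  // Cleanup path: runs for every request, so queueing here cannot be skipped.
  void cleanup(const ToyRequest&) {
    if (queue_in_cleanup) queue_one_replay();
  }
};

// Replays three requests and returns how many are left stuck in the queue.
std::size_t run_replay(bool queue_in_cleanup) {
  ToyRank rank;
  rank.queue_in_cleanup = queue_in_cleanup;
  rank.replay_queue.push_back(ToyRequest{true, 0});
  rank.replay_queue.push_back(ToyRequest{false, 0});  // assumed problematic case
  rank.replay_queue.push_back(ToyRequest{true, 0});
  rank.queue_one_replay();
  return rank.replay_queue.size();
}
```

In this model, run_replay(false) leaves one request in the queue with nothing active, which resembles the reported clientreplay_queue > 0 with active_replay 0, while run_replay(true) drains the queue. The model deliberately omits the Server::journal_and_reply case, where the next op may be queued before the current request is safe.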


Related issues 3 (1 open, 2 closed)

Copied to CephFS - Backport #63418: reef: mds: client request may complete without queueing next replay request (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #63419: pacific: mds: client request may complete without queueing next replay request (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #63420: quincy: mds: client request may complete without queueing next replay request (In Progress, Patrick Donnelly)

#1

Updated by Patrick Donnelly almost 2 years ago

  • Pull request ID set to 47121
#2

Updated by Patrick Donnelly 8 months ago

  • Target version deleted (v18.0.0)
#3

Updated by Patrick Donnelly 7 months ago

  • Category set to Correctness/Safety
  • Status changed from In Progress to Fix Under Review
  • Target version set to v19.0.0
  • Backport changed from quincy,pacific to reef,quincy,pacific
#4

Updated by Patrick Donnelly 6 months ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Backport Bot 6 months ago

  • Copied to Backport #63418: reef: mds: client request may complete without queueing next replay request added
#6

Updated by Backport Bot 6 months ago

  • Copied to Backport #63419: pacific: mds: client request may complete without queueing next replay request added
#7

Updated by Backport Bot 6 months ago

  • Copied to Backport #63420: quincy: mds: client request may complete without queueing next replay request added
#8

Updated by Backport Bot 6 months ago

  • Tags set to backport_processed