Bug #56577
mds: client request may complete without queueing next replay request
Description
We received a report of a cluster with a single active MDS stuck in up:clientreplay. The status was:
> ceph tell mds.ocs-storagecluster-cephfilesystem:0 status
> {
>     "cluster_fsid": "XXX",
>     "whoami": 0,
>     "id": 19987341,
>     "want_state": "up:clientreplay",
>     "state": "up:clientreplay",
>     "fs_name": "ocs-storagecluster-cephfilesystem",
>     "clientreplay_status": {
>         "clientreplay_queue": 125048,
>         "active_replay": 0
>     },
>     "rank_uptime": 191060.81145907301,
>     "mdsmap_epoch": 8735,
>     "osdmap_epoch": 4421,
>     "osdmap_epoch_barrier": 3296,
>     "uptime": 191061.807527136
> }
The MDS had no outstanding ops or objecter requests. Increasing the debug level did not reveal any client request activity.
It's not clear how this could happen unless the MDS failed to call MDSRank::queue_one_replay during error handling of a request. I believe the most likely place for this is here:
If !(mdr->has_completed || reply->get_result() < 0) then the request is cleaned up without queuing the next request. I don't know a scenario in which that condition may be false in this code path.
I think for now a reasonable fix is to move this to MDCache::request_cleanup, which is generally called for every client request during cleanup of any kind. We do need to preserve the behavior that Server::journal_and_reply may queue the next op even if the current request is not yet safe.
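To illustrate the failure mode and the proposed placement, here is a minimal toy model. It is not Ceph code: ToyRequest, ToyRank, and sample_queue are hypothetical names, and the logic only mirrors the condition quoted above. It assumes one request is replayed at a time and that the next one starts only when the current one queues it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Toy model only. These types are hypothetical and are not Ceph's classes.
struct ToyRequest {
  bool has_completed = false;  // stand-in for mdr->has_completed
  int result = 0;              // stand-in for reply->get_result()
  bool next_queued = false;    // true once this request has queued its successor
};

struct ToyRank {
  std::deque<ToyRequest> replay_queue;
  bool queue_in_cleanup;       // false: original placement, true: proposed placement
  bool next_scheduled = false;
  std::size_t processed = 0;

  ToyRank(std::deque<ToyRequest> q, bool fix)
      : replay_queue(std::move(q)), queue_in_cleanup(fix) {}

  // Stand-in for MDSRank::queue_one_replay. The guard makes it idempotent per
  // request, so an earlier call (for example from a journal_and_reply-like
  // path) is not repeated at cleanup.
  void queue_one_replay(ToyRequest& r) {
    if (r.next_queued) return;
    r.next_queued = true;
    next_scheduled = true;
  }

  // Original placement: the next replay is queued only under this condition.
  void reply(ToyRequest& r) {
    if (!queue_in_cleanup && (r.has_completed || r.result < 0))
      queue_one_replay(r);
  }

  // Proposed placement: cleanup runs for every request, so always queue here.
  void cleanup(ToyRequest& r) {
    if (queue_in_cleanup) queue_one_replay(r);
  }

  // Replay requests one at a time until nothing schedules the next one.
  std::size_t run() {
    next_scheduled = true;  // kick off the first replay
    while (next_scheduled && !replay_queue.empty()) {
      next_scheduled = false;
      ToyRequest r = replay_queue.front();
      replay_queue.pop_front();
      reply(r);
      cleanup(r);
      ++processed;
    }
    return processed;
  }
};

// Three requests; the middle one is neither completed nor failed.
std::deque<ToyRequest> sample_queue() {
  ToyRequest ok;
  ok.has_completed = true;
  ToyRequest odd;
  return {ok, odd, ok};
}
```

With sample_queue(), ToyRank(sample_queue(), false).run() stops after two requests and leaves one queued with nothing active, which resembles the reported status. ToyRank(sample_queue(), true).run() drains all three.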
Updated by Patrick Donnelly 7 months ago
- Category set to Correctness/Safety
- Status changed from In Progress to Fix Under Review
- Target version set to v19.0.0
- Backport changed from quincy,pacific to reef,quincy,pacific
Updated by Patrick Donnelly 6 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 6 months ago
- Copied to Backport #63418: reef: mds: client request may complete without queueing next replay request added
Updated by Backport Bot 6 months ago
- Copied to Backport #63419: pacific: mds: client request may complete without queueing next replay request added
Updated by Backport Bot 6 months ago
- Copied to Backport #63420: quincy: mds: client request may complete without queueing next replay request added