Bug #40034

mds: stuck in clientreplay

Added by Nathan Fish almost 5 years ago. Updated over 4 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): Ganesha FSAL, MDS
Labels (FS): NFS-cluster
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I came in on Monday morning, our cluster's CephFS was stuck in clientreplay, and the NFS mount through nfs-ganesha hung:

root@m3-3101-422:~# ceph status
  cluster:
    id:     7c2de3c5-2476-45e2-ac46-b4ca19eeacb5
    health: HEALTH_WARN
            1 filesystem is degraded
            too few PGs per OSD (1 < min 30)

  services:
    mon: 3 daemons, quorum dc-3558-422,mc-3015-422,m3-3101-422 (age 2d)
    mgr: m3-3101-422(active, since 2d), standbys: mc-3015-422, dc-3558-422
    mds: cephfs_cscf-home:1/1 {0=m3-3101-422-A=up:clientreplay} 2 up:standby
    osd: 57 osds: 57 up, 57 in

  data:
    pools:   2 pools, 32 pgs
    objects: 23 objects, 198 KiB
    usage:   1.7 TiB used, 589 TiB / 591 TiB avail
    pgs:     32 active+clean
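
For reference, a sketch (not from the original session) of the kind of commands that can show what an MDS stuck in clientreplay is waiting on; the daemon name is the rank 0 holder from the status output above:

# Hypothetical diagnostic sketch, not commands run during the incident:
ceph fs status cephfs_cscf-home            # per-rank state and standby daemons
ceph daemon mds.m3-3101-422-A ops          # in-flight operations on the stuck MDS (run on its host)
ceph daemon mds.m3-3101-422-A session ls   # client sessions the MDS is waiting to replay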

Ceph Nautilus 14.2.1, with multiple filesystems (enable_multiple) enabled, but only one fs created so far.
The fs (cephfs_cscf-home) has allow_standby_replay = true. There was only one client: nfs-ganesha 2.7, compiled from deb-src against Nautilus because I needed multi-fs support. This client was idle over the weekend. Multi-MDS was not enabled. I injected debug flags into all MDS daemons, then restarted nfs-ganesha, which cleared up the problem. Since I only enabled debugging after the hang, I unfortunately don't have logs from that part.
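
A sketch of the configuration and debugging steps described above, assuming the standard Nautilus CLI (reconstructed for illustration, not copied from the original shell history):

# Enable multiple filesystems and standby-replay for the fs:
ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph fs set cephfs_cscf-home allow_standby_replay true

# Raise debug levels on all MDS daemons, then restart the Ganesha client
# (service name may differ depending on how nfs-ganesha was packaged):
ceph tell mds.* injectargs '--debug_mds 20 --debug_ms 1'
systemctl restart nfs-ganesha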

Log from the stuck MDS:
https://termbin.com/z2ad

The last part of the log repeats up to the present; it seems this daemon was the standby-replay one:
root@mc-3015-422:~# head -200 /var/log/ceph/ceph-mds.mc-3015-422-A.log | nc termbin.com 9999
https://termbin.com/1m6o

The third MDS remained in standby the whole time.
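
A quick way to confirm which daemon holds the standby-replay role versus plain standby (a sketch, not output captured during the incident):

# Hypothetical check of MDS roles:
ceph fs status cephfs_cscf-home   # lists active, standby-replay and standby daemons
ceph fs dump | grep -i standby    # raw MDSMap view of the standby entries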
