Bug #40034: mds: stuck in clientreplay - CephFS - Ceph

Added by Nathan Fish almost 5 years ago. Updated over 4 years ago.

Status: Need More Info
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source: Community (user)
Tags:
Backport: nautilus
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions: Ceph - v14.2.1
ceph-qa-suite:
Component(FS): Ganesha FSAL, MDS
Labels (FS): NFS-cluster
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I came in on Monday morning, our cluster's CephFS was stuck in clientreplay, and the NFS mount through nfs-ganesha hung:

root@m3-3101-422:~# ceph status
  cluster:
    id:     7c2de3c5-2476-45e2-ac46-b4ca19eeacb5
    health: HEALTH_WARN
            1 filesystem is degraded
            too few PGs per OSD (1 < min 30)

  services:
    mon: 3 daemons, quorum dc-3558-422,mc-3015-422,m3-3101-422 (age 2d)
    mgr: m3-3101-422(active, since 2d), standbys: mc-3015-422, dc-3558-422
    mds: cephfs_cscf-home:1/1 {0=m3-3101-422-A=up:clientreplay} 2 up:standby
    osd: 57 osds: 57 up, 57 in

  data:
    pools:   2 pools, 32 pgs
    objects: 23 objects, 198 KiB
    usage:   1.7 TiB used, 589 TiB / 591 TiB avail
    pgs:     32 active+clean

Ceph Nautilus 14.2.1, multi-fs (enable_multiple) enabled, but only one fs created so far.
The fs (cephfs_cscf-home) has "allow_standby_replay" = true. Multi-MDS was not enabled. There was only one client, nfs-ganesha 2.7, which I compiled from deb-src against Nautilus because I needed multi-fs support. This client was idle over the weekend. I injected debug flags into all MDSs, then restarted nfs-ganesha, which cleared up the problem. Since I only enabled debugging after the hang, I unfortunately don't have debug logs from before that point.

Log from the stuck MDS:
https://termbin.com/z2ad

The last part of the following log repeats up to the present; it seems this MDS was the standby-replay one:
root@mc-3015-422:~# head -200 /var/log/ceph/ceph-mds.mc-3015-422-A.log | nc termbin.com 9999
https://termbin.com/1m6o

The 3rd mds remained in standby the whole time.
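
For anyone who wants to watch for this condition, here is a minimal sketch that pulls the rank states out of the `mds:` line of the `ceph status` output above and lists ranks that are not `up:active`. The function names (`parse_mds_line`, `non_active_ranks`) are made up for illustration, and the regular expressions assume the Nautilus-style line format shown in this report; other releases print this line differently.

```python
import re

def parse_mds_line(line):
    """Parse a Nautilus-style `mds:` line from `ceph status`.

    Returns (fs_name, {rank: (daemon, state)}, standby_count).
    Assumes the format seen in this report, e.g.
    'mds: fsname:1/1 {0=daemon=up:state} 2 up:standby'.
    """
    m = re.search(r"mds:\s*(\S+?):\d+/\d+\s*\{([^}]*)\}", line)
    if m is None:
        raise ValueError("unrecognised mds line: %r" % line)
    fs_name, body = m.group(1), m.group(2)

    ranks = {}
    for entry in filter(None, body.split(",")):
        rank, rest = entry.split("=", 1)        # '0', 'daemon=up:state'
        daemon, state = rest.rsplit("=", 1)     # 'daemon', 'up:state'
        ranks[int(rank)] = (daemon, state)

    # Count plain standbys only; the lookahead skips 'up:standby-replay'.
    s = re.search(r"(\d+)\s+up:standby(?![-\w])", line)
    standby = int(s.group(1)) if s else 0
    return fs_name, ranks, standby

def non_active_ranks(ranks):
    """Return the sorted ranks whose state is anything other than up:active."""
    return sorted(r for r, (_, state) in ranks.items() if state != "up:active")

if __name__ == "__main__":
    line = "    mds: cephfs_cscf-home:1/1 {0=m3-3101-422-A=up:clientreplay} 2 up:standby"
    fs, ranks, standby = parse_mds_line(line)
    print(fs, ranks, standby, non_active_ranks(ranks))
```

A check like this only reports the state at one instant; to tell "stuck" from "passing through clientreplay" it would have to be sampled repeatedly over time.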

Updated by Zheng Yan almost 5 years ago

2019-05-27 11:06:45.314 7ff2f868d700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 116561 (10.1.154.220:0/663516622)
2019-05-27 11:06:45.314 7ff2f868d700  4 mds.0.19 Preparing blacklist command... (wait=0)
2019-05-27 11:06:45.314 7ff2f868d700  4 mds.0.19 Sending mon blacklist command: {"prefix":"osd blacklist", "blacklistop":"add","addr":"10.1.154.220:0/663516622"}
2019-05-27 11:06:45.314 7ff2f868d700  3 mds.0.server handle_client_session client_session(request_close) v1 from client.117235
2019-05-27 11:06:45.906 7ff2f868d700  4 mds.0.19 handle_osd_map epoch 369, 1 new blacklist entries
2019-05-27 11:06:45.910 7ff2f2681700  4 mds.0.19 set_osd_epoch_barrier: epoch=369
2019-05-27 11:06:45.910 7ff2f167f700  5 mds.0.log _submit_thread 4395375~123 : ESession client.116561 10.1.154.220:0/663516622 close cmapv 88 (1000 inos, v15)
2019-05-27 11:06:45.918 7ff2f2681700  1 mds.0.19 clientreplay_done
2019-05-27 11:06:45.918 7ff2f2681700  3 mds.0.19 request_state up:active
2019-05-27 11:06:45.918 7ff2f2681700  5 mds.beacon.m3-3101-422-A set_want_state: up:clientreplay -> up:active
2019-05-27 11:06:45.918 7ff2f2681700  5 mds.beacon.m3-3101-422-A Sending beacon up:active seq 127082

The MDS became active after the Ganesha client was killed. This is a feature: the Ganesha session needs to be reclaimed or evicted before the MDS becomes active.
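
To make the ordering in the excerpt above concrete, here is a small sketch that measures the time between the eviction message and `clientreplay_done`. The helper names are hypothetical, and the sketch assumes every log line starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as the lines quoted here do.

```python
from datetime import datetime

# Three lines copied from the MDS log excerpt quoted in this comment.
LOG_EXCERPT = """\
2019-05-27 11:06:45.314 7ff2f868d700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 116561 (10.1.154.220:0/663516622)
2019-05-27 11:06:45.918 7ff2f2681700  1 mds.0.19 clientreplay_done
2019-05-27 11:06:45.918 7ff2f2681700  3 mds.0.19 request_state up:active
"""

def first_timestamp(lines, needle):
    """Return the timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            stamp = " ".join(line.split()[:2])  # date and time fields
            return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return None

def seconds_between(lines, start_needle, end_needle):
    """Seconds from the first `start_needle` line to the first `end_needle` line."""
    start = first_timestamp(lines, start_needle)
    end = first_timestamp(lines, end_needle)
    if start is None or end is None:
        raise ValueError("one of the events was not found in the log")
    return (end - start).total_seconds()

if __name__ == "__main__":
    gap = seconds_between(
        LOG_EXCERPT.splitlines(),
        "Evicting (and blacklisting) client session",
        "clientreplay_done",
    )
    print("eviction -> clientreplay_done: %.3f s" % gap)
```

In this excerpt the two events are well under a second apart, which fits the explanation that the MDS was waiting on the Ganesha session and nothing else.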

Updated by Patrick Donnelly almost 5 years ago

Target version deleted (~~v14.2.1~~)
Component(FS) Ganesha FSAL, MDS added

Logs from nfs-ganesha would be helpful too if you have them.

Updated by Nathan Fish almost 5 years ago

Here's ganesha.log, not sure if there's anything useful:
https://termbin.com/7ni9

Is it really intended for an mds to hang indefinitely if a client hangs or misbehaves? Is it not safe to forcibly evict it and continue?
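
For reference, this is a sketch of what a manual eviction would look like, using the `ceph tell mds.<rank> client ls` and `client evict id=<session>` commands from the CephFS eviction documentation. It only builds the command lines and does not run them. The helper name is hypothetical, the session id is the one from the log above, and whether the MDS accepts an eviction while it is still in up:clientreplay, or whether doing so is safe, is exactly the open question here and is an assumption of this sketch.

```python
def eviction_commands(mds_rank, session_id):
    """Build (but do not run) ceph CLI argv lists for a manual eviction.

    Returns two commands: one to list client sessions on the given MDS
    rank, and one to evict the session with the given id.
    """
    target = "mds.%d" % mds_rank
    return [
        ["ceph", "tell", target, "client", "ls"],
        ["ceph", "tell", target, "client", "evict", "id=%d" % session_id],
    ]

if __name__ == "__main__":
    # Session 116561 is the Ganesha session named in the MDS log.
    for argv in eviction_commands(0, 116561):
        print(" ".join(argv))
```

The argv lists could be passed to `subprocess.run` on a node with an admin keyring; they are left unexecuted here because eviction also blacklists the client address, as the log shows.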

Updated by Patrick Donnelly almost 5 years ago

Subject changed from MDS stuck in clientreplay to mds: stuck in clientreplay
Assignee set to Jeff Layton
Target version set to v15.0.0
Start date deleted (~~05/27/2019~~)
Source set to Community (user)
Backport set to nautilus
Labels (FS) NFS-cluster added

Updated by Jeff Layton almost 5 years ago

Assignee changed from Jeff Layton to Patrick Donnelly

Updated by Patrick Donnelly almost 5 years ago

Assignee deleted (~~Patrick Donnelly~~)

None of us see why the MDS was stuck in clientreplay. How long do you think it was in that state?

Updated by Patrick Donnelly almost 5 years ago

Status changed from New to Need More Info

Updated by Nathan Fish almost 5 years ago

Patrick Donnelly wrote:

> None of us see why the MDS was stuck in clientreplay. How long do you think it was in that state?

I don't know. It was left idle Friday night and was stuck on Monday morning. It hasn't occurred since. Is it possibly related to the standby-replay feature? I'm willing to disable that if it could prevent this happening in prod.

Updated by Patrick Donnelly almost 5 years ago

Nathan Fish wrote:

> Patrick Donnelly wrote:
>
> > None of us see why the MDS was stuck in clientreplay. How long do you think it was in that state?
>
> I don't know. It was left idle Friday night and was stuck on Monday morning. It hasn't occurred since. Is it possibly related to the standby-replay feature? I'm willing to disable that if it could prevent this happening in prod.

standby-replay is very unlikely to be the cause. Please let us know if it happens again.

Updated by Patrick Donnelly over 4 years ago

Target version deleted (~~v15.0.0~~)
