Bug #23883

open

kclient: CephFS kernel client hang

Added by wei jin about 6 years ago. Updated about 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): kceph
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph: 12.2.4/12.2.5
os: debian jessie
kernel: 4.9/4.4

After restarting all MDS daemons (6 in total: 5 active, 1 standby), the client hangs in the 'stat' operation according to strace (test commands: df, lsof, ...). The issue is very easy to trigger by running 'ls -lR' on a huge directory (on the order of a million files) before restarting all MDS daemons.
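
A minimal sketch of how the hang could be detected without blocking the shell that checks for it. This is not part of the original report: the helper names and the 5-second timeout are assumptions chosen for illustration. The check runs the stat call in a child process and only polls it, since a process stuck in the kernel client may not exit even when killed.

```python
import subprocess
import sys
import time


def command_finishes(cmd, timeout_s=5.0, poll_s=0.1):
    """Return True if cmd exits within timeout_s seconds, False otherwise."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(poll_s)
    # One last check, then a best-effort kill. We deliberately do not wait()
    # here: a child blocked inside the kernel may never exit, and waiting on
    # it would hang this process as well.
    if proc.poll() is not None:
        return True
    proc.kill()
    return False


def stat_finishes(path, timeout_s=5.0):
    """Run os.stat(path) in a child interpreter and report whether it returned.

    True means the call came back (even with an error such as ENOENT);
    False means it was still blocked when the timeout expired.
    """
    cmd = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    return command_finishes(cmd, timeout_s)
```

For example, `stat_finishes("/mnt/cephfs")` (a hypothetical mount point) would be expected to return False on a client in the state described above, and True on a healthy mount.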

dmesg shows that the connection was denied (see the 'reconnect denied' lines below). I did not run into this issue with Jewel. This is very critical because many commands hang on the client machine, and a reboot seems to be the only way to recover.

I am not sure whether this patch (https://github.com/ceph/ceph/pull/21592) fixes this issue as well, or whether it is related to authentication bug #110131.

[164426.867436] libceph: mds4 10.20.52.159:6800 socket closed (con state OPEN)
[164427.871075] libceph: mds0 10.20.52.160:6800 socket closed (con state OPEN)
[164431.274627] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164432.042622] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164433.034613] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164435.050590] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164439.213001] libceph: wrong peer, want 10.20.52.160:6800/-713929449, got 10.20.52.160:6800/1282156498
[164439.213003] libceph: mds0 10.20.52.160:6800 wrong peer at address
[164479.144207] ceph: mds0 caps stale
[164479.144209] ceph: mds4 caps stale
[164488.007687] ceph: mds4 reconnect start
[164636.153486] ceph: mds0 reconnect start
[164647.877800] ceph: mds4 recovery completed
[164647.877807] ceph: mds0 recovery completed
[164647.878088] libceph: mon0 10.20.63.194:6789 session lost, hunting for new mon
[164647.887724] ceph: mds0 reconnect denied
[164647.887742] ceph: mds4 reconnect denied
[164647.890247] libceph: mon4 10.20.64.66:6789 session established
[164654.634241] libceph: mds4 10.20.52.155:6800 socket closed (con state NEGOTIATING)
[165547.897829] libceph: mds2 10.20.52.158:6800 socket closed (con state OPEN)
[165555.605291] libceph: mds4 10.20.52.155:6800 socket closed (con state OPEN)
[166274.096654] libceph: mds1 10.20.52.156:6800 socket closed (con state OPEN)
[166469.161566] libceph: mds3 10.20.52.159:6800 socket closed (con state OPEN)
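
To check other clients for the same symptom, the 'reconnect denied' lines can be pulled out of saved dmesg output. The following is a rough sketch, not something from the original report: the function name and regular expression are assumptions that match only the message format shown in the log above, and other kernel versions may word the message differently.

```python
import re

# Matches lines such as: "[164647.887724] ceph: mds0 reconnect denied"
DENIED_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+ceph: mds(\d+) reconnect denied\s*$")

# Two lines copied from the log above, plus one non-matching line.
SAMPLE = """\
[164647.877807] ceph: mds0 recovery completed
[164647.887724] ceph: mds0 reconnect denied
[164647.887742] ceph: mds4 reconnect denied
"""


def denied_reconnects(dmesg_text):
    """Return (timestamp, mds_rank) for every 'reconnect denied' line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = DENIED_RE.match(line.strip())
        if match:
            # Keep the timestamp as the original string to avoid float rounding.
            hits.append((match.group(1), int(match.group(2))))
    return hits


if __name__ == "__main__":
    for timestamp, rank in denied_reconnects(SAMPLE):
        print(f"mds{rank} denied reconnect at {timestamp}")
```

In practice the text would come from `dmesg` output saved to a file rather than from the embedded sample.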