Bug #63225
hanging cephfs mounts. ceph reporting slow requests and mclientcaps(revoke)
Status: open
Description
Hi,
All ceph-fuse mounts of our CephFS cluster start hanging regularly (every 8-12 hours). ceph health reports:
[root@ceph031 ~]# ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.ceph_fs.ceph031.nhaxht(mds.0): Client hyp120.swablu.os: failing to respond to capability release client_id: 54694670
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.ceph_fs.ceph031.nhaxht(mds.0): 5 slow requests are blocked > 30 secs
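To track which client is holding things up across repeated hangs, the client id can be pulled out of the health output with a small script. This is only a sketch: the regex is written against the exact MDS_CLIENT_LATE_RELEASE line shown above, and the function name `late_release_clients` is made up for this example, not part of any Ceph tooling.

```python
import re

# Pattern modelled on the MDS_CLIENT_LATE_RELEASE detail line quoted in this
# report. Assumption: other Ceph releases may word this line differently.
LATE_RELEASE = re.compile(
    r"^(?P<mds>\S+): Client (?P<client>\S+): "
    r"failing to respond to capability release client_id: (?P<client_id>\d+)$"
)


def late_release_clients(health_detail: str) -> list[dict]:
    """Return one dict per client reported as late in releasing capabilities."""
    found = []
    for line in health_detail.splitlines():
        match = LATE_RELEASE.match(line.strip())
        if match:
            found.append(
                {
                    "mds": match.group("mds"),
                    "client": match.group("client"),
                    "client_id": int(match.group("client_id")),
                }
            )
    return found


# Sample input copied from the `ceph health detail` output in this report.
SAMPLE_HEALTH = """\
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.ceph_fs.ceph031.nhaxht(mds.0): Client hyp120.swablu.os: failing to respond to capability release client_id: 54694670
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.ceph_fs.ceph031.nhaxht(mds.0): 5 slow requests are blocked > 30 secs
"""

if __name__ == "__main__":
    for entry in late_release_clients(SAMPLE_HEALTH):
        print(entry["client_id"], entry["client"], entry["mds"])
```

The saved output of `ceph health detail` can be passed in as a string in place of `SAMPLE_HEALTH`.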
The log file of the active MDS also shows client evictions, slow requests, and unanswered cap revokes:
2023-10-14T08:20:28.077+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : evicting unresponsive client hyp123.swablu.os:cephfs (54755042), after 304.865 seconds
2023-10-14T08:20:28.077+0000 7fe55fe0f700 1 mds.0.1528 Evicting (and blocklisting) client session 54755042 (10.143.20.123:0/1882518721)
2023-10-14T08:20:28.077+0000 7fe55fe0f700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 54755042 (10.143.20.123:0/1882518721)
2023-10-14T13:01:28.287+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 31.277313 secs
2023-10-14T13:01:28.287+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : slow request 31.277312 seconds old, received at 2023-10-14T13:00:57.011354+0000: client_request(client.54781123:865 unlink #0x1000006d51d/disk.2 2023-10-14T13:00:57.011313+0000 caller_uid=9869, caller_gid=9869{9869,}) currently failed to wrlock, waiting
2023-10-14T13:01:30.180+0000 7fe561612700 1 mds.ceph_fs.ceph031.nhaxht Updating MDS map to version 1533 from mon.0
2023-10-14T13:01:58.288+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : slow request 61.277676 seconds old, received at 2023-10-14T13:00:57.011354+0000: client_request(client.54781123:865 unlink #0x1000006d51d/disk.2 2023-10-14T13:00:57.011313+0000 caller_uid=9869, caller_gid=9869{9869,}) currently failed to wrlock, waiting
2023-10-14T13:01:58.288+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1000006d51d pending pAsLsXs issued pAsLsXsFs, sent 61.277680 seconds ago
2023-10-14T13:02:01.421+0000 7fe561612700 1 mds.ceph_fs.ceph031.nhaxht Updating MDS map to version 1534 from mon.0
2023-10-14T13:24:58.305+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 1441.294805 secs
2023-10-14T13:24:58.305+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : slow request 31.106247 seconds old, received at 2023-10-14T13:24:27.199912+0000: client_request(client.54728269:7013 readdir #0x1000006d51d 2023-10-14T13:24:27.199847+0000 caller_uid=9869, caller_gid=9869{9869,}) currently failed to rdlock, waiting
2023-10-17T08:14:21.315+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 61.315658 seconds ago
2023-10-17T08:15:21.316+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 121.316420 seconds ago
2023-10-17T08:17:21.317+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 241.317899 seconds ago
2023-10-17T08:21:21.320+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 481.320907 seconds ago
2023-10-17T08:29:21.326+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 961.326899 seconds ago
2023-10-17T08:45:21.338+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 1921.338920 seconds ago
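In the revoke warnings above, the difference between the "issued" and "pending" cap strings shows which capability the client has not given back (here `Fs`, which as far as I understand the cap notation is the shared file capability). A rough sketch for extracting that from an MDS log follows. The helper names `cap_bits` and `parse_revoke_warnings` are hypothetical, and the tokenising rule (a leading `p`, then an uppercase group letter followed by lowercase bits) is an assumption based on the strings seen in this log.

```python
import re

# Pattern modelled on the "isn't responding to mclientcaps(revoke)" lines
# quoted in this report; other releases may format the message differently.
REVOKE = re.compile(
    r"client\.(?P<client_id>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>[A-Za-z]+) "
    r"issued (?P<issued>[A-Za-z]+), sent (?P<age>[\d.]+) seconds ago"
)


def cap_bits(caps: str) -> set[str]:
    """Split a cap string such as 'pAsLsXsFs' into tokens like 'As' or 'Fs'.

    Assumption: 'p' stands alone, and every other lowercase bit belongs to
    the most recent uppercase group letter.
    """
    bits = set()
    group = ""
    for ch in caps:
        if ch == "p":
            bits.add("p")
        elif ch.isupper():
            group = ch
        else:
            bits.add(group + ch)
    return bits


def parse_revoke_warnings(log_text: str) -> list[dict]:
    """Return one dict per revoke warning found in the log text."""
    rows = []
    for match in REVOKE.finditer(log_text):
        issued = cap_bits(match.group("issued"))
        pending = cap_bits(match.group("pending"))
        rows.append(
            {
                "client_id": int(match.group("client_id")),
                "ino": match.group("ino"),
                # Caps the client still holds but the MDS wants back.
                "not_released": sorted(issued - pending),
                "age_seconds": float(match.group("age")),
            }
        )
    return rows


# Two lines copied from the MDS log in this report.
SAMPLE_LOG = """\
2023-10-14T13:01:58.288+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1000006d51d pending pAsLsXs issued pAsLsXsFs, sent 61.277680 seconds ago
2023-10-17T08:45:21.338+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 1921.338920 seconds ago
"""

if __name__ == "__main__":
    for row in parse_revoke_warnings(SAMPLE_LOG):
        print(row)
```

Run against the full MDS log, this makes it easy to see that every warning here points at the same client (54694670) and the same unreleased cap, first on inode 0x1000006d51d and later on inode 0x1.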
All daemons and ceph-fuse clients are running 17.2.6. Ceph is running in containers, deployed with cephadm.
Updated by Kenneth Waegeman 6 months ago
I have not seen the issue again recently, after removing a faulty OSD drive.