Bug #63225

open

hanging cephfs mounts. ceph reporting slow requests and mclientcaps(revoke)

Added by Kenneth Waegeman 7 months ago. Updated 6 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, MDS, ceph-fuse
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

All mounts (ceph-fuse) of our CephFS cluster start hanging regularly (every 8-12 hours). ceph health reports:
[root@ceph031 ~]# ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.ceph_fs.ceph031.nhaxht(mds.0): Client hyp120.swablu.os: failing to respond to capability release client_id: 54694670
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.ceph_fs.ceph031.nhaxht(mds.0): 5 slow requests are blocked > 30 secs

I also found the following in the log file of the active MDS:

2023-10-14T08:20:28.077+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : evicting unresponsive client hyp123.swablu.os:cephfs (54755042), after 304.865 seconds
2023-10-14T08:20:28.077+0000 7fe55fe0f700 1 mds.0.1528 Evicting (and blocklisting) client session 54755042 (10.143.20.123:0/1882518721)
2023-10-14T08:20:28.077+0000 7fe55fe0f700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 54755042 (10.143.20.123:0/1882518721)
2023-10-14T13:01:28.287+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 31.277313 secs
2023-10-14T13:01:28.287+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : slow request 31.277312 seconds old, received at 2023-10-14T13:00:57.011354+0000: client_request(client.54781123:865 unlink #0x1000006d51d/disk.2 2023-10-14T13:00:57.011313+0000 caller_uid=9869, caller_gid=9869{9869,}) currently failed to wrlock, waiting
2023-10-14T13:01:30.180+0000 7fe561612700 1 mds.ceph_fs.ceph031.nhaxht Updating MDS map to version 1533 from mon.0

2023-10-14T13:01:58.288+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : slow request 61.277676 seconds old, received at 2023-10-14T13:00:57.011354+0000: client_request(client.54781123:865 unlink #0x1000006d51d/disk.2 2023-10-14T13:00:57.011313+0000 caller_uid=9869, caller_gid=9869{9869,}) currently failed to wrlock, waiting
2023-10-14T13:01:58.288+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1000006d51d pending pAsLsXs issued pAsLsXsFs, sent 61.277680 seconds ago
2023-10-14T13:02:01.421+0000 7fe561612700 1 mds.ceph_fs.ceph031.nhaxht Updating MDS map to version 1534 from mon.0

2023-10-14T13:24:58.305+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 1441.294805 secs
2023-10-14T13:24:58.305+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : slow request 31.106247 seconds old, received at 2023-10-14T13:24:27.199912+0000: client_request(client.54728269:7013 readdir #0x1000006d51d 2023-10-14T13:24:27.199847+0000 caller_uid=9869, caller_gid=9869{9869,}) currently failed to rdlock, waiting

2023-10-17T08:14:21.315+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 61.315658 seconds ago
2023-10-17T08:15:21.316+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 121.316420 seconds ago
2023-10-17T08:17:21.317+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 241.317899 seconds ago
2023-10-17T08:21:21.320+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 481.320907 seconds ago
2023-10-17T08:29:21.326+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 961.326899 seconds ago
2023-10-17T08:45:21.338+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 1921.338920 seconds ago
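
For anyone triaging a similar log, here is a minimal sketch of how the "isn't responding to mclientcaps(revoke)" warnings above could be summarized per client. The regular expression is written against the three sample lines quoted in this report only, and the way the cap strings are split into bits (a lone `p`, then an uppercase group letter followed by lowercase flags) is an assumption about the notation, not something taken from the Ceph source. The helper names `expand_caps` and `parse_revoke_warnings` are hypothetical.

```python
import re

# Three warning lines copied verbatim from the MDS log in this report.
LOG = """\
2023-10-14T13:01:58.288+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1000006d51d pending pAsLsXs issued pAsLsXsFs, sent 61.277680 seconds ago
2023-10-17T08:14:21.315+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 61.315658 seconds ago
2023-10-17T08:45:21.338+0000 7fe55fe0f700 0 log_channel(cluster) log [WRN] : client.54694670 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs, sent 1921.338920 seconds ago
"""

# Pattern derived from the sample lines above (assumption: other releases
# may format this warning differently).
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>\S+) issued (?P<issued>\S+), "
    r"sent (?P<age>[\d.]+) seconds ago"
)


def expand_caps(caps):
    """Split a cap string such as 'pAsLsXsFs' into bits like {'p', 'As', 'Fs'}.

    Assumed notation: a lone 'p', then groups made of one uppercase letter
    followed by zero or more lowercase flags.
    """
    bits = set()
    for token in re.findall(r"p|[A-Z][a-z]*", caps):
        if token == "p":
            bits.add("p")
        else:
            bits.update(token[0] + flag for flag in token[1:])
    return bits


def parse_revoke_warnings(text):
    """Return one dict per revoke warning found in the given log text."""
    found = []
    for line in text.splitlines():
        m = REVOKE_RE.search(line)
        if m is None:
            continue
        found.append({
            "client": int(m["client"]),
            "ino": m["ino"],
            # Bits the client still holds but the MDS wants back.
            "revoked": sorted(expand_caps(m["issued"]) - expand_caps(m["pending"])),
            "age": float(m["age"]),
        })
    return found


stuck = parse_revoke_warnings(LOG)
for entry in stuck:
    print(entry)
```

On the sample lines, every entry points at the same client (54694670) and the only difference between the issued and pending cap strings is `Fs`, first on ino 0x1000006d51d and later on ino 0x1.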

All daemons and ceph-fuse clients are running 17.2.6. Ceph is running in containers deployed with cephadm.

Actions #1

Updated by Milind Changire 6 months ago

  • Assignee set to Xiubo Li
Actions #2

Updated by Kenneth Waegeman 6 months ago

I have not seen the issue again since removing a faulty OSD drive.
