Bug #49434: `client isn't responding to mclientcaps(revoke)` for hours - CephFS - Ceph

Bug #49434

closed

Bug #57244: [WRN] : client.408214273 isn't responding to mclientcaps(revoke), ino 0x10000000003 pending pAsLsXsFs issued pAsLsXsFs, sent 62.303702 seconds ago

`client isn't responding to mclientcaps(revoke)` for hours

Added by Wouter van Os about 3 years ago. Updated over 1 year ago.

Status: Duplicate

Priority: Normal

Assignee:

Category:

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity: 3 - minor

Reviewed:

Affected Versions: Ceph - v14.2.6

ceph-qa-suite:

Component(FS): Client

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

One of our clients does not seem to respond to `mclientcaps(revoke)` requests in which the issued and pending caps are the same. I looked into older issues, but I don't think raising the session_timeout would help: as the logs below show, the client sometimes does not respond for more than 17 hours.

```
2021-02-18 10:31:57.105 7fad25cb9700 0 log_channel(cluster) log [WRN] : client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 pending pAsLsXsFsc issued pAsLsXsFsc, sent 61443.879409 seconds ago
- a day later and many similar lines -
2021-02-17 17:29:54.362 7fad25cb9700 0 log_channel(cluster) log [WRN] : client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 pending pAsLsXsFsc issued pAsLsXsFsc, sent 121.142165 seconds ago
2021-02-17 17:28:54.365 7fad25cb9700 0 log_channel(cluster) log [WRN] : client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 pending pAsLsXsFsxcrwb issued pAsLsXsFsxcrwb, sent 61.141179 seconds ago
```
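To check how often the pending and issued caps really are identical across a long log such as the attached one, the warning lines can be parsed mechanically. The following is a minimal sketch, not part of Ceph: the regular expression is derived only from the three lines quoted above and may not match other Ceph versions, and `parse_revoke_warning` is a hypothetical helper name.

```python
import re

# Field layout inferred from the warning lines quoted in this report only;
# treat it as an assumption for any other Ceph release.
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>[A-Za-z]+) "
    r"issued (?P<issued>[A-Za-z]+), sent (?P<age>[\d.]+) seconds ago"
)


def parse_revoke_warning(line):
    """Return the warning's fields as a dict, or None if the line does not match."""
    m = REVOKE_RE.search(line)
    if m is None:
        return None
    return {
        "client": int(m.group("client")),
        "ino": m.group("ino"),
        "pending": m.group("pending"),
        "issued": m.group("issued"),
        "age_seconds": float(m.group("age")),
        # True when the MDS reports the same cap string as pending and issued.
        "same_caps": m.group("pending") == m.group("issued"),
    }


sample = (
    "2021-02-18 10:31:57.105 7fad25cb9700 0 log_channel(cluster) log [WRN] : "
    "client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 "
    "pending pAsLsXsFsc issued pAsLsXsFsc, sent 61443.879409 seconds ago"
)
info = parse_revoke_warning(sample)
hours = round(info["age_seconds"] / 3600, 1)
print(info["client"], info["ino"], info["same_caps"], hours)
```

Running the same function over every line of a log file and counting the entries where `same_caps` is true would show whether the identical pending/issued pattern holds for the whole period.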

No particular load or spike is noticeable on this specific client (avg. around 0.8), and our other ~700 clients don't have this issue. The warning disappears once we restart or remount the client, but it comes back after a couple of days on this specific client, though not at consistent intervals. It does not seem to cause any further problems, except that Ceph stays in the WARN state, which makes monitoring hard.

Session:
```
{
  "id": 7611766,
  "num_leases": 1,
  "num_caps": 109,
  "state": "open",
  "request_load_avg": 5,
  "uptime": 775257.18299780798,
  "replay_requests": 0,
  "completed_requests": 1,
  "reconnecting": false,
  "inst": "client.7611766 v1:10.10.2.104:0/3115386332",
  "client_metadata": {
    "features": "0000000000001bff",
    "entity_id": "admin",
    "hostname": "job1",
    "kernel_version": "5.4.0-65-generic",
    "root": "/mountpath"
  }
},
```
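To tie a `client.<id>` from a warning line back to a host and kernel version, the session listing can be searched by `id`. This is a rough sketch under one assumption: that the full listing is a JSON array of objects shaped like the entry above (the trailing `},` suggests it is one element of a larger list). The embedded JSON is a trimmed copy of that entry, and `find_session` is a hypothetical helper.

```python
import json

# Trimmed copy of the session entry shown above, wrapped in a list to mimic
# what a full session listing is assumed to look like.
sessions_json = """
[
  {
    "id": 7611766,
    "num_caps": 109,
    "state": "open",
    "client_metadata": {
      "hostname": "job1",
      "kernel_version": "5.4.0-65-generic",
      "root": "/mountpath"
    }
  }
]
"""


def find_session(sessions, client_id):
    """Return the first session whose "id" equals client_id, or None."""
    for session in sessions:
        if session.get("id") == client_id:
            return session
    return None


session = find_session(json.loads(sessions_json), 7611766)
meta = session["client_metadata"]
print(meta["hostname"], meta["kernel_version"], session["num_caps"])
```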

Any idea what could cause this? I asked on IRC first, but the answer there was "just restart it"; the problem happens too often for that to be a real workaround. I've also attached a full log file of all the lines from the last time it was in WARN.

Thanks.

Files

revoke.log (36.1 KB)

Wouter van Os, 02/23/2021 12:17 PM

Updated by Xiubo Li over 1 year ago

Tracker changed from Support to Bug
Status changed from New to Duplicate
Parent task set to #57244
Regression set to No
Severity set to 3 - minor
