Bug #49434

closed

Bug #57244: [WRN] : client.408214273 isn't responding to mclientcaps(revoke), ino 0x10000000003 pending pAsLsXsFs issued pAsLsXsFs, sent 62.303702 seconds ago

`client isn't responding to mclientcaps(revoke)` for hours

Added by Wouter van Os about 3 years ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of our clients does not seem to respond to `mclientcaps(revoke)` requests in which the issued and pending caps are identical. I looked into older issues, but I don't think raising `session_timeout` would help: as the logs below show, the client sometimes fails to respond for more than 17 hours.

```
2021-02-18 10:31:57.105 7fad25cb9700 0 log_channel(cluster) log [WRN] : client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 pending pAsLsXsFsc issued pAsLsXsFsc, sent 61443.879409 seconds ago
- a day later and many similar lines -
2021-02-17 17:29:54.362 7fad25cb9700 0 log_channel(cluster) log [WRN] : client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 pending pAsLsXsFsc issued pAsLsXsFsc, sent 121.142165 seconds ago
2021-02-17 17:28:54.365 7fad25cb9700 0 log_channel(cluster) log [WRN] : client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 pending pAsLsXsFsxcrwb issued pAsLsXsFsxcrwb, sent 61.141179 seconds ago
```
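For anyone triaging similar warnings, the fields in these lines can be pulled out with a small script. The following is only a minimal sketch: the regular expression is modelled on the three lines quoted above (other Ceph releases may format the message differently), and the names `REVOKE_RE` and `parse_revoke_warning` are hypothetical, not part of any Ceph tool. It also flags the case described here, where the pending and issued caps are the same.

```python
import re
from datetime import timedelta

# Pattern modelled on the warning lines quoted in this report (an assumption,
# not an official format specification).
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>[A-Za-z]+) "
    r"issued (?P<issued>[A-Za-z]+), sent (?P<age>\d+(?:\.\d+)?) seconds ago"
)


def parse_revoke_warning(line):
    """Return a dict describing one revoke warning, or None if the line does not match."""
    m = REVOKE_RE.search(line)
    if m is None:
        return None
    return {
        "client": int(m.group("client")),
        "ino": int(m.group("ino"), 16),
        "pending": m.group("pending"),
        "issued": m.group("issued"),
        # How long ago the revoke was sent, as a timedelta for easy comparison.
        "age": timedelta(seconds=float(m.group("age"))),
        # True when the MDS reports identical pending and issued caps.
        "caps_unchanged": m.group("pending") == m.group("issued"),
    }


sample = (
    "2021-02-18 10:31:57.105 7fad25cb9700 0 log_channel(cluster) log [WRN] : "
    "client.7611766 isn't responding to mclientcaps(revoke), ino 0x100019e3431 "
    "pending pAsLsXsFsc issued pAsLsXsFsc, sent 61443.879409 seconds ago"
)
info = parse_revoke_warning(sample)
print(info["client"], hex(info["ino"]), info["age"], info["caps_unchanged"])
```

Running this over the attached log would make it easy to see whether the same inode is involved each time and how the "sent ... seconds ago" value grows.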

No particular load or spike is noticeable on this specific client (averaging around 0.8), and our other ~700 clients don't have this issue. The warning disappears once we restart or remount the client, but it comes back after a couple of days on this same client, with no consistent pattern in when it returns. It does not seem to cause any further problems, except that the Ceph cluster stays in WARN state, which makes monitoring hard.

Session:
```
{
    "id": 7611766,
    "num_leases": 1,
    "num_caps": 109,
    "state": "open",
    "request_load_avg": 5,
    "uptime": 775257.18299780798,
    "replay_requests": 0,
    "completed_requests": 1,
    "reconnecting": false,
    "inst": "client.7611766 v1:10.10.2.104:0/3115386332",
    "client_metadata": {
        "features": "0000000000001bff",
        "entity_id": "admin",
        "hostname": "job1",
        "kernel_version": "5.4.0-65-generic",
        "root": "/mountpath"
    }
}
```
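To tie a warning's `client.<id>` back to its session entry, something like the following could be used. This is a sketch under assumptions: it presumes the session listing is available as a JSON array of objects shaped like the one above, the helper name `find_session` is hypothetical, and the inline sample is a trimmed copy of the fields shown in this report.

```python
import json


def find_session(sessions, client_id):
    """Return the session dict whose "id" equals client_id, or None if absent."""
    for session in sessions:
        if session.get("id") == client_id:
            return session
    return None


# Trimmed copy of the session entry from this report, wrapped in a list
# (the list wrapper is an assumption about the surrounding output).
raw = """
[
    {
        "id": 7611766,
        "num_caps": 109,
        "state": "open",
        "client_metadata": {
            "hostname": "job1",
            "kernel_version": "5.4.0-65-generic"
        }
    }
]
"""
sessions = json.loads(raw)
session = find_session(sessions, 7611766)
print(session["client_metadata"]["hostname"], session["num_caps"])
```

With many clients, this kind of lookup gives the hostname and kernel version of the client named in a warning without scanning the whole listing by eye.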

Any idea what could cause this? I asked on IRC first, and the answer there was "just restart it", but the problem happens too often for that to be a real workaround. I've also attached a full log file with all the lines from the last time the cluster was in WARN.

Thanks.


Files

revoke.log (36.1 KB) Wouter van Os, 02/23/2021 12:17 PM

#1

Updated by Xiubo Li over 1 year ago

  • Tracker changed from Support to Bug
  • Status changed from New to Duplicate
  • Parent task set to #57244
  • Regression set to No
  • Severity set to 3 - minor