Bug #47689
rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client
Description
"2020-09-29T12:20:24.346813+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi039:y (4771), after 304.592 seconds" in cluster log
/a/teuthology-2020-09-29_07:01:02-rados-master-distro-basic-smithi/5480802
Discussed this with Patrick and he is debugging this further in https://github.com/ceph/ceph/pull/37419
History
#1 Updated by Neha Ojha almost 3 years ago
/a/teuthology-2020-09-30_07:01:02-rados-master-distro-basic-smithi/5483508/
#2 Updated by Neha Ojha almost 3 years ago
- Priority changed from Normal to High
/a/teuthology-2020-10-01_07:01:02-rados-master-distro-basic-smithi/5485885
#3 Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v16.0.0
- Source set to Q/A
This appears to be a fairly old failure. Here are a few instances:
https://pulpito.ceph.com/teuthology-2020-07-22_07:01:02-rados-master-distro-basic-smithi/5248885/
https://pulpito.ceph.com/nojha-2020-06-09_20:15:18-rados:upgrade:nautilus-x-singleton-master-distro-basic-smithi/
Unfortunately there are no mgr logs for those.
I built a branch wip-pdonnell-testing-20201001.200945 with https://github.com/ceph/ceph/pull/37419 and got a run here:
It's very reproducible (but not 100%!!). For job
/ceph/teuthology-archive/pdonnell-2020-10-01_23:18:55-rados:upgrade:nautilus-x-singleton-wip-pdonnell-testing-20201001.200945-distro-basic-smithi/5487412
the evicted client is client.4775. The last thing we see from it in the mgr log is:
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 tick
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 collect_and_send_metrics
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 collect_and_send_global_metrics
2020-10-01T23:37:08.331+0000 7f183caa6700  5 client.4775 collect_and_send_global_metrics: rank=0 does not support metrics
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 trim_cache size 0 max 16384
That is 2 seconds before it's supposed to renew its caps. Then the mgr failed over:
2020-10-01T23:37:08.623+0000 7f186cf7d700 10 mgr ms_dispatch2 active mgrmap(e 6) v1
2020-10-01T23:37:08.623+0000 7f186cf7d700  4 mgr handle_mgr_map received map epoch 6
2020-10-01T23:37:08.623+0000 7f186cf7d700  4 mgr handle_mgr_map active in map: 1 active is 4649
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  e: 'ceph-mgr'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  0: 'ceph-mgr'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  1: '-f'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  2: '--cluster'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  3: 'ceph'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  4: '-i'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  5: 'x'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  exe_path /proc/self/exe
2020-10-01T23:37:08.655+0000 7f1f245dd200  0 ceph version 16.0.0-6089-gaf0d0003 (af0d0003260d1be1bd25c7a3d58968bfa9e2f522) pacific (dev), process ceph-mgr, pid 17937
That was due to the change in active modules. The mons blocklisted the old instance here:
2020-10-01T23:37:11.670+0000 7fcd30828700  4 mon.b@0(leader).mgr e6 beacon from 74108
2020-10-01T23:37:11.670+0000 7fcd30828700  4 mon.b@0(leader).mgr e6 Active daemon restart (mgr.x)
2020-10-01T23:37:11.670+0000 7fcd30828700  0 log_channel(cluster) log [INF] : Active manager daemon x restarted
2020-10-01T23:37:11.670+0000 7fcd30828700  1 -- v1:172.21.15.49:6789/0 --> v1:172.21.15.49:6789/0 -- log(1 entries from seq 122 at 2020-10-01T23:37:11.674534+0000) v1 -- 0x565056254000 con 0x565054c89400
2020-10-01T23:37:11.670+0000 7fcd30828700  5 mon.b@0(leader).mgr e6 blocklisting previous mgr.x.4649 (v1:172.21.15.137:6800/17937)
2020-10-01T23:37:11.670+0000 7fcd30828700 10 mon.b@0(leader).osd e36 blocklist v1:172.21.15.137:6800/17937 until 2020-10-02T23:37:11.674589+0000
From: /ceph/teuthology-archive/pdonnell-2020-10-01_23:18:55-rados:upgrade:nautilus-x-singleton-wip-pdonnell-testing-20201001.200945-distro-basic-smithi/5487412/remote/smithi049/log/ceph-mon.b.log.gz
The client sessions were never blocklisted (including the CephFS session). This is because the Nautilus mgrmap does not record the client sessions (that is an Octopus/Pacific feature), so the mons have no client addresses to blocklist when the old mgr instance is replaced. I think the only thing to do here is ignore this warning; there's no way to handle it.
I'll write up a fix...
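For context, teuthology suites usually suppress an expected cluster-log warning by listing a matching substring in the suite's job YAML. Below is a minimal sketch of what such an entry could look like for this suite; the key name (`log-ignorelist` vs. `log-whitelist`, depending on the branch) and placement are assumptions for illustration, not a quote of the actual change in PR 37528.

    # Hypothetical sketch, not the literal contents of PR 37528.
    # Tolerate the expected MDS eviction warning during the nautilus-x upgrade,
    # since the pre-Octopus mgr's client sessions cannot be blocklisted on failover.
    overrides:
      ceph:
        log-ignorelist:        # older branches use "log-whitelist"
          - evicting unresponsive client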
#4 Updated by Patrick Donnelly almost 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 37528
#5 Updated by Patrick Donnelly almost 3 years ago
- Status changed from Fix Under Review to Resolved
#6 Updated by Patrick Donnelly almost 3 years ago
- Status changed from Resolved to Pending Backport
- Backport set to octopus,nautilus
#7 Updated by Patrick Donnelly almost 3 years ago
Marked for backport because of these failures: https://pulpito.ceph.com/teuthology-2020-11-17_03:29:52-upgrade:nautilus-p2p-nautilus-distro-basic-smithi/
#8 Updated by Nathan Cutler almost 3 years ago
- Copied to Backport #48285: octopus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client added
#9 Updated by Nathan Cutler almost 3 years ago
- Copied to Backport #48286: nautilus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client added
#10 Updated by Nathan Cutler over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
#11 Updated by Deepika Upadhyay over 2 years ago
looks similar: https://tracker.ceph.com/issues/50275
#12 Updated by Deepika Upadhyay over 2 years ago
- Related to Bug #50275: nautilus: evicting unresponsive client added