Bug #43943
closed
qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds"
Description
/a/sage-2020-01-28_03:52:05-rados-wip-sage2-testing-2020-01-27-1839-distro-basic-smithi/4713589
description: rados/mgr/{clusters/{2-node-mgr.yaml} debug/mgr.yaml objectstore/bluestore-bitmap.yaml
supported-random-distro$/{ubuntu_latest.yaml} tasks/module_selftest.yaml}
/a/sage-2020-01-30_22:27:29-rados-wip-sage-testing-2020-01-30-1230-distro-basic-smithi/4719492
Part 2 of #40867
Updated by Patrick Donnelly about 4 years ago
- Related to Bug #40867: mgr: failover during in qa testing causes unresponsive client warnings added
Updated by Venky Shankar about 4 years ago
client: 172.21.15.131:0/4191323679 (cephfs instance), registers its addrs with ceph-mgr:
2020-01-28T18:10:26.941+0000 7fccd75ba700 7 mgr register_client registering msgr client handle v2:172.21.15.131:0/4191323679
but before the active mgr instance can send its updated client list to the monitor, the manager transitions to standby:
2020-01-28T18:10:29.837+0000 7f8790ea2ec0 0 ceph version 15.0.0-9869-g4b944a6 (4b944a6d8397907af1750fd52b641cbb82a57ba2) octopus (dev), process ceph-mgr, pid 16726
....
2020-01-28T18:10:32.833+0000 7f8790ea2ec0 20 mgr send_beacon standby
2020-01-28T18:10:32.833+0000 7f8790ea2ec0 10 mgr send_beacon sending beacon as gid 8056
Between the call to register and the transitioning to standby, mgr didn't get a chance to call `send_beacon()` (which is called every `mgr_tick_period` seconds), but mds knows about this client.
Maybe sending a "final" beacon to the monitor with the updated client list before transitioning to standby might work. I'm not sure. Any other approaches others can think of?
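To make the window concrete, here is a small toy model of the periodic beacon, not Ceph code. `ToyMgr`, `unreported_at_restart`, and the timings are hypothetical and chosen only for illustration; the tick period in the example is arbitrary and is not meant to reflect the real `mgr_tick_period` setting. It replays registrations and periodic beacons in time order and reports which handles were registered but never carried by a beacon before the restart.

```python
class ToyMgr:
    """Toy stand-in for an active mgr. Hypothetical, not Ceph code."""

    def __init__(self):
        self.clients = set()    # handles registered with this mgr
        self.reported = set()   # handles the monitor saw in the last beacon

    def register_client(self, addr):
        self.clients.add(addr)

    def send_beacon(self):
        # A beacon carries a snapshot of the current client list.
        self.reported = set(self.clients)


def unreported_at_restart(registrations, tick_period_ms, restart_ms):
    """Return handles registered before the restart that no beacon carried.

    registrations: iterable of (time_ms, addr) pairs.
    Beacons fire at 0, tick_period_ms, 2 * tick_period_ms, ... strictly
    before restart_ms.
    """
    REGISTER, BEACON = 0, 1  # at equal timestamps, registrations sort first
    timeline = [(t, REGISTER, addr) for t, addr in registrations if t < restart_ms]
    timeline += [(t, BEACON, "") for t in range(0, restart_ms, tick_period_ms)]

    mgr = ToyMgr()
    for _, kind, addr in sorted(timeline):
        if kind == REGISTER:
            mgr.register_client(addr)
        else:
            mgr.send_beacon()
    return mgr.clients - mgr.reported


# Arbitrary illustrative timings: client-b registers just after a beacon
# and the mgr restarts before the next one is due.
late = unreported_at_restart(
    [(100, "client-a"), (5100, "client-b")],
    tick_period_ms=5000,
    restart_ms=8000,
)
print(late)  # {'client-b'}
```

In this model a handle is only lost when it registers after the last beacon that fires before the restart, which is the gap described above.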
Updated by Patrick Donnelly about 4 years ago
Venky Shankar wrote:
client: 172.21.15.131:0/4191323679 (cephfs instance), registers its addrs with ceph-mgr:
[...]
but before the active mgr instance can send its updated client list to the monitor, the manager transitions to standby:
[...]
Between the call to register and the transitioning to standby, mgr didn't get a chance to call `send_beacon()` (which is called every `mgr_tick_period` seconds), but mds knows about this client.
Maybe sending a "final" beacon to the monitor with the updated client list before transitioning to standby might work. I'm not sure. Any other approaches others can think of?
It's a little different. The mgr sends this beacon:
2020-01-28T18:10:26.877+0000 7fccea660700 20 mgr send_beacon active
2020-01-28T18:10:26.877+0000 7fccea660700 10 mgr send_beacon sending beacon as gid 7933
2020-01-28T18:10:26.881+0000 7fccea660700 4 mgr send_beacon going active, including 317 commands in beacon
Then it registers some client handles:
2020-01-28T18:10:26.905+0000 7fccea660700 7 mgr register_client registering msgr client handle v2:172.21.15.131:0/1320837425
...
2020-01-28T18:09:00.947+0000 7ff68a3c4700 7 mgr unregister_client unregistering msgr client handle v2:172.21.15.131:0/3910140031
...
2020-01-28T18:10:26.925+0000 7fccea660700 7 mgr register_client registering msgr client handle v2:172.21.15.131:0/416558227
...
2020-01-28T18:10:26.941+0000 7fccd75ba700 7 mgr register_client registering msgr client handle v2:172.21.15.131:0/4191323679
Then teuthology restarts the mgr:
2020-01-28T18:10:29.760 INFO:teuthology.orchestra.run.smithi131:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mgr -f --cluster ceph -i z
2020-01-28T18:10:29.763 INFO:tasks.ceph.mgr.z:Started
Then the mgr starts:
2020-01-28T18:10:29.837+0000 7f8790ea2ec0 0 ceph version 15.0.0-9869-g4b944a6 (4b944a6d8397907af1750fd52b641cbb82a57ba2) octopus (dev), process ceph-mgr, pid 16726
No chance to send the new beacon before restart.
Besides whitelisting the warning (which I don't want to do), I see two possible solutions here:
(1) Fire off a beacon whenever the mgr receives a fatal signal. For the MDS, we send the mons STATE_DNE so rapid failover occurs. I haven't yet looked at how the mgr/MgrMonitor works for this case. I think ideally we could send one last beacon to the monitor with the latest client list.
(2) Whenever the client instance list changes, send a beacon immediately.
I think (2) is troublesome, as you'd need to wire up a notification mechanism so that MgrStandby::send_beacon is called. (That method is not ideally placed, BTW: it is called even when the Mgr is active!) That would probably involve creating a dedicated beacon thread, like what we have for the MDS, which can wait on a condition variable with a timeout.
What do you think Venky?
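For reference, the shape of option (2) can be sketched roughly as below. This is a Python model of a dedicated beacon thread that waits on a condition variable with a timeout, not the mgr or MDS implementation; `BeaconThread`, `period_s`, and the `send` callback are hypothetical names. The thread wakes either when the client list changes or when the period expires, and sends a snapshot in both cases.

```python
import threading


class BeaconThread:
    """Hypothetical sketch: periodic beacon that also fires on changes."""

    def __init__(self, period_s, send):
        self._period_s = period_s
        self._send = send              # callback taking a frozenset of handles
        self._cond = threading.Condition()
        self._clients = set()
        self._dirty = False            # client list changed since last beacon
        self._stopping = False
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def register_client(self, addr):
        with self._cond:
            self._clients.add(addr)
            self._dirty = True
            self._cond.notify()        # wake the beacon thread immediately

    def stop(self):
        with self._cond:
            self._stopping = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                # Returns early if the predicate becomes true, otherwise
                # after one period; either way a beacon is due.
                self._cond.wait_for(
                    lambda: self._dirty or self._stopping,
                    timeout=self._period_s,
                )
                if self._stopping:
                    return
                self._dirty = False
                snapshot = frozenset(self._clients)
            # Send outside the lock so registrations are never blocked on I/O.
            self._send(snapshot)
```

A caller would construct it with a period and a send function, call `start()`, and call `register_client()` from whatever code path registers handles. The extra thread and the notification wiring are the cost mentioned above.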
Updated by Venky Shankar about 4 years ago
Patrick Donnelly wrote:
Venky Shankar wrote:
client: 172.21.15.131:0/4191323679 (cephfs instance), registers its addrs with ceph-mgr:
[...]
but before the active mgr instance can send its updated client list to the monitor, the manager transitions to standby:
[...]
Between the call to register and the transitioning to standby, mgr didn't get a chance to call `send_beacon()` (which is called every `mgr_tick_period` seconds), but mds knows about this client.
Maybe sending a "final" beacon to the monitor with the updated client list before transitioning to standby might work. I'm not sure. Any other approaches others can think of?
It's a little different. The mgr sends this beacon:
[...]
Then it registers some client handles:
[...]
Then teuthology restarts the mgr:
[...]
Then the mgr starts:
[...]
No chance to send the new beacon before restart.
Right, that's my understanding too -- maybe my description wasn't detailed.
Besides whitelisting the warning (which I don't want to do), I see two possible solutions here:
(1) Fire off a beacon whenever the mgr receives a fatal signal. For the MDS, we send the mons STATE_DNE so rapid failover occurs. I haven't yet looked at how the mgr/MgrMonitor works for this case. I think ideally we could send one last beacon to the monitor with the latest client list.
(2) Whenever the client instance list changes, send a beacon immediately.
I think (2) is troublesome, as you'd need to wire up a notification mechanism so that MgrStandby::send_beacon is called. (That method is not ideally placed, BTW: it is called even when the Mgr is active!) That would probably involve creating a dedicated beacon thread, like what we have for the MDS, which can wait on a condition variable with a timeout.
What do you think Venky?
I would prefer option (1) -- sending a final beacon before going down would be enough.
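A rough sketch of what option (1) amounts to, again as a Python model rather than the actual change: `SignalAwareMgr` and its `send` callback are hypothetical names, and the real mgr signal path may look quite different. On a fatal signal the handler sends one last beacon with the current client list, and it guards against sending it twice.

```python
import signal


class SignalAwareMgr:
    """Hypothetical sketch: send one final beacon when asked to shut down."""

    def __init__(self, send):
        self._send = send            # callback taking a frozenset of handles
        self.clients = set()
        self.shutting_down = False

    def register_client(self, addr):
        self.clients.add(addr)

    def handle_fatal_signal(self, signum, frame):
        # Guard against a second signal arriving during shutdown.
        if self.shutting_down:
            return
        self.shutting_down = True
        # Final beacon: the monitor learns about every handle registered
        # since the last periodic beacon.
        self._send(frozenset(self.clients))

    def install_signal_handlers(self):
        # Must be called from the main thread in Python.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.handle_fatal_signal)
```

Python runs signal handlers in the main thread between bytecode instructions, so doing work in the handler is acceptable here. A C++ daemon would generally need to hand the signal off to a normal thread before sending anything, since very little is safe to call from an asynchronous signal handler.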
Updated by Venky Shankar about 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 33272
Updated by Sage Weil about 4 years ago
/a/sage-2020-03-22_23:32:49-rados-wip-sage3-testing-2020-03-22-1327-distro-basic-smithi/4881104
Updated by Brad Hubbard almost 4 years ago
/a/teuthology-2020-04-26_07:01:02-rados-master-distro-basic-smithi/4986046
Updated by Brad Hubbard almost 4 years ago
/a/teuthology-2020-04-26_02:30:03-rados-octopus-distro-basic-smithi/4984936
Updated by Patrick Donnelly almost 4 years ago
- Target version changed from v15.0.0 to v16.0.0
Updated by Brad Hubbard almost 4 years ago
/a/yuriw-2020-05-24_19:30:40-rados-wip-yuri-master_5.24.20-distro-basic-smithi/5087753
Updated by Venky Shankar almost 4 years ago
- Pull request ID changed from 33272 to 35532
Updated by Patrick Donnelly almost 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus,nautilus
Updated by Nathan Cutler almost 4 years ago
- Copied to Backport #46199: octopus: qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds" added
Updated by Nathan Cutler almost 4 years ago
- Copied to Backport #46200: nautilus: qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds" added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".