Bug #43943

qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds"

Added by Patrick Donnelly 8 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
Pull request ID:
Crash signature:

Description

/a/sage-2020-01-28_03:52:05-rados-wip-sage2-testing-2020-01-27-1839-distro-basic-smithi/4713589
description: rados/mgr/{clusters/{2-node-mgr.yaml} debug/mgr.yaml objectstore/bluestore-bitmap.yaml
supported-random-distro$/{ubuntu_latest.yaml} tasks/module_selftest.yaml}

/a/sage-2020-01-30_22:27:29-rados-wip-sage-testing-2020-01-30-1230-distro-basic-smithi/4719492

Part 2 of #40867


Related issues

Related to fs - Bug #40867: mgr: failover during in qa testing causes unresponsive client warnings Resolved
Copied to fs - Backport #46199: octopus: qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds" Resolved
Copied to fs - Backport #46200: nautilus: qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds" Resolved

History

#1 Updated by Patrick Donnelly 8 months ago

  • Related to Bug #40867: mgr: failover during in qa testing causes unresponsive client warnings added

#2 Updated by Venky Shankar 8 months ago

client: 172.21.15.131:0/4191323679 (cephfs instance), registers its addrs with ceph-mgr:

2020-01-28T18:10:26.941+0000 7fccd75ba700  7 mgr register_client registering msgr client handle v2:172.21.15.131:0/4191323679

but before the active mgr instance can send its updated client list to the monitor, the manager transitions to standby:

2020-01-28T18:10:29.837+0000 7f8790ea2ec0  0 ceph version 15.0.0-9869-g4b944a6 (4b944a6d8397907af1750fd52b641cbb82a57ba2) octopus (dev), process ceph-mgr, pid 16726
....
2020-01-28T18:10:32.833+0000 7f8790ea2ec0 20 mgr send_beacon standby
2020-01-28T18:10:32.833+0000 7f8790ea2ec0 10 mgr send_beacon sending beacon as gid 8056

Between the call to register and the transition to standby, the mgr didn't get a chance to call `send_beacon()` (which is called every `mgr_tick_period` seconds), but the MDS knows about this client.

Maybe sending a "final" beacon to the monitor with the updated client list before transitioning to standby would work. I'm not sure. Any other approaches others can think of?

#3 Updated by Patrick Donnelly 8 months ago

Venky Shankar wrote:

client: 172.21.15.131:0/4191323679 (cephfs instance), registers its addrs with ceph-mgr:

[...]

but before the active mgr instance can send its updated client list to the monitor, the manager transitions to standby:

[...]

Between the call to register and the transition to standby, the mgr didn't get a chance to call `send_beacon()` (which is called every `mgr_tick_period` seconds), but the MDS knows about this client.

Maybe sending a "final" beacon to the monitor with the updated client list before transitioning to standby would work. I'm not sure. Any other approaches others can think of?

It's a little different. The mgr sends this beacon:

2020-01-28T18:10:26.877+0000 7fccea660700 20 mgr send_beacon active
2020-01-28T18:10:26.877+0000 7fccea660700 10 mgr send_beacon sending beacon as gid 7933
2020-01-28T18:10:26.881+0000 7fccea660700  4 mgr send_beacon going active, including 317 commands in beacon

Then it registers some client handles:

2020-01-28T18:10:26.905+0000 7fccea660700  7 mgr register_client registering msgr client handle v2:172.21.15.131:0/1320837425
...
2020-01-28T18:09:00.947+0000 7ff68a3c4700  7 mgr unregister_client unregistering msgr client handle v2:172.21.15.131:0/3910140031
...
2020-01-28T18:10:26.925+0000 7fccea660700  7 mgr register_client registering msgr client handle v2:172.21.15.131:0/416558227
...
2020-01-28T18:10:26.941+0000 7fccd75ba700  7 mgr register_client registering msgr client handle v2:172.21.15.131:0/4191323679

Then teuthology restarts the mgr:

2020-01-28T18:10:29.760 INFO:teuthology.orchestra.run.smithi131:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mgr -f --cluster ceph -i z
2020-01-28T18:10:29.763 INFO:tasks.ceph.mgr.z:Started

Then the mgr starts:

2020-01-28T18:10:29.837+0000 7f8790ea2ec0  0 ceph version 15.0.0-9869-g4b944a6 (4b944a6d8397907af1750fd52b641cbb82a57ba2) octopus (dev), process ceph-mgr, pid 16726

No chance to send the new beacon before restart.

Besides whitelisting the warning (which I don't want to do), I see two possible solutions here:

(1) Fire off a beacon whenever the mgr receives a fatal signal. For the MDS, we send the mons STATE_DNE so rapid failover occurs. I haven't yet looked at how the mgr/MgrMonitor works for this case. I think ideally we could send one last beacon to the monitor with the latest client list.

(2) Whenever the client instance list changes, send a beacon immediately.

I think (2) is troublesome, as you'd need to wire up a notification mechanism so MgrStandby::send_beacon is called. (That method is not ideally placed, BTW; it is called even when the Mgr is active!) That would probably involve creating a dedicated beacon thread, like what we have for the MDS, which can wait on a condition variable with a timeout.

What do you think Venky?

#4 Updated by Venky Shankar 8 months ago

Patrick Donnelly wrote:

Venky Shankar wrote:

client: 172.21.15.131:0/4191323679 (cephfs instance), registers its addrs with ceph-mgr:

[...]

but before the active mgr instance can send its updated client list to the monitor, the manager transitions to standby:

[...]

Between the call to register and the transition to standby, the mgr didn't get a chance to call `send_beacon()` (which is called every `mgr_tick_period` seconds), but the MDS knows about this client.

Maybe sending a "final" beacon to the monitor with the updated client list before transitioning to standby would work. I'm not sure. Any other approaches others can think of?

It's a little different. The mgr sends this beacon:

[...]

Then it registers some client handles:

[...]

Then teuthology restarts the mgr:

[...]

Then the mgr starts:

[...]

No chance to send the new beacon before restart.

Right, that's my understanding too -- maybe my description wasn't detailed enough.

Besides whitelisting the warning (which I don't want to do), I see two possible solutions here:

(1) Fire off a beacon whenever the mgr receives a fatal signal. For the MDS, we send the mons STATE_DNE so rapid failover occurs. I haven't yet looked at how the mgr/MgrMonitor works for this case. I think ideally we could send one last beacon to the monitor with the latest client list.

(2) Whenever the client instance list changes, send a beacon immediately.

I think (2) is troublesome, as you'd need to wire up a notification mechanism so MgrStandby::send_beacon is called. (That method is not ideally placed, BTW; it is called even when the Mgr is active!) That would probably involve creating a dedicated beacon thread, like what we have for the MDS, which can wait on a condition variable with a timeout.

What do you think Venky?

I would prefer option (1) -- sending a final beacon before going down would be enough.

#5 Updated by Venky Shankar 7 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 33272

#6 Updated by Sage Weil 6 months ago

/a/sage-2020-03-22_23:32:49-rados-wip-sage3-testing-2020-03-22-1327-distro-basic-smithi/4881104

#7 Updated by Brad Hubbard 5 months ago

/a/teuthology-2020-04-26_07:01:02-rados-master-distro-basic-smithi/4986046

#8 Updated by Brad Hubbard 5 months ago

/a/teuthology-2020-04-26_02:30:03-rados-octopus-distro-basic-smithi/4984936

#9 Updated by Patrick Donnelly 5 months ago

  • Target version changed from v15.0.0 to v16.0.0

#10 Updated by Brad Hubbard 4 months ago

/a/yuriw-2020-05-24_19:30:40-rados-wip-yuri-master_5.24.20-distro-basic-smithi/5087753

#11 Updated by Venky Shankar 3 months ago

  • Pull request ID changed from 33272 to 35532

#13 Updated by Patrick Donnelly 3 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to octopus,nautilus

#14 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #46199: octopus: qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds" added

#15 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #46200: nautilus: qa: "[WRN] evicting unresponsive client smithi131:z (6314), after 304.461 seconds" added

#16 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
