Bug #64988: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds" - CephFS - Ceph

Added by Patrick Donnelly about 2 months ago. Updated 29 days ago.

Status: Resolved
Priority: High
Assignee: Patrick Donnelly
Category: Testing
Target version: Ceph - v20.0.0
Source: Q/A
Tags: backport_processed
Backport: squid,reef
Severity: 3 - minor
Labels (FS): qa-failure
Pull request ID: 56354

Description

https://pulpito.ceph.com/pdonnell-2024-03-19_04:56:42-fs-wip-batrick-testing-20240318.181317-distro-default-smithi/7610533/

and many others in that run

Related issues 3 (1 open — 2 closed)

Updated by Patrick Donnelly about 2 months ago

Related to Bug #64985: qa: mgr logs do not include client debugging added

Updated by Patrick Donnelly about 1 month ago

Status changed from New to In Progress
Assignee set to Patrick Donnelly

Okay, so as expected this is a non-issue:

2024-03-20T18:59:44.324+0000 7ff1adba6700  1 -- 172.21.15.42:0/4057698876 <== mon.0 v2:172.21.15.42:3300/0 2621 ==== mgrmap(e 19) ==== 137871+0+0 (secure 0 0 0) 0x55bdef6bef00 con 0x55bdec7ec400
2024-03-20T18:59:44.324+0000 7ff1adba6700 10 mgr ms_dispatch2 active mgrmap(e 19)
2024-03-20T18:59:44.324+0000 7ff1adba6700  4 mgr handle_mgr_map received map epoch 19
2024-03-20T18:59:44.324+0000 7ff1adba6700  4 mgr handle_mgr_map active in map: 1 active is 14150
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr handle_mgr_map respawning because set of enabled modules changed!
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  e: '/usr/bin/ceph-mgr'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  0: '/usr/bin/ceph-mgr'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  1: '-n'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  2: 'mgr.x'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  3: '-f'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  4: '--setuser'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  5: 'ceph'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  6: '--setgroup'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  7: 'ceph'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  8: '--default-log-to-file=false'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  9: '--default-log-to-journald=true'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  10: '--default-log-to-stderr=false'
2024-03-20T18:59:44.325+0000 7ff1adba6700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2024-03-20T18:59:44.325+0000 7ff1adba6700  1 mgr respawn  exe_path /proc/self/exe

/teuthology/pdonnell-2024-03-20_18:16:52-fs-wip-batrick-testing-20240320.145742-distro-default-smithi/7612921/remote/smithi042/log/6efffee4-e6ea-11ee-95c9-87774f69a715/ceph-mgr.x.log.gz

The mgr modules changed so it rebooted and the client instance got evicted.

I'll work on a fix.
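
For triaging other jobs in the run, the respawn can be spotted mechanically. The following is a minimal sketch, not part of the fix: it assumes the mgr log line format shown in the excerpt above, and the name `find_mgr_respawns` is hypothetical.

```python
import re
from datetime import datetime

# Matches mgr log lines of the form seen in the excerpt above, e.g.:
#   2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr handle_mgr_map respawning because set of enabled modules changed!
# The layout (timestamp, thread id, debug level, message) is assumed from that excerpt.
RESPAWN_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>[0-9a-f]+)\s+(?P<level>\d+)\s+"
    r"mgr handle_mgr_map respawning because (?P<reason>.+?)!?$"
)


def find_mgr_respawns(lines):
    """Return a list of (timestamp, reason) tuples, one per respawn line."""
    hits = []
    for line in lines:
        match = RESPAWN_RE.match(line.strip())
        if match:
            # Timestamps carry millisecond precision and a numeric UTC offset.
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f%z")
            hits.append((ts, match.group("reason")))
    return hits
```

Feeding it the decompressed lines of a `ceph-mgr.x.log.gz` file yields the respawn timestamps, which can then be compared by hand against the time of the "evicting unresponsive client" warning in the cluster log.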

Updated by Patrick Donnelly about 1 month ago

Status changed from In Progress to Fix Under Review
Pull request ID set to 56354

Updated by Greg Farnum about 1 month ago

> The mgr modules changed so it rebooted and the client instance got evicted.

o_0

Shouldn’t we do a polite unmount when rebooting? Leaving a hanging client session from the manager seems real bad…
I guess when the monitor fails it over, it does a blocklist entry so the mds cleans up faster? Otherwise there’d be disasters there, too.

Updated by Patrick Donnelly about 1 month ago

Greg Farnum wrote:

> > The mgr modules changed so it rebooted and the client instance got evicted.
>
> o_0
>
> Shouldn’t we do a polite unmount when rebooting? Leaving a hanging client session from the manager seems real bad…
> I guess when the monitor fails it over, it does a blocklist entry so the mds cleans up faster? Otherwise there’d be disasters there, too.

It's not really a big deal and is unlikely to happen in production. Again, it only happens when a failover occurs between when the session is established and when the beacon with the client addr is sent to the mons. The mgr doesn't do anything with the mount until it has acknowledgement**.

** Actually, that is only true after https://github.com/ceph/ceph/pull/51169 is merged. See:

https://github.com/ceph/ceph/pull/51169/files#diff-50ab66411d9293d402a15e00ed6843a4d37889c616873e69534e609c210f72ec
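
The window described above can be written down as a toy timeline check. This is only an illustrative model of the ordering condition, not Ceph code: the class `MgrClientTimeline`, its field names, and `in_race_window` are all hypothetical, and the times are arbitrary seconds on a shared clock.

```python
from dataclasses import dataclass


@dataclass
class MgrClientTimeline:
    """Hypothetical event times (seconds) for one mgr CephFS client instance."""
    session_established: float  # client session opened with the MDS
    beacon_with_addr: float     # beacon carrying the client addr sent to the mons
    failover: float             # mgr failed over or respawned


def in_race_window(timeline: MgrClientTimeline) -> bool:
    """True if the failover lands after the session opens but before the
    mons have been told the client addr, i.e. the case described above."""
    return (
        timeline.session_established
        <= timeline.failover
        < timeline.beacon_with_addr
    )
```

For example, `in_race_window(MgrClientTimeline(10.0, 12.0, 11.0))` is true, while a failover at `13.0`, after the beacon, falls outside the window.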

Updated by Patrick Donnelly about 1 month ago

Status changed from Fix Under Review to Pending Backport

Updated by Backport Bot about 1 month ago

Copied to Backport #65092: reef: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds" added

Updated by Backport Bot about 1 month ago

Copied to Backport #65093: squid: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds" added
