Bug #47689
rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client
Description
"2020-09-29T12:20:24.346813+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi039:y (4771), after 304.592 seconds" in cluster log
/a/teuthology-2020-09-29_07:01:02-rados-master-distro-basic-smithi/5480802
Discussed this with Patrick and he is debugging this further in https://github.com/ceph/ceph/pull/37419
History
#1 Updated by Neha Ojha almost 3 years ago
/a/teuthology-2020-09-30_07:01:02-rados-master-distro-basic-smithi/5483508/
#2 Updated by Neha Ojha almost 3 years ago
- Priority changed from Normal to High
/a/teuthology-2020-10-01_07:01:02-rados-master-distro-basic-smithi/5485885
#3 Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v16.0.0
- Source set to Q/A
This appears to be a fairly old failure. Here are a few instances:
https://pulpito.ceph.com/teuthology-2020-07-22_07:01:02-rados-master-distro-basic-smithi/5248885/
https://pulpito.ceph.com/nojha-2020-06-09_20:15:18-rados:upgrade:nautilus-x-singleton-master-distro-basic-smithi/
Unfortunately there are no mgr logs for those.
I built a branch wip-pdonnell-testing-20201001.200945 with https://github.com/ceph/ceph/pull/37419 and got a run here:
It's very reproducible (but not 100%!!). For job
/ceph/teuthology-archive/pdonnell-2020-10-01_23:18:55-rados:upgrade:nautilus-x-singleton-wip-pdonnell-testing-20201001.200945-distro-basic-smithi/5487412
the evicted client is client.4775. The last thing we see from it in the mgr log is:
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 tick
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 collect_and_send_metrics
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 collect_and_send_global_metrics
2020-10-01T23:37:08.331+0000 7f183caa6700  5 client.4775 collect_and_send_global_metrics: rank=0 does not support metrics
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 trim_cache size 0 max 16384
That is 2 seconds before it's supposed to renew its caps. Then the mgr failed over:
2020-10-01T23:37:08.623+0000 7f186cf7d700 10 mgr ms_dispatch2 active mgrmap(e 6) v1
2020-10-01T23:37:08.623+0000 7f186cf7d700  4 mgr handle_mgr_map received map epoch 6
2020-10-01T23:37:08.623+0000 7f186cf7d700  4 mgr handle_mgr_map active in map: 1 active is 4649
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  e: 'ceph-mgr'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  0: 'ceph-mgr'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  1: '-f'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  2: '--cluster'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  3: 'ceph'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  4: '-i'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  5: 'x'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  exe_path /proc/self/exe
2020-10-01T23:37:08.655+0000 7f1f245dd200  0 ceph version 16.0.0-6089-gaf0d0003 (af0d0003260d1be1bd25c7a3d58968bfa9e2f522) pacific (dev), process ceph-mgr, pid 17937
That was due to the change in active modules. The mons blocklisted the old instance here:
2020-10-01T23:37:11.670+0000 7fcd30828700  4 mon.b@0(leader).mgr e6 beacon from 74108
2020-10-01T23:37:11.670+0000 7fcd30828700  4 mon.b@0(leader).mgr e6 Active daemon restart (mgr.x)
2020-10-01T23:37:11.670+0000 7fcd30828700  0 log_channel(cluster) log [INF] : Active manager daemon x restarted
2020-10-01T23:37:11.670+0000 7fcd30828700  1 -- v1:172.21.15.49:6789/0 --> v1:172.21.15.49:6789/0 -- log(1 entries from seq 122 at 2020-10-01T23:37:11.674534+0000) v1 -- 0x565056254000 con 0x565054c89400
2020-10-01T23:37:11.670+0000 7fcd30828700  5 mon.b@0(leader).mgr e6 blocklisting previous mgr.x.4649 (v1:172.21.15.137:6800/17937)
2020-10-01T23:37:11.670+0000 7fcd30828700 10 mon.b@0(leader).osd e36 blocklist v1:172.21.15.137:6800/17937 until 2020-10-02T23:37:11.674589+0000
From: /ceph/teuthology-archive/pdonnell-2020-10-01_23:18:55-rados:upgrade:nautilus-x-singleton-wip-pdonnell-testing-20201001.200945-distro-basic-smithi/5487412/remote/smithi049/log/ceph-mon.b.log.gz
The client sessions were never blocklisted (including the CephFS session). This is because the Nautilus mgrmap does not record the client sessions (that is an Octopus/Pacific feature), so the mons have no client addresses to blocklist when the old mgr instance is replaced. I think the only thing to do here is ignore this warning; there's no way to handle it.
I'll write up a fix...
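For context, teuthology suites usually suppress an expected cluster-log warning by listing a matching substring in the suite's job YAML. Below is a minimal sketch of what such an entry could look like for this suite; the key name (`log-ignorelist` vs. `log-whitelist`, depending on the branch) and placement are assumptions for illustration, not a quote of the actual change in PR 37528.

    # Hypothetical sketch, not the literal contents of PR 37528.
    # Tolerate the expected MDS eviction warning during the nautilus-x upgrade,
    # since the pre-Octopus mgr's client sessions cannot be blocklisted on failover.
    overrides:
      ceph:
        log-ignorelist:        # older branches use "log-whitelist"
          - evicting unresponsive client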
#4 Updated by Patrick Donnelly almost 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 37528
#5 Updated by Patrick Donnelly almost 3 years ago
- Status changed from Fix Under Review to Resolved
#6 Updated by Patrick Donnelly almost 3 years ago
- Status changed from Resolved to Pending Backport
- Backport set to octopus,nautilus
#7 Updated by Patrick Donnelly almost 3 years ago
Marked for backport because of these failures: https://pulpito.ceph.com/teuthology-2020-11-17_03:29:52-upgrade:nautilus-p2p-nautilus-distro-basic-smithi/
#8 Updated by Nathan Cutler almost 3 years ago
- Copied to Backport #48285: octopus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client added
#9 Updated by Nathan Cutler almost 3 years ago
- Copied to Backport #48286: nautilus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client added
#10 Updated by Nathan Cutler over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
#11 Updated by Deepika Upadhyay over 2 years ago
looks similar: https://tracker.ceph.com/issues/50275
#12 Updated by Deepika Upadhyay over 2 years ago
- Related to Bug #50275: nautilus: evicting unresponsive client added