Bug #47689 (closed)

rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client

Added by Neha Ojha over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

"2020-09-29T12:20:24.346813+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi039:y (4771), after 304.592 seconds" in cluster log

/a/teuthology-2020-09-29_07:01:02-rados-master-distro-basic-smithi/5480802

Discussed this with Patrick and he is debugging this further in https://github.com/ceph/ceph/pull/37419


Related issues (1 open, 2 closed)

Related to RADOS - Bug #50275: nautilus: evicting unresponsive client (New)
Copied to CephFS - Backport #48285: octopus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client (Resolved, Nathan Cutler)
Copied to CephFS - Backport #48286: nautilus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client (Resolved, Patrick Donnelly)
Actions #1

Updated by Neha Ojha over 3 years ago

/a/teuthology-2020-09-30_07:01:02-rados-master-distro-basic-smithi/5483508/

Actions #2

Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to High

/a/teuthology-2020-10-01_07:01:02-rados-master-distro-basic-smithi/5485885

Actions #3

Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Target version set to v16.0.0
  • Source set to Q/A

This appears to be a fairly old failure. Here are a few instances:

https://pulpito.ceph.com/teuthology-2020-07-22_07:01:02-rados-master-distro-basic-smithi/5248885/
https://pulpito.ceph.com/nojha-2020-06-09_20:15:18-rados:upgrade:nautilus-x-singleton-master-distro-basic-smithi/

Unfortunately there are no mgr logs for those runs.

I built a branch wip-pdonnell-testing-20201001.200945 with https://github.com/ceph/ceph/pull/37419 and got a run here:

https://pulpito.ceph.com/pdonnell-2020-10-01_23:18:55-rados:upgrade:nautilus-x-singleton-wip-pdonnell-testing-20201001.200945-distro-basic-smithi/

It's very reproducible (but not 100%!!). For job

/ceph/teuthology-archive/pdonnell-2020-10-01_23:18:55-rados:upgrade:nautilus-x-singleton-wip-pdonnell-testing-20201001.200945-distro-basic-smithi/5487412

the evicted client is client.4775. The last thing we see from it in the mgr log is:

2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 tick
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 collect_and_send_metrics
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 collect_and_send_global_metrics
2020-10-01T23:37:08.331+0000 7f183caa6700  5 client.4775 collect_and_send_global_metrics: rank=0 does not support metrics
2020-10-01T23:37:08.331+0000 7f183caa6700 20 client.4775 trim_cache size 0 max 16384

That's 2 seconds before it's supposed to renew its caps. Then the mgr failed over:

2020-10-01T23:37:08.623+0000 7f186cf7d700 10 mgr ms_dispatch2 active mgrmap(e 6) v1
2020-10-01T23:37:08.623+0000 7f186cf7d700  4 mgr handle_mgr_map received map epoch 6
2020-10-01T23:37:08.623+0000 7f186cf7d700  4 mgr handle_mgr_map active in map: 1 active is 4649
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  e: 'ceph-mgr'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  0: 'ceph-mgr'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  1: '-f'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  2: '--cluster'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  3: 'ceph'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  4: '-i'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  5: 'x'
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2020-10-01T23:37:08.623+0000 7f186cf7d700  1 mgr respawn  exe_path /proc/self/exe
2020-10-01T23:37:08.655+0000 7f1f245dd200  0 ceph version 16.0.0-6089-gaf0d0003 (af0d0003260d1be1bd25c7a3d58968bfa9e2f522) pacific (dev), process ceph-mgr, pid 17937

That was due to the change in active modules. The mons blocklisted the old instance here:

2020-10-01T23:37:11.670+0000 7fcd30828700  4 mon.b@0(leader).mgr e6 beacon from 74108
2020-10-01T23:37:11.670+0000 7fcd30828700  4 mon.b@0(leader).mgr e6 Active daemon restart (mgr.x)
2020-10-01T23:37:11.670+0000 7fcd30828700  0 log_channel(cluster) log [INF] : Active manager daemon x restarted
2020-10-01T23:37:11.670+0000 7fcd30828700  1 -- v1:172.21.15.49:6789/0 --> v1:172.21.15.49:6789/0 -- log(1 entries from seq 122 at 2020-10-01T23:37:11.674534+0000) v1 -- 0x565056254000 con 0x565054c89400
2020-10-01T23:37:11.670+0000 7fcd30828700  5 mon.b@0(leader).mgr e6 blocklisting previous mgr.x.4649 (v1:172.21.15.137:6800/17937)
2020-10-01T23:37:11.670+0000 7fcd30828700 10 mon.b@0(leader).osd e36 blocklist v1:172.21.15.137:6800/17937 until 2020-10-02T23:37:11.674589+0000

From: /ceph/teuthology-archive/pdonnell-2020-10-01_23:18:55-rados:upgrade:nautilus-x-singleton-wip-pdonnell-testing-20201001.200945-distro-basic-smithi/5487412/remote/smithi049/log/ceph-mon.b.log.gz

The client sessions were never blocklisted (including the CephFS session), because the Nautilus mgrmap does not record the client sessions (that is an Octopus/Pacific feature). So I think the only thing to do here is whitelist this warning; there's no way to actually handle it.

I'll write up a fix...
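
For illustration, a rough sketch of the kind of whitelist entry a teuthology suite fragment can carry so this warning stops failing the run; the key name, pattern, and placement here are just a sketch, not the actual change:

overrides:
  ceph:
    # Hypothetical entry, not the merged fix. Older branches spell the
    # key "log-whitelist". Entries are matched against cluster log
    # WRN/ERR lines, and matching lines no longer fail the run.
    log-ignorelist:
      - evicting unresponsive client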

Actions #4

Updated by Patrick Donnelly over 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 37528
Actions #5

Updated by Patrick Donnelly over 3 years ago

  • Status changed from Fix Under Review to Resolved
Actions #6

Updated by Patrick Donnelly over 3 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to octopus,nautilus
Actions #8

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #48285: octopus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client added
Actions #9

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #48286: nautilus: rados/upgrade/nautilus-x-singleton fails due to cluster [WRN] evicting unresponsive client added
Actions #10

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #12

Updated by Deepika Upadhyay about 3 years ago

  • Related to Bug #50275: nautilus: evicting unresponsive client added