Bug #23928

open

qa: spurious cluster "[WRN] Manager daemon y is unresponsive. No standby daemons available." in cluster log

Added by Patrick Donnelly almost 6 years ago. Updated about 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: testing
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During shutdown we sometimes see this:

2018-04-28T18:27:35.688 INFO:teuthology.misc:Shutting down mgr daemons...
2018-04-28T18:27:35.689 DEBUG:tasks.ceph.mgr.y:waiting for process to exit
2018-04-28T18:27:35.689 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:27:35.690 INFO:tasks.ceph.mgr.y.smithi080.stderr:2018-04-28 18:27:35.690 7f0ab5ffb700 -1 received  signal: Terminated from /usr/bin/python /bin/daemon-helper term ceph-mgr -f --cluster ceph -i y  (PID: 20319) UID: 0
2018-04-28T18:27:35.690 INFO:tasks.ceph.mgr.y.smithi080.stderr:2018-04-28 18:27:35.690 7f0ab5ffb700 -1 mgr handle_signal *** Got signal Terminated ***
2018-04-28T18:27:35.787 INFO:tasks.ceph.mgr.y:Stopped
2018-04-28T18:27:35.788 DEBUG:tasks.ceph.mgr.x:waiting for process to exit
2018-04-28T18:27:35.788 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:27:35.789 INFO:tasks.ceph.mgr.x.smithi047.stderr:2018-04-28 18:27:35.793 7f6151ffb700 -1 received  signal: Terminated from /usr/bin/python /bin/daemon-helper term ceph-mgr -f --cluster ceph -i x  (PID: 20150) UID: 0
2018-04-28T18:27:35.790 INFO:tasks.ceph.mgr.x.smithi047.stderr:2018-04-28 18:27:35.793 7f6151ffb700 -1 mgr handle_signal *** Got signal Terminated ***
2018-04-28T18:27:35.837 INFO:tasks.ceph.mgr.x:Stopped
2018-04-28T18:27:35.838 INFO:teuthology.misc:Shutting down mon daemons...
2018-04-28T18:27:35.838 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2018-04-28T18:27:35.838 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:27:35.874 INFO:tasks.ceph.mon.a.smithi080.stderr:2018-04-28 18:27:35.847 1545a700 -1 received  signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a  (PID: 20265) UID: 0
2018-04-28T18:27:35.878 INFO:tasks.ceph.mon.a.smithi080.stderr:2018-04-28 18:27:35.850 1545a700 -1 mon.a@1(peon) e1 *** Got Signal Terminated ***
2018-04-28T18:29:13.470 INFO:tasks.ceph.mon.a:Stopped
2018-04-28T18:29:13.471 DEBUG:tasks.ceph.mon.c:waiting for process to exit
2018-04-28T18:29:13.471 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:29:13.508 INFO:tasks.ceph.mon.c.smithi080.stderr:2018-04-28 18:29:13.480 1545a700 -1 received  signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.c.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i c  (PID: 20266) UID: 0
2018-04-28T18:29:13.512 INFO:tasks.ceph.mon.c.smithi080.stderr:2018-04-28 18:29:13.483 1545a700 -1 mon.c@2(peon) e1 *** Got Signal Terminated ***
2018-04-28T18:29:19.573 INFO:tasks.ceph.mon.c:Stopped
2018-04-28T18:29:19.574 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2018-04-28T18:29:19.574 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:29:19.598 INFO:tasks.ceph.mon.b.smithi047.stderr:2018-04-28 18:29:19.589 1545a700 -1 received  signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.b.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i b  (PID: 20142) UID: 0
2018-04-28T18:29:19.604 INFO:tasks.ceph.mon.b.smithi047.stderr:2018-04-28 18:29:19.592 1545a700 -1 mon.b@0(leader) e1 *** Got Signal Terminated ***
2018-04-28T18:29:25.676 INFO:tasks.ceph.mon.b:Stopped

(Note the odd ~2 minute turnaround for stopping mon.a)

From: http://pulpito.ceph.com/pdonnell-2018-04-28_06:27:06-multimds-wip-pdonnell-testing-20180428.041811-testing-basic-smithi/2450419/
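To put a number on the turnaround noted above, here is a small sketch that computes how long each daemon took to stop. The timestamps are copied from the "waiting for process to exit" and "Stopped" lines in the log; the helper name stop_seconds is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"

# (start, end) pairs taken from the "waiting for process to exit" and
# "Stopped" lines of the teuthology log in the description.
events = {
    "mgr.y": ("2018-04-28T18:27:35.689", "2018-04-28T18:27:35.787"),
    "mgr.x": ("2018-04-28T18:27:35.788", "2018-04-28T18:27:35.837"),
    "mon.a": ("2018-04-28T18:27:35.838", "2018-04-28T18:29:13.470"),
    "mon.c": ("2018-04-28T18:29:13.471", "2018-04-28T18:29:19.573"),
    "mon.b": ("2018-04-28T18:29:19.574", "2018-04-28T18:29:25.676"),
}


def stop_seconds(start, end):
    """Hypothetical helper: seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


durations = {name: stop_seconds(*span) for name, span in events.items()}

for name, secs in durations.items():
    print(f"{name}: {secs:.3f}s")
```

By this measure mon.a took roughly 98 seconds to stop, against about 6 seconds each for mon.c and mon.b and well under a second for the two mgr daemons.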

Not a huge deal, and we could silence it with a log whitelist. But is there a better way to ignore this warning during shutdown?
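For reference, the whitelist idea amounts to something like the sketch below: scan the cluster log for warning or error lines and ignore any that match a whitelisted regex. This is not teuthology's actual implementation. The function name unexpected_lines, the regex, and the sample line layout are assumptions for illustration; only the warning text itself comes from this ticket.

```python
import re


def unexpected_lines(log_lines, whitelist):
    """Hypothetical helper: return [WRN]/[ERR] lines not covered by the whitelist."""
    patterns = [re.compile(p) for p in whitelist]
    flagged = []
    for line in log_lines:
        # Only warnings and errors are candidates for failing a run.
        if "[WRN]" not in line and "[ERR]" not in line:
            continue
        # A line matching any whitelisted pattern is ignored.
        if any(p.search(line) for p in patterns):
            continue
        flagged.append(line)
    return flagged


# Assumed whitelist entry covering the warning from this ticket.
whitelist = [r"Manager daemon \w+ is unresponsive"]

# Sample lines in an assumed, simplified layout.
sample = [
    "cluster [WRN] Manager daemon y is unresponsive. No standby daemons available.",
    "cluster [WRN] some other warning",
    "cluster [INF] informational",
]

remaining = unexpected_lines(sample, whitelist)
print(remaining)
```

The drawback is that a pattern like this also hides the warning when it shows up mid-test, where it might be a real problem, which is why a shutdown-specific approach would be preferable.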

Actions #1

Updated by John Spray almost 6 years ago

For final shutdown, it might be simpler to just kill the mons first, so that they're no longer there to complain about everyone else disappearing? Otherwise we're always going to have mons (rightly) recording the death of the other services.
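A rough sketch of that ordering, assuming daemon roles are named "type.id" as in the log above. The function shutdown_order is hypothetical and is not how teuthology currently sequences shutdown.

```python
def shutdown_order(roles, first=("mon",)):
    """Hypothetical helper: order roles so the daemon types in `first` stop first."""

    def rank(role):
        kind = role.split(".", 1)[0]
        # Types listed in `first` keep their listed priority; the rest follow.
        return first.index(kind) if kind in first else len(first)

    # sorted() is stable, so roles of equal rank keep their original order.
    return sorted(roles, key=rank)


# Roles in the order the log above shows them being stopped.
roles = ["mgr.y", "mgr.x", "mon.a", "mon.c", "mon.b"]
order = shutdown_order(roles)
print(order)
```

With the mons gone first, nothing is left to log a warning when the mgr daemons exit.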

Actions #2

Updated by Patrick Donnelly almost 6 years ago

  • Description updated (diff)

format fix in desc

Actions #3

Updated by Patrick Donnelly almost 6 years ago

John Spray wrote:

For final shutdown, it might be simpler to just kill the mons first, so that they're no longer there to complain about everyone else disappearing? Otherwise we're always going to have mons (rightly) recording the death of the other services.

It might then also be possible to drop the OSD_DOWN whitelist entry in cephfs/overrides/whitelist_health.yaml. I like the idea.

Actions #4

Updated by Sebastian Wagner about 5 years ago

Is this still reproducible?
