Bug #23928
openqa: spurious cluster "[WRN] Manager daemon y is unresponsive. No standby daemons available." in cluster log
Description
During shutdown we sometimes see this:
2018-04-28T18:27:35.688 INFO:teuthology.misc:Shutting down mgr daemons...
2018-04-28T18:27:35.689 DEBUG:tasks.ceph.mgr.y:waiting for process to exit
2018-04-28T18:27:35.689 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:27:35.690 INFO:tasks.ceph.mgr.y.smithi080.stderr:2018-04-28 18:27:35.690 7f0ab5ffb700 -1 received signal: Terminated from /usr/bin/python /bin/daemon-helper term ceph-mgr -f --cluster ceph -i y (PID: 20319) UID: 0
2018-04-28T18:27:35.690 INFO:tasks.ceph.mgr.y.smithi080.stderr:2018-04-28 18:27:35.690 7f0ab5ffb700 -1 mgr handle_signal *** Got signal Terminated ***
2018-04-28T18:27:35.787 INFO:tasks.ceph.mgr.y:Stopped
2018-04-28T18:27:35.788 DEBUG:tasks.ceph.mgr.x:waiting for process to exit
2018-04-28T18:27:35.788 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:27:35.789 INFO:tasks.ceph.mgr.x.smithi047.stderr:2018-04-28 18:27:35.793 7f6151ffb700 -1 received signal: Terminated from /usr/bin/python /bin/daemon-helper term ceph-mgr -f --cluster ceph -i x (PID: 20150) UID: 0
2018-04-28T18:27:35.790 INFO:tasks.ceph.mgr.x.smithi047.stderr:2018-04-28 18:27:35.793 7f6151ffb700 -1 mgr handle_signal *** Got signal Terminated ***
2018-04-28T18:27:35.837 INFO:tasks.ceph.mgr.x:Stopped
2018-04-28T18:27:35.838 INFO:teuthology.misc:Shutting down mon daemons...
2018-04-28T18:27:35.838 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2018-04-28T18:27:35.838 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:27:35.874 INFO:tasks.ceph.mon.a.smithi080.stderr:2018-04-28 18:27:35.847 1545a700 -1 received signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a (PID: 20265) UID: 0
2018-04-28T18:27:35.878 INFO:tasks.ceph.mon.a.smithi080.stderr:2018-04-28 18:27:35.850 1545a700 -1 mon.a@1(peon) e1 *** Got Signal Terminated ***
2018-04-28T18:29:13.470 INFO:tasks.ceph.mon.a:Stopped
2018-04-28T18:29:13.471 DEBUG:tasks.ceph.mon.c:waiting for process to exit
2018-04-28T18:29:13.471 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:29:13.508 INFO:tasks.ceph.mon.c.smithi080.stderr:2018-04-28 18:29:13.480 1545a700 -1 received signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.c.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i c (PID: 20266) UID: 0
2018-04-28T18:29:13.512 INFO:tasks.ceph.mon.c.smithi080.stderr:2018-04-28 18:29:13.483 1545a700 -1 mon.c@2(peon) e1 *** Got Signal Terminated ***
2018-04-28T18:29:19.573 INFO:tasks.ceph.mon.c:Stopped
2018-04-28T18:29:19.574 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2018-04-28T18:29:19.574 INFO:teuthology.orchestra.run:waiting for 300
2018-04-28T18:29:19.598 INFO:tasks.ceph.mon.b.smithi047.stderr:2018-04-28 18:29:19.589 1545a700 -1 received signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.b.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i b (PID: 20142) UID: 0
2018-04-28T18:29:19.604 INFO:tasks.ceph.mon.b.smithi047.stderr:2018-04-28 18:29:19.592 1545a700 -1 mon.b@0(leader) e1 *** Got Signal Terminated ***
2018-04-28T18:29:25.676 INFO:tasks.ceph.mon.b:Stopped
(Note the odd ~2 minute turnaround for stopping mon.a)
Not a huge deal, and we could silence it with a log whitelist. But is there a better way to ignore this during shutdown?
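For reference, a whitelist suppression would be a teuthology override fragment of roughly this shape (the regex below is illustrative only, matched against the warning in the title, and is not an entry committed anywhere):

```yaml
# Hypothetical teuthology override fragment (not from this ticket).
# Entries under log-whitelist are regexes matched against cluster log lines;
# matching lines no longer fail the run.
overrides:
  ceph:
    log-whitelist:
      - Manager daemon .* is unresponsive
      - No standby daemons available
```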
Updated by John Spray almost 6 years ago
For final shutdown, it might be simpler to just kill the mons first, so that they're no longer there to complain about everyone else disappearing? Otherwise we're always going to have mons (rightly) recording the death of the other services.
Updated by Patrick Donnelly almost 6 years ago
John Spray wrote:
For final shutdown, it might be simpler to just kill the mons first, so that they're no longer there to complain about everyone else disappearing? Otherwise we're always going to have mons (rightly) recording the death of the other services.
It might also then be possible to whitelist OSD_DOWN in cephfs/overrides/whitelist_health.yaml. I like the idea.
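As a sketch, such an addition to whitelist_health.yaml might look like the following; the file path and the exact entry format follow the usual qa override convention, but this specific change is hypothetical:

```yaml
# Hypothetical addition to qa/cephfs/overrides/whitelist_health.yaml.
# Health warnings appear in the cluster log as "(OSD_DOWN)", so the
# code is whitelisted with escaped parentheses.
overrides:
  ceph:
    log-whitelist:
      - \(OSD_DOWN\)
```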