Bug #17944
closedmisc.py: stop_daemons_of_type did not stop them all (sometimes osds, sometimes mons)
0%
Description
2016-11-17T18:25:07.367 INFO:teuthology.misc:Shutting down osd daemons... 2016-11-17T18:25:07.367 DEBUG:tasks.ceph.osd.1:waiting for process to exit 2016-11-17T18:25:07.368 INFO:teuthology.orchestra.run:waiting for 300 2016-11-17T18:25:07.405 INFO:tasks.ceph.osd.1.smithi002.stderr:2016-11-17 18:25:07.374718 304f7700 -1 received signal: Terminated from PID: 23671 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.1.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 1 UID: 0 2016-11-17T18:25:07.405 INFO:tasks.ceph.osd.1.smithi002.stderr:2016-11-17 18:25:07.379027 304f7700 -1 osd.1 15 *** Got signal Terminated *** 2016-11-17T18:25:08.203 INFO:tasks.ceph.osd.1.smithi002.stderr:2016-11-17 18:25:08.190705 304f7700 -1 osd.1 15 shutdown 2016-11-17T18:25:13.368 INFO:tasks.ceph.osd.1:Stopped 2016-11-17T18:25:13.369 DEBUG:tasks.ceph.osd.0:waiting for process to exit 2016-11-17T18:25:13.369 INFO:teuthology.orchestra.run:waiting for 300 2016-11-17T18:25:13.409 INFO:tasks.ceph.osd.0.smithi002.stderr:2016-11-17 18:25:13.374977 30cf8700 -1 received signal: Terminated from PID: 23660 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.0.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 0 UID: 0 2016-11-17T18:25:13.409 INFO:tasks.ceph.osd.0.smithi002.stderr:2016-11-17 18:25:13.380205 30cf8700 -1 osd.0 18 *** Got signal Terminated *** 2016-11-17T18:25:14.206 INFO:tasks.ceph.osd.0.smithi002.stderr:2016-11-17 18:25:14.057377 30cf8700 -1 osd.0 18 shutdown 2016-11-17T18:25:19.369 INFO:tasks.ceph.osd.0:Stopped 2016-11-17T18:25:19.370 DEBUG:tasks.ceph.osd.3:waiting for process to exit 2016-11-17T18:25:19.370 INFO:teuthology.orchestra.run:waiting for 300 2016-11-17T18:25:19.398 INFO:tasks.ceph.osd.3.smithi031.stderr:2016-11-17 18:25:19.378217 3100a700 -1 received signal: Terminated from PID: 14913 task name: /usr/bin/python /usr/bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.3.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 3 UID: 0 2016-11-17T18:25:19.398 INFO:tasks.ceph.osd.3.smithi031.stderr:2016-11-17 18:25:19.383355 3100a700 -1 osd.3 21 *** Got signal Terminated *** 2016-11-17T18:25:20.631 INFO:tasks.ceph.osd.3.smithi031.stderr:2016-11-17 18:25:20.581761 3100a700 -1 osd.3 21 shutdown 2016-11-17T18:30:13.404 INFO:teuthology.misc:Shutting down mgr daemons...
on a job with osd's 0 through 5.
The subsequent unmount then fails because a daemon is still running.
/a/sage-2016-11-17_17:11:43-rados-wip-sage-testing---basic-smithi/556861
not sure if something is corrupting the ctx.daemons list? or iter_daemons_of_type is buggy?
Updated by Sage Weil over 7 years ago
http://tracker.ceph.com/issues/9308 is probably a dup
Updated by Sage Weil over 6 years ago
- Subject changed from misc.py: stop_daemons_of_type did not stop them all to misc.py: stop_daemons_of_type did not stop them all (sometimes osds, sometimes mons)
- Status changed from New to 12
Another case with mons:
2017-09-08T02:11:52.451 INFO:teuthology.misc:Shutting down mon daemons... 2017-09-08T02:11:52.452 DEBUG:tasks.ceph.mon.a:waiting for process to exit 2017-09-08T02:11:52.452 INFO:teuthology.orchestra.run:waiting for 300 2017-09-08T02:11:52.492 INFO:tasks.ceph.mon.a.smithi021.stderr:2017-09-08 02:11:52.464976 1997d700 -1 received signal: Terminated from PID: 50180 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a UID: 0 2017-09-08T02:11:52.493 INFO:tasks.ceph.mon.a.smithi021.stderr:2017-09-08 02:11:52.468160 1997d700 -1 mon.a@0(leader) e1 *** Got Signal Terminated *** 2017-09-08T02:16:51.550 INFO:tasks.ceph:Checking cluster log for badness...
note it only stops mon.a, not b or c..
but just after that,
2017-09-08T02:16:53.177 INFO:tasks.ceph:Archiving mon data... 2017-09-08T02:16:53.180 DEBUG:teuthology.misc:Transferring archived files from smithi021:/var/lib/ceph/mon/ceph-a to /home/teuthworker/archive/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991/data/mon.a.tgz 2017-09-08T02:16:53.181 INFO:teuthology.orchestra.run.smithi021:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'" 2017-09-08T02:16:53.280 INFO:teuthology.orchestra.run.smithi021.stdout:/tmp/tmp3WPFMy 2017-09-08T02:16:53.281 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo tar cz -f /tmp/tmp3WPFMy -C /var/lib/ceph/mon/ceph-a -- .' 2017-09-08T02:16:53.580 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo chmod 0666 /tmp/tmp3WPFMy' 2017-09-08T02:16:53.727 DEBUG:teuthology.orchestra.remote:smithi021:/tmp/tmp3WPFMy is 671KB 2017-09-08T02:16:53.847 INFO:teuthology.orchestra.run.smithi021:Running: 'rm -fr /tmp/tmp3WPFMy' 2017-09-08T02:16:53.921 DEBUG:teuthology.misc:Transferring archived files from smithi021:/var/lib/ceph/mon/ceph-c to /home/teuthworker/archive/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991/data/mon.c.tgz 2017-09-08T02:16:53.922 INFO:teuthology.orchestra.run.smithi021:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'" 2017-09-08T02:16:54.072 INFO:teuthology.orchestra.run.smithi021.stdout:/tmp/tmpuk1IVc 2017-09-08T02:16:54.072 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo tar cz -f /tmp/tmpuk1IVc -C /var/lib/ceph/mon/ceph-c -- .' 2017-09-08T02:16:54.324 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo chmod 0666 /tmp/tmpuk1IVc' 2017-09-08T02:16:54.468 DEBUG:teuthology.orchestra.remote:smithi021:/tmp/tmpuk1IVc is 676KB 2017-09-08T02:16:54.611 INFO:teuthology.orchestra.run.smithi021:Running: 'rm -fr /tmp/tmpuk1IVc' 2017-09-08T02:16:54.670 DEBUG:teuthology.misc:Transferring archived files from smithi133:/var/lib/ceph/mon/ceph-b to /home/teuthworker/archive/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991/data/mon.b.tgz 2017-09-08T02:16:54.671 INFO:teuthology.orchestra.run.smithi133:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'" 2017-09-08T02:16:54.767 INFO:teuthology.orchestra.run.smithi133.stdout:/tmp/tmp7p8WW3 2017-09-08T02:16:54.767 INFO:teuthology.orchestra.run.smithi133:Running: 'sudo tar cz -f /tmp/tmp7p8WW3 -C /var/lib/ceph/mon/ceph-b -- .' 2017-09-08T02:16:55.083 INFO:teuthology.orchestra.run.smithi133:Running: 'sudo chmod 0666 /tmp/tmp7p8WW3' 2017-09-08T02:16:55.211 DEBUG:teuthology.orchestra.remote:smithi133:/tmp/tmp7p8WW3 is 677KB 2017-09-08T02:16:55.353 INFO:teuthology.orchestra.run.smithi133:Running: 'rm -fr /tmp/tmp7p8WW3' 2017-09-08T02:16:55.420 INFO:tasks.ceph:Cleaning ceph cluster...
...it archives all 3 mons' data dirs.
/a/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991
Updated by Sage Weil about 5 years ago
Saw this again,
2019-04-16T22:06:09.543 INFO:tasks.mon_recovery:adding mon b back in 2019-04-16T22:06:09.543 INFO:tasks.ceph.mon.b:Restarting daemon 2019-04-16T22:06:09.544 INFO:teuthology.orchestra.run.smithi075:> cd /home/ubuntu/cephtest && sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --x ml=yes --xml-file=/var/log/ceph/valgrind/mon.b.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i b 2019-04-16T22:06:09.548 INFO:tasks.ceph.mon.b:Started ... 2019-04-16T22:06:31.687 INFO:tasks.ceph.mon.c:Stopped ... 2019-04-16T22:07:42.920 DEBUG:teuthology.run_tasks:Unwinding manager ceph ... 2019-04-16T22:09:20.682 DEBUG:tasks.ceph.mon.a:waiting for process to exit ... 2019-04-16T22:14:21.283 INFO:tasks.ceph:Archiving mon data...
mon.b and mon.c are never stopped!
then,
2019-04-16T22:15:00.947 INFO:tasks.ceph.mon.b.smithi075.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0-304-g6743d9d/rpm/el7/BUILD/ceph-14.2.0-304-g6743d9d/src/mon/HealthMonitor.cc: In function 'bool 2019-04-16T22:15:00.948 INFO:tasks.ceph.mon.b.smithi075.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0-304-g6743d9d/rpm/el7/BUILD/ceph-14.2.0-304-g6743d9d/src/mon/HealthMonitor.cc: 208: FAILED ceph_ 2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: ceph version 14.2.0-304-g6743d9d (6743d9d911728bba4484086348c1a236f7b3d7a8) nautilus (stable) 2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x5085d76] 2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5085f44] 2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: 3: (HealthMonitor::check_member_health()+0x1148) [0x459078] 2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 4: (HealthMonitor::tick()+0x52) [0x45a7d2] 2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 5: (Monitor::tick()+0xa0) [0x3c2570] 2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 6: (C_MonContext::finish(int)+0x39) [0x39c5c9] 2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 7: (Context::complete(int)+0x9) [0x3db4a9] 2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 8: (SafeTimer::timer_thread()+0x18d) [0x5119ead] 2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 9: (SafeTimerThread::entry()+0xd) [0x511b6dd] 2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 10: (()+0x7e25) [0x107a6e25] 2019-04-16T22:15:00.973 INFO:tasks.ceph.mon.b.smithi075.stderr: 11: (clone()+0x6d) [0x11704bad] 2019-04-16T22:15:00.976 INFO:tasks.ceph.mon.b.smithi075.stderr:*** Caught signal (Aborted) **
probably because their data directory is deleted out from underneath them?
Updated by Neha Ojha almost 5 years ago
/a/yuriw-2019-06-12_00:27:20-rados-wip-yuri-testing-2019-06-11-2200-nautilus-distro-basic-smithi/4023297/
Updated by Sage Weil almost 5 years ago
/a/sage-2019-06-21_06:56:55-rados-wip-sage-testing-2019-06-20-2155-distro-basic-smithi/4056168
Updated by Sage Weil almost 5 years ago
another symptom is
2019-07-11T20:45:40.897 INFO:teuthology.orchestra.run.smithi204:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip -- 2019-07-11T20:45:58.617 INFO:teuthology.orchestra.run.smithi204.stderr:gzip: /var/log/ceph/ceph-mon.b.log: file size changed while zipping 2019-07-11T20:45:59.048 INFO:teuthology.orchestra.run.smithi165.stderr:gzip: /var/log/ceph/ceph-mon.c.log: file size changed while zipping
/a/sage-2019-07-11_17:46:52-rados-wip-sage-testing-2019-07-11-1048-distro-basic-smithi/4110887
Updated by Sage Weil over 4 years ago
/a/sage-2019-09-28_14:19:28-rados-master-distro-basic-smithi/4342720
2019-09-28T15:31:30.912 INFO:teuthology.misc:Shutting down mon daemons... 2019-09-28T15:31:30.912 DEBUG:tasks.ceph.mon.a:waiting for process to exit 2019-09-28T15:31:30.912 INFO:teuthology.orchestra.run:waiting for 300 2019-09-28T15:31:31.052 INFO:tasks.ceph.mon.a.smithi074.stderr:2019-09-28T15:31:30.991+0000 16082700 -1 received signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a (PID: 12266) UID: 0 2019-09-28T15:31:31.054 INFO:tasks.ceph.mon.a.smithi074.stderr:2019-09-28T15:31:31.010+0000 16082700 -1 mon.a@0(leader) e1 *** Got Signal Terminated *** 2019-09-28T15:36:29.997 INFO:tasks.ceph:Checking cluster log for badness...
note the ~5 minute time difference in the last two lines
Updated by Sage Weil over 4 years ago
2019-09-28T22:04:09.181 INFO:teuthology.misc:Shutting down mon daemons... ([<teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18e7d50>, <teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18eb110>, <teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18e7750>]) 2019-09-28T22:04:09.181 INFO:teuthology.misc:Daemon <teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18e7d50> 2019-09-28T22:04:09.181 INFO:teuthology.misc:Stopping ceph.mon.a... 2019-09-28T22:04:09.182 DEBUG:tasks.ceph.mon.a:waiting for process to exit 2019-09-28T22:04:09.182 INFO:teuthology.orchestra.run:waiting for 300 2019-09-28T22:04:09.419 INFO:tasks.ceph.mon.a.smithi002.stderr:2019-09-28T22:04:09.365+0000 16082700 -1 received signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a (PID: 62151) UID: 0 2019-09-28T22:04:09.461 INFO:tasks.ceph.mon.a.smithi002.stderr:2019-09-28T22:04:09.434+0000 16082700 -1 mon.a@0(leader) e1 *** Got Signal Terminated *** 2019-09-28T22:09:08.261 INFO:tasks.ceph:Checking cluster log for badness...
with this patch
commit bf01518c967095ccb99d60c80e7da5b9d7487429 (HEAD -> refs/heads/bug-17944, refs/remotes/gh/bug-17944) Author: Sage Weil <sage@redhat.com> Date: Sat Sep 28 11:10:10 2019 -0500 misc: more verbose while shutting down daemons Hunting https://tracker.ceph.com/issues/17944 Signed-off-by: Sage Weil <sage@redhat.com> diff --git a/teuthology/misc.py b/teuthology/misc.py index 52c304c..54c692f 100644 --- a/teuthology/misc.py +++ b/teuthology/misc.py @@ -1240,16 +1240,20 @@ def stop_daemons_of_type(ctx, type_, cluster='ceph'): """ :param type_: type of daemons to be stopped. """ - log.info('Shutting down %s daemons...' % type_) exc_info = (None, None, None) - for daemon in ctx.daemons.iter_daemons_of_role(type_, cluster): + daemons = ctx.daemons.iter_daemons_of_role(type_, cluster) + log.info('Shutting down %s daemons... (%s)' % (type_, daemons)) + for daemon in daemons: + log.info('Daemon %s' % daemon) try: + log.info('Stopping %s.%s...' % (daemon.role, daemon.id_)) daemon.stop() except (CommandFailedError, CommandCrashedError, ConnectionLostError): exc_info = sys.exc_info() log.exception('Saw exception from %s.%s', daemon.role, daemon.id_) + log.info('Done shutting down %s daemons (%s)' % (type_, daemons)) if exc_info != (None, None, None): raise exc_info[0], exc_info[1], exc_info[2]
it looks like the daemon.stop() commadn is yielding or something? But AFAICS we're already in the finally: portion of the work, see ceph.py run_daemon():
try: yield finally: teuthology.stop_daemons_of_type(ctx, type_, cluster_name)
someone who has more of a clue about how @contextlib.contextmanager works needs to look at this
Updated by Sage Weil over 4 years ago
fyi, this reproduce 4 times out of 40 here: http://pulpito.ceph.com/sage-2019-09-28_20:08:27-rados:verify-master-distro-basic-smithi/
with this:
teuthology-suite -s rados/verify -c wip-read-hole-bypg -k distro -m smithi -p 40 --filter valgrind --subset 1/99 --teuthology-branch bug-17944 -N 10
Updated by Sage Weil over 4 years ago
/a/sage-2019-10-25_05:04:47-rados-wip-sage2-testing-2019-10-24-1944-distro-basic-smithi/4442713
Updated by Neha Ojha over 4 years ago
/a/yuriw-2019-11-08_00:57:06-rados-wip-yuri5-testing-2019-11-07-2203-nautilus-distro-basic-smithi/4481148/
Updated by Josh Durgin over 2 years ago
- Has duplicate Bug #43292: teuthology does not stop all daemons added
Updated by Josh Durgin over 2 years ago
- Status changed from New to Can't reproduce