Bug #17944: misc.py: stop_daemons_of_type did not stop them all (sometimes osds, sometimes mons) - teuthology - Ceph

Actions

Copy link

Bug #17944

closed

misc.py: stop_daemons_of_type did not stop them all (sometimes osds, sometimes mons)

Added by Sage Weil over 7 years ago. Updated over 2 years ago.

Status:

Can't reproduce

Priority:

Urgent

Assignee:

Category:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Crash signature (v1):

Crash signature (v2):

Description

2016-11-17T18:25:07.367 INFO:teuthology.misc:Shutting down osd daemons...
2016-11-17T18:25:07.367 DEBUG:tasks.ceph.osd.1:waiting for process to exit
2016-11-17T18:25:07.368 INFO:teuthology.orchestra.run:waiting for 300
2016-11-17T18:25:07.405 INFO:tasks.ceph.osd.1.smithi002.stderr:2016-11-17 18:25:07.374718 304f7700 -1 received  signal: Terminated from  PID: 23671 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.1.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 1  UID: 0
2016-11-17T18:25:07.405 INFO:tasks.ceph.osd.1.smithi002.stderr:2016-11-17 18:25:07.379027 304f7700 -1 osd.1 15 *** Got signal Terminated ***
2016-11-17T18:25:08.203 INFO:tasks.ceph.osd.1.smithi002.stderr:2016-11-17 18:25:08.190705 304f7700 -1 osd.1 15 shutdown
2016-11-17T18:25:13.368 INFO:tasks.ceph.osd.1:Stopped
2016-11-17T18:25:13.369 DEBUG:tasks.ceph.osd.0:waiting for process to exit
2016-11-17T18:25:13.369 INFO:teuthology.orchestra.run:waiting for 300
2016-11-17T18:25:13.409 INFO:tasks.ceph.osd.0.smithi002.stderr:2016-11-17 18:25:13.374977 30cf8700 -1 received  signal: Terminated from  PID: 23660 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.0.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 0  UID: 0
2016-11-17T18:25:13.409 INFO:tasks.ceph.osd.0.smithi002.stderr:2016-11-17 18:25:13.380205 30cf8700 -1 osd.0 18 *** Got signal Terminated ***
2016-11-17T18:25:14.206 INFO:tasks.ceph.osd.0.smithi002.stderr:2016-11-17 18:25:14.057377 30cf8700 -1 osd.0 18 shutdown
2016-11-17T18:25:19.369 INFO:tasks.ceph.osd.0:Stopped
2016-11-17T18:25:19.370 DEBUG:tasks.ceph.osd.3:waiting for process to exit
2016-11-17T18:25:19.370 INFO:teuthology.orchestra.run:waiting for 300
2016-11-17T18:25:19.398 INFO:tasks.ceph.osd.3.smithi031.stderr:2016-11-17 18:25:19.378217 3100a700 -1 received  signal: Terminated from  PID: 14913 task name: /usr/bin/python /usr/bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.3.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 3  UID: 0
2016-11-17T18:25:19.398 INFO:tasks.ceph.osd.3.smithi031.stderr:2016-11-17 18:25:19.383355 3100a700 -1 osd.3 21 *** Got signal Terminated ***
2016-11-17T18:25:20.631 INFO:tasks.ceph.osd.3.smithi031.stderr:2016-11-17 18:25:20.581761 3100a700 -1 osd.3 21 shutdown
2016-11-17T18:30:13.404 INFO:teuthology.misc:Shutting down mgr daemons...

on a job with osd's 0 through 5.

The subsequent unmount then fails because a daemon is still running.

/a/sage-2016-11-17_17:11:43-rados-wip-sage-testing---basic-smithi/556861

not sure if something is corrupting the ctx.daemons list? or iter_daemons_of_type is buggy?

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Sage Weil over 7 years ago

http://tracker.ceph.com/issues/9308 is probably a dup

Actions

Copy link

Updated by Sage Weil over 6 years ago

Subject changed from misc.py: stop_daemons_of_type did not stop them all to misc.py: stop_daemons_of_type did not stop them all (sometimes osds, sometimes mons)
Status changed from New to 12

Another case with mons:

2017-09-08T02:11:52.451 INFO:teuthology.misc:Shutting down mon daemons...
2017-09-08T02:11:52.452 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2017-09-08T02:11:52.452 INFO:teuthology.orchestra.run:waiting for 300
2017-09-08T02:11:52.492 INFO:tasks.ceph.mon.a.smithi021.stderr:2017-09-08 02:11:52.464976 1997d700 -1 received  signal: Terminated from  PID: 50180 task name: /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a  UID: 0
2017-09-08T02:11:52.493 INFO:tasks.ceph.mon.a.smithi021.stderr:2017-09-08 02:11:52.468160 1997d700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2017-09-08T02:16:51.550 INFO:tasks.ceph:Checking cluster log for badness...

note it only stops mon.a, not b or c..

but just after that,

2017-09-08T02:16:53.177 INFO:tasks.ceph:Archiving mon data...
2017-09-08T02:16:53.180 DEBUG:teuthology.misc:Transferring archived files from smithi021:/var/lib/ceph/mon/ceph-a to /home/teuthworker/archive/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991/data/mon.a.tgz
2017-09-08T02:16:53.181 INFO:teuthology.orchestra.run.smithi021:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'" 
2017-09-08T02:16:53.280 INFO:teuthology.orchestra.run.smithi021.stdout:/tmp/tmp3WPFMy
2017-09-08T02:16:53.281 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo tar cz -f /tmp/tmp3WPFMy -C /var/lib/ceph/mon/ceph-a -- .'
2017-09-08T02:16:53.580 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo chmod 0666 /tmp/tmp3WPFMy'
2017-09-08T02:16:53.727 DEBUG:teuthology.orchestra.remote:smithi021:/tmp/tmp3WPFMy is 671KB
2017-09-08T02:16:53.847 INFO:teuthology.orchestra.run.smithi021:Running: 'rm -fr /tmp/tmp3WPFMy'
2017-09-08T02:16:53.921 DEBUG:teuthology.misc:Transferring archived files from smithi021:/var/lib/ceph/mon/ceph-c to /home/teuthworker/archive/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991/data/mon.c.tgz
2017-09-08T02:16:53.922 INFO:teuthology.orchestra.run.smithi021:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'" 
2017-09-08T02:16:54.072 INFO:teuthology.orchestra.run.smithi021.stdout:/tmp/tmpuk1IVc
2017-09-08T02:16:54.072 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo tar cz -f /tmp/tmpuk1IVc -C /var/lib/ceph/mon/ceph-c -- .'
2017-09-08T02:16:54.324 INFO:teuthology.orchestra.run.smithi021:Running: 'sudo chmod 0666 /tmp/tmpuk1IVc'
2017-09-08T02:16:54.468 DEBUG:teuthology.orchestra.remote:smithi021:/tmp/tmpuk1IVc is 676KB
2017-09-08T02:16:54.611 INFO:teuthology.orchestra.run.smithi021:Running: 'rm -fr /tmp/tmpuk1IVc'
2017-09-08T02:16:54.670 DEBUG:teuthology.misc:Transferring archived files from smithi133:/var/lib/ceph/mon/ceph-b to /home/teuthworker/archive/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991/data/mon.b.tgz
2017-09-08T02:16:54.671 INFO:teuthology.orchestra.run.smithi133:Running: "python -c 'import os; import tempfile; import sys;(fd,fname) = tempfile.mkstemp();os.close(fd);sys.stdout.write(fname.rstrip());sys.stdout.flush()'" 
2017-09-08T02:16:54.767 INFO:teuthology.orchestra.run.smithi133.stdout:/tmp/tmp7p8WW3
2017-09-08T02:16:54.767 INFO:teuthology.orchestra.run.smithi133:Running: 'sudo tar cz -f /tmp/tmp7p8WW3 -C /var/lib/ceph/mon/ceph-b -- .'
2017-09-08T02:16:55.083 INFO:teuthology.orchestra.run.smithi133:Running: 'sudo chmod 0666 /tmp/tmp7p8WW3'
2017-09-08T02:16:55.211 DEBUG:teuthology.orchestra.remote:smithi133:/tmp/tmp7p8WW3 is 677KB
2017-09-08T02:16:55.353 INFO:teuthology.orchestra.run.smithi133:Running: 'rm -fr /tmp/tmp7p8WW3'
2017-09-08T02:16:55.420 INFO:tasks.ceph:Cleaning ceph cluster...

...it archives all 3 mons' data dirs.

/a/sage-2017-09-08_01:07:55-rados-wip-sage-testing2-2017-09-07-1650-distro-basic-smithi/1607991

Actions

Copy link

Updated by Sage Weil about 5 years ago

Saw this again,

2019-04-16T22:06:09.543 INFO:tasks.mon_recovery:adding mon b back in
2019-04-16T22:06:09.543 INFO:tasks.ceph.mon.b:Restarting daemon
2019-04-16T22:06:09.544 INFO:teuthology.orchestra.run.smithi075:> cd /home/ubuntu/cephtest && sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --x
ml=yes --xml-file=/var/log/ceph/valgrind/mon.b.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i b
2019-04-16T22:06:09.548 INFO:tasks.ceph.mon.b:Started
...
2019-04-16T22:06:31.687 INFO:tasks.ceph.mon.c:Stopped
...
2019-04-16T22:07:42.920 DEBUG:teuthology.run_tasks:Unwinding manager ceph
...
2019-04-16T22:09:20.682 DEBUG:tasks.ceph.mon.a:waiting for process to exit
...
2019-04-16T22:14:21.283 INFO:tasks.ceph:Archiving mon data...

mon.b and mon.c are never stopped!

then,

2019-04-16T22:15:00.947 INFO:tasks.ceph.mon.b.smithi075.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0-304-g6743d9d/rpm/el7/BUILD/ceph-14.2.0-304-g6743d9d/src/mon/HealthMonitor.cc: In function 'bool
2019-04-16T22:15:00.948 INFO:tasks.ceph.mon.b.smithi075.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0-304-g6743d9d/rpm/el7/BUILD/ceph-14.2.0-304-g6743d9d/src/mon/HealthMonitor.cc: 208: FAILED ceph_
2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: ceph version 14.2.0-304-g6743d9d (6743d9d911728bba4484086348c1a236f7b3d7a8) nautilus (stable)
2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x5085d76]
2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5085f44]
2019-04-16T22:15:00.971 INFO:tasks.ceph.mon.b.smithi075.stderr: 3: (HealthMonitor::check_member_health()+0x1148) [0x459078]
2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 4: (HealthMonitor::tick()+0x52) [0x45a7d2]
2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 5: (Monitor::tick()+0xa0) [0x3c2570]
2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 6: (C_MonContext::finish(int)+0x39) [0x39c5c9]
2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 7: (Context::complete(int)+0x9) [0x3db4a9]
2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 8: (SafeTimer::timer_thread()+0x18d) [0x5119ead]
2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 9: (SafeTimerThread::entry()+0xd) [0x511b6dd]
2019-04-16T22:15:00.972 INFO:tasks.ceph.mon.b.smithi075.stderr: 10: (()+0x7e25) [0x107a6e25]
2019-04-16T22:15:00.973 INFO:tasks.ceph.mon.b.smithi075.stderr: 11: (clone()+0x6d) [0x11704bad]
2019-04-16T22:15:00.976 INFO:tasks.ceph.mon.b.smithi075.stderr:*** Caught signal (Aborted) **

probably because their data directory is deleted out from underneath them?

Actions

Copy link

Updated by Neha Ojha almost 5 years ago

/a/yuriw-2019-06-12_00:27:20-rados-wip-yuri-testing-2019-06-11-2200-nautilus-distro-basic-smithi/4023297/

Actions

Copy link

Updated by Sage Weil almost 5 years ago

/a/sage-2019-06-21_06:56:55-rados-wip-sage-testing-2019-06-20-2155-distro-basic-smithi/4056168

Actions

Copy link

Updated by Sage Weil almost 5 years ago

another symptom is

2019-07-11T20:45:40.897 INFO:teuthology.orchestra.run.smithi204:> sudo find /var/log/ceph -name '*.log' -print0 | sudo xargs -0 --no-run-if-empty -- gzip --
2019-07-11T20:45:58.617 INFO:teuthology.orchestra.run.smithi204.stderr:gzip: /var/log/ceph/ceph-mon.b.log: file size changed while zipping
2019-07-11T20:45:59.048 INFO:teuthology.orchestra.run.smithi165.stderr:gzip: /var/log/ceph/ceph-mon.c.log: file size changed while zipping

/a/sage-2019-07-11_17:46:52-rados-wip-sage-testing-2019-07-11-1048-distro-basic-smithi/4110887

Actions

Copy link

Updated by Sage Weil over 4 years ago

/a/sage-2019-09-28_14:19:28-rados-master-distro-basic-smithi/4342720

2019-09-28T15:31:30.912 INFO:teuthology.misc:Shutting down mon daemons...
2019-09-28T15:31:30.912 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2019-09-28T15:31:30.912 INFO:teuthology.orchestra.run:waiting for 300
2019-09-28T15:31:31.052 INFO:tasks.ceph.mon.a.smithi074.stderr:2019-09-28T15:31:30.991+0000 16082700 -1 received  signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a  (PID: 12266) UID: 0
2019-09-28T15:31:31.054 INFO:tasks.ceph.mon.a.smithi074.stderr:2019-09-28T15:31:31.010+0000 16082700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2019-09-28T15:36:29.997 INFO:tasks.ceph:Checking cluster log for badness...

note the ~5 minute time difference in the last two lines

Actions

Copy link

Updated by Sage Weil over 4 years ago

2019-09-28T22:04:09.181 INFO:teuthology.misc:Shutting down mon daemons... ([<teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18e7d50>, <teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18eb110>, <teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18e7750>])
2019-09-28T22:04:09.181 INFO:teuthology.misc:Daemon <teuthology.orchestra.daemon.state.DaemonState object at 0x7fe1e18e7d50>
2019-09-28T22:04:09.181 INFO:teuthology.misc:Stopping ceph.mon.a...
2019-09-28T22:04:09.182 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2019-09-28T22:04:09.182 INFO:teuthology.orchestra.run:waiting for 300
2019-09-28T22:04:09.419 INFO:tasks.ceph.mon.a.smithi002.stderr:2019-09-28T22:04:09.365+0000 16082700 -1 received  signal: Terminated from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a  (PID: 62151) UID: 0
2019-09-28T22:04:09.461 INFO:tasks.ceph.mon.a.smithi002.stderr:2019-09-28T22:04:09.434+0000 16082700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2019-09-28T22:09:08.261 INFO:tasks.ceph:Checking cluster log for badness...

with this patch

commit bf01518c967095ccb99d60c80e7da5b9d7487429 (HEAD -> refs/heads/bug-17944, refs/remotes/gh/bug-17944)
Author: Sage Weil <sage@redhat.com>
Date:   Sat Sep 28 11:10:10 2019 -0500

    misc: more verbose while shutting down daemons

    Hunting https://tracker.ceph.com/issues/17944

    Signed-off-by: Sage Weil <sage@redhat.com>

diff --git a/teuthology/misc.py b/teuthology/misc.py
index 52c304c..54c692f 100644
--- a/teuthology/misc.py
+++ b/teuthology/misc.py
@@ -1240,16 +1240,20 @@ def stop_daemons_of_type(ctx, type_, cluster='ceph'):
     """ 
     :param type_: type of daemons to be stopped.
     """ 
-    log.info('Shutting down %s daemons...' % type_)
     exc_info = (None, None, None)
-    for daemon in ctx.daemons.iter_daemons_of_role(type_, cluster):
+    daemons = ctx.daemons.iter_daemons_of_role(type_, cluster)
+    log.info('Shutting down %s daemons... (%s)' % (type_, daemons))
+    for daemon in daemons:
+        log.info('Daemon %s' % daemon)
         try:
+            log.info('Stopping %s.%s...' % (daemon.role, daemon.id_))
             daemon.stop()
         except (CommandFailedError,
                 CommandCrashedError,
                 ConnectionLostError):
             exc_info = sys.exc_info()
             log.exception('Saw exception from %s.%s', daemon.role, daemon.id_)
+    log.info('Done shutting down %s daemons (%s)' % (type_, daemons))
     if exc_info != (None, None, None):
         raise exc_info[0], exc_info[1], exc_info[2]

it looks like the daemon.stop() commadn is yielding or something? But AFAICS we're already in the finally: portion of the work, see ceph.py run_daemon():

    try:
        yield
    finally:
        teuthology.stop_daemons_of_type(ctx, type_, cluster_name)

someone who has more of a clue about how @contextlib.contextmanager works needs to look at this

Actions

Copy link

Updated by Sage Weil over 4 years ago

fyi, this reproduce 4 times out of 40 here: http://pulpito.ceph.com/sage-2019-09-28_20:08:27-rados:verify-master-distro-basic-smithi/
with this:

teuthology-suite -s rados/verify -c wip-read-hole-bypg -k distro -m smithi -p 40 --filter valgrind --subset 1/99 --teuthology-branch bug-17944  -N 10

Actions

Copy link

#10

Updated by Sage Weil over 4 years ago

/a/sage-2019-10-25_05:04:47-rados-wip-sage2-testing-2019-10-24-1944-distro-basic-smithi/4442713

Actions

Copy link

#11

Updated by Neha Ojha over 4 years ago

/a/yuriw-2019-11-08_00:57:06-rados-wip-yuri5-testing-2019-11-07-2203-nautilus-distro-basic-smithi/4481148/

Actions

Copy link

#12

Updated by Patrick Donnelly over 4 years ago

Status changed from 12 to New

Actions

Copy link

#13

Updated by Josh Durgin over 2 years ago

Is anyone still seeing this happen?

Actions

Copy link

#14

Updated by Josh Durgin over 2 years ago

Has duplicate Bug #43292: teuthology does not stop all daemons added

Actions

Copy link

#15

Updated by Josh Durgin over 2 years ago

Status changed from New to Can't reproduce

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Tools » teuthology

Custom queries

Bug #17944

misc.py: stop_daemons_of_type did not stop them all (sometimes osds, sometimes mons)

Updated by Sage Weil over 7 years ago

Updated by Sage Weil over 6 years ago

Updated by Sage Weil about 5 years ago

Updated by Neha Ojha almost 5 years ago

Updated by Sage Weil almost 5 years ago

Updated by Sage Weil almost 5 years ago

Updated by Sage Weil over 4 years ago

Updated by Sage Weil over 4 years ago

Updated by Sage Weil over 4 years ago

Updated by Sage Weil over 4 years ago

Updated by Neha Ojha over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Josh Durgin over 2 years ago

Updated by Josh Durgin over 2 years ago

Updated by Josh Durgin over 2 years ago