Project

General

Profile

Bug #2090

mon: assertion failed on shutdown

Added by Alex Elder over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I was running repeated cycles of the kernel_untar_build.sh workunit
to try to reproduce a problem in the client and hit this problem
instead.
./common/Mutex.h: 89: FAILED assert(nlock 0)
Looks like this is the version:
ceph version 0.42-106-g761ecc6 (761ecc69c24856b15531c92b69b1c73c5cc81bfc)
Full info below (sorry if it's excessive, I'm not sure what's needed).

Here is the YAML file that was driving it:
roles:
- [mon.a, mon.c, osd.0]
- [mon.b, mds.a, osd.1]
- [client.0]
kernel:
mon:
branch: master
client:
branch: testing
overrides:
ceph:
conf:
osd:
osd op complaint time: 120
coverage: true
fs: btrfs
log-whitelist:
- clocks not synchronized
- old request
tasks:
- ceph:
- rbd:
all:
- workunit:
all:
- kernel_untar_build.sh

And here is the information I got at the end of the run:

. . .
INFO:teuthology.task.workunit.client.0.out:removed `linux-2.6.33/usr/gen_init_cpio.c'
INFO:teuthology.task.workunit.client.0.out:removed directory: `linux-2.6.33/usr'
INFO:teuthology.task.workunit.client.0.out:removed `linux-2.6.33/README'
INFO:teuthology.task.workunit.client.0.out:removed `linux-2.6.33/COPYING'
INFO:teuthology.task.workunit.client.0.out:removed directory: `linux-2.6.33'
INFO:teuthology.task.workunit.client.0.out:removed directory: `t'
INFO:teuthology.task.workunit.client.0.out:removed `linux-2.6.33.tar.bz2'
INFO:teuthology.task.rbd:Unmounting rbd images...
INFO:teuthology.task.rbd:Unmapping rbd devices...
INFO:teuthology.task.rbd:Unloading rbd kernel module...
INFO:teuthology.task.rbd:Deleting rbd images...
Removing image: 100% complete...done.
INFO:teuthology.task.ceph:Shutting down mds daemons...
INFO:teuthology.task.ceph.mds.a.err:2012-02-21 22:14:40.784639 7f5f8deb8700 mds.0.1 * got signal Terminated
INFO:teuthology.task.ceph.mds.a:Stopped
INFO:teuthology.task.ceph:Shutting down osd daemons...
INFO:teuthology.task.ceph.osd.1:Stopped
INFO:teuthology.task.ceph.osd.0:Stopped
INFO:teuthology.task.ceph:Shutting down mon daemons...
INFO:teuthology.task.ceph.mon.a.err:2012-02-21 22:14:42.088201 7f51e2e79700 mon.a@1(peon) e1
Got Signal Terminated
INFO:teuthology.task.ceph.mon.a.err:./common/Mutex.h: In function 'Mutex::~Mutex()' thread 7f51e6d16780 time 2012-02-21 22:14:42.091122
INFO:teuthology.task.ceph.mon.a.err:./common/Mutex.h: 89: FAILED assert(nlock 0)
INFO:teuthology.task.ceph.mon.a.err: ceph version 0.42-106-g761ecc6 (761ecc69c24856b15531c92b69b1c73c5cc81bfc)
INFO:teuthology.task.ceph.mon.a.err: 1: (Monitor::~Monitor()+0x95c) [0x476d2c]
INFO:teuthology.task.ceph.mon.a.err: 2: (main()+0x301e) [0x461a1e]
INFO:teuthology.task.ceph.mon.a.err: 3: (libc_start_main()+0xfe) [0x7f51e50b5d8e]
INFO:teuthology.task.ceph.mon.a.err: 4: /tmp/cephtest/binary/usr/local/bin/ceph-mon() [0x45e799]
INFO:teuthology.task.ceph.mon.a.err: ceph version 0.42-106-g761ecc6 (761ecc69c24856b15531c92b69b1c73c5cc81bfc)
INFO:teuthology.task.ceph.mon.a.err: 1: (Monitor::~Monitor()+0x95c) [0x476d2c]
INFO:teuthology.task.ceph.mon.a.err: 2: (main()+0x301e) [0x461a1e]
INFO:teuthology.task.ceph.mon.a.err: 3: (_libc_start_main()+0xfe) [0x7f51e50b5d8e]
INFO:teuthology.task.ceph.mon.a.err: 4: /tmp/cephtest/binary/usr/local/bin/ceph-mon() [0x45e799]
INFO:teuthology.task.ceph.mon.a.err:terminate called after throwing an instance of 'ceph::FailedAssertion'
INFO:teuthology.task.ceph.mon.a.err:
** Caught signal (Aborted)
INFO:teuthology.task.ceph.mon.a.err: in thread 7f51e6d16780
INFO:teuthology.task.ceph.mon.a.err: ceph version 0.42-106-g761ecc6 (761ecc69c24856b15531c92b69b1c73c5cc81bfc)
INFO:teuthology.task.ceph.mon.a.err: 1: /tmp/cephtest/binary/usr/local/bin/ceph-mon() [0x5b2931]
INFO:teuthology.task.ceph.mon.a.err: 2: (()+0xfb40) [0x7f51e68f6b40]
INFO:teuthology.task.ceph.mon.a.err: 3: (gsignal()+0x35) [0x7f51e50caba5]
INFO:teuthology.task.ceph.mon.a.err: 4: (abort()+0x180) [0x7f51e50ce6b0]
INFO:teuthology.task.ceph.mon.a.err: 5: (
_gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f51e596e6bd]
INFO:teuthology.task.ceph.mon.a.err: 6: (()+0xb9906) [0x7f51e596c906]
INFO:teuthology.task.ceph.mon.a.err: 7: (()+0xb9933) [0x7f51e596c933]
INFO:teuthology.task.ceph.mon.a.err: 8: (()+0xb9a3e) [0x7f51e596ca3e]
INFO:teuthology.task.ceph.mon.a.err: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x841) [0x534871]
INFO:teuthology.task.ceph.mon.a.err: 10: (Monitor::~Monitor()+0x95c) [0x476d2c]
INFO:teuthology.task.ceph.mon.a.err: 11: (main()+0x301e) [0x461a1e]
INFO:teuthology.task.ceph.mon.a.err: 12: (_libc_start_main()+0xfe) [0x7f51e50b5d8e]
INFO:teuthology.task.ceph.mon.a.err: 13: /tmp/cephtest/binary/usr/local/bin/ceph-mon() [0x45e799]
INFO:teuthology.task.ceph.mon.a.err:daemon-helper: command crashed with signal 6
ERROR:teuthology.task.ceph:Saw exception from mon.a
Traceback (most recent call last):
File "/home/elder/ceph/teuthology/teuthology/task/ceph.py", line 817, in run_daemon
daemon.stop()
File "/home/elder/ceph/teuthology/teuthology/task/ceph.py", line 40, in stop
run.wait([self.proc])
File "/home/elder/ceph/teuthology/teuthology/orchestra/run.py", line 272, in wait
proc.exitstatus.get()
File "/home/elder/ceph/teuthology/virtualenv/local/lib/python2.7/site-packages/gevent/event.py", line 223, in get
raise self._exception
CommandFailedError: Command failed with status 1: '/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/binary/usr/local/bin/ceph-mon -f -i a -c /tmp/cephtest/ceph.conf'
INFO:teuthology.task.ceph.mon.c.err:2012-02-21 22:14:42.332393 7fb5e458c700 mon.c@2(peon) e1
Got Signal Terminated *
INFO:teuthology.task.ceph.mon.c:Stopped
INFO:teuthology.task.ceph.mon.b.err:2012-02-21 22:14:42.428996 7f6586237700 mon.b@0(leader) e1 *
Got Signal Terminated *
*
INFO:teuthology.task.ceph.mon.b:Stopped
INFO:teuthology.task.ceph:Grabbing cluster log from mon.a...
INFO:teuthology.task.ceph:Checking cluster ceph.log for badness...
INFO:teuthology.task.ceph:Unmounting /tmp/cephtest/data/osd.0.data on
INFO:teuthology.task.ceph:Unmounting /tmp/cephtest/data/osd.1.data on
INFO:teuthology.task.ceph:Cleaning ceph cluster...
INFO:teuthology.task.ceph:Removing ceph binaries...
INFO:teuthology.task.ceph:Removing shipped files: daemon-helper enable-coredump...
INFO:teuthology.task.ceph:Compressing logs...
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2868950>
Traceback (most recent call last):
File "/home/elder/ceph/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
suppress = manager.
_exit
(*exc_info)
File "/usr/lib/python2.7/contextlib.py", line 24, in exit
self.gen.next()
File "/home/elder/ceph/teuthology/teuthology/task/ceph.py", line 1017, in task
yield
File "/usr/lib/python2.7/contextlib.py", line 24, in exit
self.gen.next()
File "/home/elder/ceph/teuthology/teuthology/contextutil.py", line 35, in nested
if exit(*exc):
File "/usr/lib/python2.7/contextlib.py", line 24, in exit
self.gen.next()
File "/home/elder/ceph/teuthology/teuthology/task/ceph.py", line 817, in run_daemon
daemon.stop()
File "/home/elder/ceph/teuthology/teuthology/task/ceph.py", line 40, in stop
run.wait([self.proc])
File "/home/elder/ceph/teuthology/teuthology/orchestra/run.py", line 272, in wait
proc.exitstatus.get()
File "/home/elder/ceph/teuthology/virtualenv/local/lib/python2.7/site-packages/gevent/event.py", line 223, in get
raise self._exception
CommandFailedError: Command failed with status 1: '/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/binary/usr/local/bin/ceph-mon -f -i a -c /tmp/cephtest/ceph.conf'
INFO:teuthology.task.internal:Shutting down syslog monitoring...
INFO:teuthology.orchestra.run.out:rsyslog start/running, process 4452
INFO:teuthology.orchestra.run.out:rsyslog start/running, process 18961
INFO:teuthology.orchestra.run.out:rsyslog start/running, process 25670
INFO:teuthology.task.internal:Checking logs for errors...
INFO:teuthology.task.internal:Compressing syslogs...
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
WARNING:teuthology.task.internal:Found coredumps on , flagging run as failed
INFO:teuthology.task.internal:Transferring archived files...
INFO:teuthology.task.internal:Removing archive directory...
INFO:teuthology.task.internal:Tidying up after the test...
INFO:teuthology.run:Duration was 773.190049 seconds

Associated revisions

Revision 62a113aa (diff)
Added by Sage Weil over 8 years ago

mon: unlock mon before msgr shutdown

The ceph_mon.cc main() will delete mon when the msgr dispatch thread
completes. Make sure we unlock before we shut down the messenger, and
avoid touching this after messenger->shutdown().

Fixes: #2090
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil over 8 years ago

  • Category set to Monitor
  • Status changed from New to 4
  • Target version set to v0.43

wip-2090

#2 Updated by Sage Weil over 8 years ago

(gdb) thr app all bt

Thread 4 (Thread 25243):
#0  __lll_unlock_wake () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:366
#1  0x00007f51e68f1d66 in _L_unlock_696 () from /lib/libpthread.so.0
#2  0x00007f51e68f1ca3 in __pthread_mutex_unlock_usercnt (mutex=0x1e62810) at pthread_mutex_unlock.c:52
#3  __pthread_mutex_unlock (mutex=0x1e62810) at pthread_mutex_unlock.c:290
#4  0x000000000047dc5a in Mutex::Unlock (this=0x1e62800) at ./common/Mutex.h:122
#5  0x00000000005487b0 in SimpleMessenger::mark_down_all (this=0x1e62680) at msg/SimpleMessenger.cc:2733
#6  0x000000000053b79d in SimpleMessenger::shutdown (this=0x1e62680) at msg/SimpleMessenger.cc:409
#7  0x0000000000469731 in Monitor::shutdown (this=0x1e80000) at mon/Monitor.cc:377
#8  0x0000000000471c21 in Monitor::handle_signal (this=0x1e80000, signum=<value optimized out>) at mon/Monitor.cc:224
#9  0x00000000005b3973 in SignalHandler::entry() ()
#10 0x00007f51e68ee971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#11 0x00007f51e517d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()

Thread 3 (Thread 25240):
#0  0x00007f51e5171203 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x0000000000523d2e in AdminSocket::entry (this=0x1e75000) at common/admin_socket.cc:211
#2  0x00007f51e68ee971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3  0x00007f51e517d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 2 (Thread 25239):
#0  sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:103
#1  0x00000000005a569f in CephContextServiceThread::entry (this=0x1e66c80) at common/ceph_context.cc:53
#2  0x00007f51e68ee971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3  0x00007f51e517d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 1 (Thread 25238):
#0  0x00007f51e68f6a0b in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x00000000005b2b6e in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:95
#3  <signal handler called>
#4  0x00007f51e50caba5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007f51e50ce6b0 in abort () at abort.c:92
#6  0x00007f51e596e6bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007f51e596c906 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007f51e596c933 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007f51e596ca3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000534871 in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, 
    line=<value optimized out>, func=<value optimized out>) at common/assert.cc:75
#11 0x0000000000476d2c in ~Mutex (this=0x1e80000, __in_chrg=<value optimized out>) at ./common/Mutex.h:89
#12 Monitor::~Monitor (this=0x1e80000, __in_chrg=<value optimized out>) at mon/Monitor.cc:188
#13 0x0000000000461a1e in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_mon.cc:420

#3 Updated by Sage Weil over 8 years ago

  • Status changed from 4 to Resolved

Also available in: Atom PDF