Bug #2073
closed
Added by Sage Weil about 12 years ago.
Updated about 5 years ago.
Description
saw this
2012-02-16T14:04:01.446 INFO:teuthology.task.mon_recovery:removing mon 2
2012-02-16T14:04:01.446 DEBUG:teuthology.task.ceph.mon.2:waiting for process to exit
2012-02-16T14:04:01.447 INFO:teuthology.task.ceph.mon.2.err:2012-02-16 14:04:01.355464 7f1de35ee700 mon.2@0(leader) e1 *** Got Signal Terminated ***
2012-02-16T16:16:45.935 INFO:teuthology.task.ceph.mon.2:Stopped
this is fallout from the signal handling change. need to run this test in a loop to verify we behave properly.
2012-02-16T14:01:13.190 DEBUG:teuthology.run:Config:
kernel:
  sha1: 07fd42934a53b8486709f7f866346a9e4bb6d5ce
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: 4b3bb5ab37a05fa001d59f24da7d9c30d650321b
roles:
- - mon.0
  - osd.0
- - mon.1
  - mds.a
- - mon.2
  - osd.1
tasks:
- chef: null
- ceph: null
- mon_recovery: null
2012-02-16T14:01:13.190 INFO:teuthology.run_tasks:Running task internal.lock_machines...
- Subject changed from mon: shutdown can hang to msgr: shutdown can hang
- Category changed from Monitor to msgr
here's the bt:
2012-02-16 18:04:33.090989 7f1939949700 mon.g@8(peon) e1 shutdown
2012-02-16 18:04:33.092656 7f1937237700 -- 10.3.14.160:6791/0 >> 10.3.14.141:6789/0 pipe(0x28af500 sd=17 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2012-02-16 18:19:41.288209 7f1937237700 -- 10.3.14.160:6791/0 >> 10.3.14.141:6789/0 pipe(0x28af500 sd=17 pgs=69 cs=1 l=0).fault with nothing to send, going to standby
(gdb) thr appl all bt
Thread 6 (Thread 0x7f193b94d700 (LWP 5581)):
#0 sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:103
#1 0x00000000005a35cf in CephContextServiceThread::entry (this=0x2848b40) at common/ceph_context.cc:53
#2 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()
Thread 5 (Thread 0x7f193b14c700 (LWP 5582)):
#0 0x00007f193bc41203 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x0000000000524b9e in AdminSocket::entry (this=0x2857000) at common/admin_socket.cc:211
#2 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()
Thread 4 (Thread 0x7f1939949700 (LWP 5585)):
#0 0x00007f193bc462c3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000000005b17ae in SignalHandler::entry() ()
#2 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()
Thread 3 (Thread 0x7f1937237700 (LWP 6052)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x000000000054d7c6 in Wait (this=0x28af500) at ./common/Cond.h:48
#2 SimpleMessenger::Pipe::reader (this=0x28af500) at msg/SimpleMessenger.cc:1560
#3 0x00000000004635dd in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:196
#4 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#5 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7f193773c700 (LWP 6053)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000545b86 in Wait (this=0x28af500) at ./common/Cond.h:48
#2 SimpleMessenger::Pipe::writer (this=0x28af500) at msg/SimpleMessenger.cc:1784
#3 0x00000000004635fd in SimpleMessenger::Pipe::Writer::entry (this=<value optimized out>) at msg/SimpleMessenger.h:204
#4 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#5 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f193d7e6780 (LWP 5553)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000540a22 in Wait (this=0x2844680) at ./common/Cond.h:48
#2 SimpleMessenger::wait (this=0x2844680) at msg/SimpleMessenger.cc:2689
#3 0x000000000046147f in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_mon.cc:417
basically, we were in the midst of accept()ing a new connection when we shut down. the new pipe registered itself anyway (in STANDBY in this case), and wait() is now waiting for it to go away, which never happens.
- the accept should abort if we shut down
- it needs to not register itself
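a minimal sketch of the two points above, not the actual SimpleMessenger code: ToyMessenger, Pipe, register_accepted(), unregister() and num_pipes() are hypothetical names made up for illustration. the idea is that the stopping flag and the pipe set share one lock, so an accept that finishes after shutdown() sees the flag, refuses to register, and wait() can drain.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <set>

// Hypothetical stand-in for a connection; not the real SimpleMessenger::Pipe.
struct Pipe {
  int id;
};

// Hypothetical, simplified messenger; names do not match the Ceph source.
class ToyMessenger {
  std::mutex lock;
  std::condition_variable cond;
  std::set<std::shared_ptr<Pipe>> pipes;  // registered pipes that wait() drains
  bool stopping = false;

public:
  // Called at the end of an accept handshake. Returns false and registers
  // nothing if shutdown has already begun, so the caller can abort the accept.
  bool register_accepted(const std::shared_ptr<Pipe>& p) {
    std::lock_guard<std::mutex> l(lock);
    if (stopping)
      return false;
    pipes.insert(p);
    return true;
  }

  // Remove a pipe and wake anyone blocked in wait().
  void unregister(const std::shared_ptr<Pipe>& p) {
    std::lock_guard<std::mutex> l(lock);
    pipes.erase(p);
    cond.notify_all();
  }

  // Set the flag under the same lock that guards the pipe set, so no
  // accept can register a pipe after this returns.
  void shutdown() {
    std::lock_guard<std::mutex> l(lock);
    stopping = true;
    cond.notify_all();
  }

  // Blocks until shutdown was requested and every registered pipe is gone.
  void wait() {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return stopping && pipes.empty(); });
  }

  std::size_t num_pipes() {
    std::lock_guard<std::mutex> l(lock);
    return pipes.size();
  }
};
```

with this shape, a pipe accepted before shutdown() still has to be torn down before wait() returns, but one that completes its handshake afterwards is never added to the set.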
- Status changed from New to Resolved
- Assignee set to Sage Weil
this appears to be fixed with commit:787dd1709797876dd9fa6004c6723df859003b59, unless there is some subtle difference between my manual tests and the nightly teuth runs.
- Project changed from Ceph to Messengers
- Category deleted (msgr)
- Target version deleted (v0.43)