Bug #2073
msgr: shutdown can hang
% Done: 0%
Regression: No
Severity: 3 - minor
Description
Saw this:
2012-02-16T14:04:01.446 INFO:teuthology.task.mon_recovery:removing mon 2
2012-02-16T14:04:01.446 DEBUG:teuthology.task.ceph.mon.2:waiting for process to exit
2012-02-16T14:04:01.447 INFO:teuthology.task.ceph.mon.2.err:2012-02-16 14:04:01.355464 7f1de35ee700 mon.2@0(leader) e1 *** Got Signal Terminated ***
2012-02-16T16:16:45.935 INFO:teuthology.task.ceph.mon.2:Stopped
This is fallout from the signal handling change. We need to run this test in a loop to verify that we behave properly.
2012-02-16T14:01:13.190 DEBUG:teuthology.run:Config:
  kernel:
    sha1: 07fd42934a53b8486709f7f866346a9e4bb6d5ce
  nuke-on-error: true
  overrides:
    ceph:
      conf:
        osd:
          osd op complaint time: 120
      coverage: true
      log-whitelist:
      - clocks not synchronized
      - old request
      sha1: 4b3bb5ab37a05fa001d59f24da7d9c30d650321b
  roles:
  - - mon.0
    - osd.0
  - - mon.1
    - mds.a
  - - mon.2
    - osd.1
  tasks:
  - chef: null
  - ceph: null
  - mon_recovery: null
2012-02-16T14:01:13.190 INFO:teuthology.run_tasks:Running task internal.lock_machines...
Updated by Sage Weil about 12 years ago
- Subject changed from mon: shutdown can hang to msgr: shutdown can hang
- Category changed from Monitor to msgr
here's the bt:
2012-02-16 18:04:33.090989 7f1939949700 mon.g@8(peon) e1 shutdown
2012-02-16 18:04:33.092656 7f1937237700 -- 10.3.14.160:6791/0 >> 10.3.14.141:6789/0 pipe(0x28af500 sd=17 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2012-02-16 18:19:41.288209 7f1937237700 -- 10.3.14.160:6791/0 >> 10.3.14.141:6789/0 pipe(0x28af500 sd=17 pgs=69 cs=1 l=0).fault with nothing to send, going to standby

(gdb) thr appl all bt

Thread 6 (Thread 0x7f193b94d700 (LWP 5581)):
#0 sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:103
#1 0x00000000005a35cf in CephContextServiceThread::entry (this=0x2848b40) at common/ceph_context.cc:53
#2 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f193b14c700 (LWP 5582)):
#0 0x00007f193bc41203 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x0000000000524b9e in AdminSocket::entry (this=0x2857000) at common/admin_socket.cc:211
#2 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f1939949700 (LWP 5585)):
#0 0x00007f193bc462c3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000000005b17ae in SignalHandler::entry() ()
#2 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f1937237700 (LWP 6052)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x000000000054d7c6 in Wait (this=0x28af500) at ./common/Cond.h:48
#2 SimpleMessenger::Pipe::reader (this=0x28af500) at msg/SimpleMessenger.cc:1560
#3 0x00000000004635dd in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:196
#4 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#5 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f193773c700 (LWP 6053)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000545b86 in Wait (this=0x28af500) at ./common/Cond.h:48
#2 SimpleMessenger::Pipe::writer (this=0x28af500) at msg/SimpleMessenger.cc:1784
#3 0x00000000004635fd in SimpleMessenger::Pipe::Writer::entry (this=<value optimized out>) at msg/SimpleMessenger.h:204
#4 0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#5 0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f193d7e6780 (LWP 5553)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000540a22 in Wait (this=0x2844680) at ./common/Cond.h:48
#2 SimpleMessenger::wait (this=0x2844680) at msg/SimpleMessenger.cc:2689
#3 0x000000000046147f in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_mon.cc:417
Basically, we were in the midst of accept()ing a new connection when we shut down. The new pipe registered itself (in STANDBY in this case), and wait() is now waiting for it to go away.
- the accept should abort if we shut down
- it needs to not register itself
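The two points above can be illustrated with a minimal, single-threaded sketch. It is not the actual fix and does not use SimpleMessenger's API: `MiniMessenger`, `register_pipe`, `wait_would_block`, and `late_accept_leaves_wait_blocked` are hypothetical names invented for this example. The idea, under those assumptions, is that registration checks a stopping flag under the same lock that shutdown takes, so an accept that completes after shutdown is refused and leaves nothing behind for wait() to block on.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>

// Hypothetical stand-in for a messenger. None of these names are Ceph's API.
class MiniMessenger {
 public:
  // Called by an accept path once its handshake has finished.
  // Returns false, registering nothing, if shutdown has already begun.
  bool register_pipe(int pipe_id) {
    std::lock_guard<std::mutex> guard(lock_);
    if (stopping_) {
      return false;  // the accept should abort; the caller closes its socket
    }
    pipes_.insert(pipe_id);
    return true;
  }

  // Marks the messenger as stopping and drops every pipe known at this point.
  void shutdown() {
    std::lock_guard<std::mutex> guard(lock_);
    stopping_ = true;
    pipes_.clear();  // stand-in for stopping and reaping each pipe
  }

  // True if a wait() that blocks until all pipes are gone would still block.
  bool wait_would_block() {
    std::lock_guard<std::mutex> guard(lock_);
    return !pipes_.empty();
  }

 private:
  std::mutex lock_;
  bool stopping_ = false;
  std::set<int> pipes_;
};

// Replays the ordering from the report: shutdown first, then a late accept.
bool late_accept_leaves_wait_blocked() {
  MiniMessenger msgr;
  bool accepted = msgr.register_pipe(1);  // an established connection
  assert(accepted);
  (void)accepted;         // silences the unused warning when NDEBUG is set
  msgr.shutdown();        // reaps pipe 1
  msgr.register_pipe(2);  // an accept that finishes after shutdown
  return msgr.wait_would_block();
}
```

Without the `stopping_` check in `register_pipe`, pipe 2 would be inserted after shutdown had already reaped the set, which is the situation the backtrace shows: main sits in wait() while a standby pipe remains registered.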
Updated by Sage Weil about 12 years ago
- Status changed from New to Resolved
- Assignee set to Sage Weil
This appears to be fixed with commit:787dd1709797876dd9fa6004c6723df859003b59, unless there is some subtle difference between my manual tests and the nightly teuthology runs.
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to Messengers
- Category deleted (msgr)
- Target version deleted (v0.43)