Bug #2073

closed

msgr: shutdown can hang

Added by Sage Weil about 12 years ago. Updated about 5 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Saw this (note the gap of more than two hours between the signal and the process actually stopping):


2012-02-16T14:04:01.446 INFO:teuthology.task.mon_recovery:removing mon 2
2012-02-16T14:04:01.446 DEBUG:teuthology.task.ceph.mon.2:waiting for process to exit
2012-02-16T14:04:01.447 INFO:teuthology.task.ceph.mon.2.err:2012-02-16 14:04:01.355464 7f1de35ee700 mon.2@0(leader) e1 *** Got Signal Terminated ***
2012-02-16T16:16:45.935 INFO:teuthology.task.ceph.mon.2:Stopped

This is fallout from the signal handling change. We need to run this test in a loop to verify that we behave properly.

2012-02-16T14:01:13.190 DEBUG:teuthology.run:Config:
  kernel:
    sha1: 07fd42934a53b8486709f7f866346a9e4bb6d5ce
  nuke-on-error: true
  overrides:
    ceph:
      conf:
        osd:
          osd op complaint time: 120
      coverage: true
      log-whitelist:
      - clocks not synchronized
      - old request
      sha1: 4b3bb5ab37a05fa001d59f24da7d9c30d650321b
  roles:
  - - mon.0
    - osd.0
  - - mon.1
    - mds.a
  - - mon.2
    - osd.1
  tasks:
  - chef: null
  - ceph: null
  - mon_recovery: null
2012-02-16T14:01:13.190 INFO:teuthology.run_tasks:Running task internal.lock_machines...

Actions #1

Updated by Sage Weil about 12 years ago

  • Subject changed from mon: shutdown can hang to msgr: shutdown can hang
  • Category changed from Monitor to msgr

here's the bt:

2012-02-16 18:04:33.090989 7f1939949700 mon.g@8(peon) e1 shutdown
2012-02-16 18:04:33.092656 7f1937237700 -- 10.3.14.160:6791/0 >> 10.3.14.141:6789/0 pipe(0x28af500 sd=17 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2012-02-16 18:19:41.288209 7f1937237700 -- 10.3.14.160:6791/0 >> 10.3.14.141:6789/0 pipe(0x28af500 sd=17 pgs=69 cs=1 l=0).fault with nothing to send, going to standby

(gdb) thr appl all bt

Thread 6 (Thread 0x7f193b94d700 (LWP 5581)):
#0  sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:103
#1  0x00000000005a35cf in CephContextServiceThread::entry (this=0x2848b40) at common/ceph_context.cc:53
#2  0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3  0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f193b14c700 (LWP 5582)):
#0  0x00007f193bc41203 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x0000000000524b9e in AdminSocket::entry (this=0x2857000) at common/admin_socket.cc:211
#2  0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3  0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f1939949700 (LWP 5585)):
#0  0x00007f193bc462c3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000005b17ae in SignalHandler::entry() ()
#2  0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#3  0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f1937237700 (LWP 6052)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000000054d7c6 in Wait (this=0x28af500) at ./common/Cond.h:48
#2  SimpleMessenger::Pipe::reader (this=0x28af500) at msg/SimpleMessenger.cc:1560
#3  0x00000000004635dd in SimpleMessenger::Pipe::Reader::entry (this=<value optimized out>) at msg/SimpleMessenger.h:196
#4  0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#5  0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f193773c700 (LWP 6053)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x0000000000545b86 in Wait (this=0x28af500) at ./common/Cond.h:48
#2  SimpleMessenger::Pipe::writer (this=0x28af500) at msg/SimpleMessenger.cc:1784
#3  0x00000000004635fd in SimpleMessenger::Pipe::Writer::entry (this=<value optimized out>) at msg/SimpleMessenger.h:204
#4  0x00007f193d3be971 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#5  0x00007f193bc4d92d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f193d7e6780 (LWP 5553)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x0000000000540a22 in Wait (this=0x2844680) at ./common/Cond.h:48
#2  SimpleMessenger::wait (this=0x2844680) at msg/SimpleMessenger.cc:2689
#3  0x000000000046147f in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_mon.cc:417

Basically, we were in the midst of accept()ing a new connection when we shut down. The new pipe registered itself anyway (in STANDBY in this case), and wait() is now waiting for it to go away.

- the accept should abort if we shut down
- it needs to not register itself
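
The two points above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual SimpleMessenger code or the eventual fix: ToyMessenger, register_pipe, unregister_pipe, num_pipes and the stopping_ flag are all hypothetical names. The idea is that the check for shutdown and the registration happen under the same lock, so an accept that finishes after shutdown() has been called cannot add a pipe that wait() would then block on.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>

// Hypothetical stand-in for the messenger; all names are illustrative only.
class ToyMessenger {
public:
  // Called by an accepting connection once its handshake completes.
  // Returns false, and registers nothing, if shutdown has already begun,
  // so the caller can abort the accept and close the socket.
  bool register_pipe(int pipe_id) {
    std::lock_guard<std::mutex> l(lock_);
    if (stopping_)
      return false;
    pipes_.insert(pipe_id);
    return true;
  }

  // Called when a pipe is torn down; wakes wait() once the set is empty.
  void unregister_pipe(int pipe_id) {
    std::lock_guard<std::mutex> l(lock_);
    pipes_.erase(pipe_id);
    if (pipes_.empty())
      cond_.notify_all();
  }

  // Begin shutdown: from here on no new pipe can register.
  void shutdown() {
    std::lock_guard<std::mutex> l(lock_);
    stopping_ = true;
    cond_.notify_all();
  }

  // Block until shutdown was requested and every registered pipe is gone.
  void wait() {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [this] { return stopping_ && pipes_.empty(); });
  }

  std::size_t num_pipes() const {
    std::lock_guard<std::mutex> l(lock_);
    return pipes_.size();
  }

private:
  mutable std::mutex lock_;
  std::condition_variable cond_;
  std::set<int> pipes_;
  bool stopping_ = false;
};
```

With this shape, a late accept sees register_pipe() return false and can simply drop the connection, while wait() only has to outlast pipes that were registered before shutdown() ran.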

Actions #2

Updated by Sage Weil about 12 years ago

  • Status changed from New to Resolved
  • Assignee set to Sage Weil

This appears to be fixed with commit:787dd1709797876dd9fa6004c6723df859003b59, unless there is some subtle difference between my manual tests and the nightly teuthology runs.

Actions #3

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)
  • Target version deleted (v0.43)