Project

General

Profile

Bug #535

cephtool hangs forever until a UNIX signal is received

Added by Colin McCabe about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
11/01/2010
Due date:
% Done:

0%

Spent time:
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I just saw this twice in a row. cephtool hangs forever until a UNIX signal is received. That seems to break the logjam.

I notice that SimpleMessenger is blocked on a read from a tcp socket.
Perhaps this bug is caused by Nagle's algorithm?
http://en.wikipedia.org/wiki/Nagle%27s_algorithm

(gdb) thread apply all bt

Thread 3 (Thread 0x7fa776ff8710 (LWP 3408)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
#1 0x00000000005f035f in Cond::WaitUntil (this=0x981390, mutex=@0x981310, when={tv = {tv_sec = 1288654163, tv_nsec = 547593000}})
at ./common/Cond.h:60
#2 0x0000000000686751 in Timer::timer_entry (this=0x9812a8) at common/Timer.cc:119
#3 0x00000000005bf8df in Timer::TimerThread::entry (this=0x9813c8) at ./common/Timer.h:77
#4 0x00000000005ea05e in Thread::_entry_func (arg=0x9813c8) at common/Thread.h:39
#5 0x00007fa77f8e28ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6 0x00007fa77abe102d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fa7767f7710 (LWP 3411)):
#0 0x00007fa77abd6113 in *GI_poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=900000) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00000000005e0ce6 in tcp_read (sd=3, buf=0x7fa7767f6da3 "?", len=1, timeout=900000) at msg/tcp.cc:18
#2 0x00000000005e68ac in SimpleMessenger::Pipe::reader (this=0x26385e0) at msg/SimpleMessenger.cc:1473
#3 0x00000000005c6fb9 in SimpleMessenger::Pipe::Reader::entry (this=0x2638818) at ./msg/SimpleMessenger.h:193
#4 0x00000000005ea05e in Thread::_entry_func (arg=0x2638818) at common/Thread.h:39
#5 0x00007fa77f8e28ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6 0x00007fa77abe102d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7fa77ff127c0 (LWP 3402)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00000000005c8b6d in Cond::Wait (this=0x2632888, mutex=@0x26325b8) at ./common/Cond.h:46
#2 0x00000000005df717 in SimpleMessenger::wait (this=0x2632430) at msg/SimpleMessenger.cc:2559
#3 0x0000000000596410 in main (argc=6, argv=0x2630ab0) at tools/ceph.cc:683
Current language: auto; currently c
162 in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
Current language: auto; currently asm
(gdb)

messenger-bug.txt View - with --debug_objecter 20 --debug-ms 1 (22.5 KB) Colin McCabe, 11/09/2010 03:08 PM

cephtool-hang-at-966369aad07461f2610b4dd2a9cdc770155c5a89.txt View - cephtool-hang-at-966369aad07461f2610b4dd2a9cdc770155c5a89.txt (243 KB) Colin McCabe, 11/09/2010 04:28 PM

History

#1 Updated by Sage Weil about 9 years ago

  • Assignee set to Greg Farnum

#2 Updated by Colin McCabe about 9 years ago

Perhaps this bug is caused by Nagle's algorithm?

As Sage pointed out, we're already running with TCP_NODELAY. So scratch that theory.

#3 Updated by Colin McCabe about 9 years ago

While running vstart.sh, I reproduced this bug with debug_ms = 20.

Here's what the output was. Since cephtool doesn't redirect logs to a logfile, this is taken from the console.

./cmds -i a -c ceph.conf
 ** WARNING: Ceph is still under heavy development, and is only suitable for **
 **          testing and review.  Do not trust it with important data.       **
starting mds.a at 0.0.0.0:6804/13017
creating dev/mds.b.keyring
read 119 bytes from dev/mds.b.keyring
2010-11-01 16:54:15.511902 mon <- [auth,add,mds.b]
2010-11-01 16:54:18.224411 mon0 -> 'added key for mds.b' (0)

#4 Updated by Colin McCabe about 9 years ago

Colin McCabe wrote:

While running vstart.sh, I reproduced this bug with debug_ms = 20.

Here's what the output was. Since cephtool doesn't redirect logs to a logfile, this is taken from the console.

[...]

And here is the backtrace that goes with this occurrence of the bug.

(gdb) thread apply all bt

Thread 4 (Thread 0x7f84e8e7d710 (LWP 13031)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
#1  0x00000000005f035f in Cond::WaitUntil (this=0x981390, mutex=@0x981310, when={tv = {tv_sec = 1288656798, tv_nsec = 515238000}})
    at ./common/Cond.h:60
#2  0x0000000000686671 in Timer::timer_entry (this=0x9812a8) at common/Timer.cc:119
#3  0x00000000005bf8df in Timer::TimerThread::entry (this=0x9813c8) at ./common/Timer.h:77
#4  0x00000000005ea05e in Thread::_entry_func (arg=0x9813c8) at common/Thread.h:39
#5  0x00007f84f17678ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6  0x00007f84eca6602d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()
Current language:  auto; currently asm

Thread 3 (Thread 0x7f84e847a710 (LWP 13116)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000005c8b6d in Cond::Wait (this=0x210ebb8, mutex=@0x210eab0) at ./common/Cond.h:46
#2  0x00000000005e4450 in SimpleMessenger::Pipe::writer (this=0x210e9e0) at msg/SimpleMessenger.cc:1680
#3  0x00000000005c6fd9 in SimpleMessenger::Pipe::Writer::entry (this=0x210ec30) at ./msg/SimpleMessenger.h:201
#4  0x00000000005ea05e in Thread::_entry_func (arg=0x210ec30) at common/Thread.h:39
#5  0x00007f84f17678ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6  0x00007f84eca6602d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f84e857b710 (LWP 13119)):
#0  0x00007f84eca5b113 in *__GI___poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=900000) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00000000005e0ce6 in tcp_read (sd=3, buf=0x7f84e857ada3 "?", len=1, timeout=900000) at msg/tcp.cc:18
#2  0x00000000005e68ac in SimpleMessenger::Pipe::reader (this=0x210e9e0) at msg/SimpleMessenger.cc:1473
#3  0x00000000005c6fb9 in SimpleMessenger::Pipe::Reader::entry (this=0x210ec18) at ./msg/SimpleMessenger.h:193
#4  0x00000000005ea05e in Thread::_entry_func (arg=0x210ec18) at common/Thread.h:39
#5  0x00007f84f17678ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6  0x00007f84eca6602d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f84f1d977c0 (LWP 13025)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000005c8b6d in Cond::Wait (this=0x2108888, mutex=@0x21085b8) at ./common/Cond.h:46
#2  0x00000000005df717 in SimpleMessenger::wait (this=0x2108430) at msg/SimpleMessenger.cc:2559
#3  0x0000000000596410 in main (argc=6, argv=0x2106ab0) at tools/ceph.cc:683
Current language:  auto; currently c
162     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
Current language:  auto; currently asm
(gdb) 

#5 Updated by Greg Farnum about 9 years ago

  • Status changed from New to In Progress

Okay, got in on a hang. The Pipe's been doing a disconnect/reconnect loop for about 4 minutes, it's currently in state 4; it was reliably calling mark_down on mon1 every 3 seconds and then reconnecting.
Got a copy of hte log, am taking a look.

#6 Updated by Greg Farnum about 9 years ago

  • Status changed from In Progress to Rejected

This occurrence is a problem on the monitor side that reproduces in the timer-fixes branch, but not unstable.

#7 Updated by Colin McCabe about 9 years ago

I should have written this at the top of the bug report, but this was on the unstable branch.

Anyway, I'll add more information here when/if it reproduces again on unstable.

#8 Updated by Colin McCabe about 9 years ago

  • Status changed from Rejected to In Progress

Reproduced again on the unfound branch, which is very close to what is in unstable now.

cmccabe@flab:~/src/ceph/src$ ps-ceph.rb
cmccabe 16972 0.0 0.0 10528 1376 pts/25 S+ 10:31 0:00 /bin/bash -x ./test/test_unfound.sh run
cmccabe 16983 0.0 0.0 11036 1676 pts/25 S+ 10:31 0:00 /bin/sh ./vstart.sh -d -n -o osd recovery delay start = 10000
cmccabe 17024 10.6 0.0 196352 4148 ? Ssl 10:31 0:17 ./cmon -i a -c ceph.conf
cmccabe 17028 3.7 0.0 132908 4172 ? Ssl 10:31 0:06 ./cmon -i b -c ceph.conf
cmccabe 17032 3.1 0.0 130880 4080 ? Ssl 10:31 0:05 ./cmon -i c -c ceph.conf
cmccabe 17130 0.4 0.1 227556 13400 ? Ssl 10:31 0:00 ./cosd -i 0 -c ceph.conf
cmccabe 17174 0.6 0.1 227556 13180 ? Ssl 10:31 0:01 ./cosd -i 1 -c ceph.conf
cmccabe 17181 0.0 0.1 226556 9712 pts/25 Sl+ 10:31 0:00 ./ceph -c ceph.conf -i dev/mds.a.keyring auth add mds.a

(gdb) thread apply all bt

Thread 5 (Thread 0x7f9029fd9710 (LWP 17183)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
#1 0x00000000005f0417 in Cond::WaitUntil (this=0x9813d8, mutex=@0x981358, when={tv = {tv_sec = 1289241398, tv_nsec = 803867000}}) at ./common/Cond.h:60
#2 0x0000000000686955 in Timer::timer_entry (this=0x9812f0) at common/Timer.cc:119
#3 0x00000000005bf96d in Timer::TimerThread::entry (this=0x981410) at ./common/Timer.h:77
#4 0x00000000005ea116 in Thread::_entry_func (arg=0x981410) at common/Thread.h:39
#5 0x00007f90328c38ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6 0x00007f902dbc202d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()
Current language: auto; currently asm

Thread 4 (Thread 0x7f90297d8710 (LWP 17186)):
#0 0x00007f902dbb7113 in *GI_poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=900000) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00000000005e0d9e in tcp_read (sd=3, buf=0x7f90297d7da3 "?", len=1, timeout=900000) at msg/tcp.cc:18
#2 0x00000000005e6964 in SimpleMessenger::Pipe::reader (this=0x214e260) at msg/SimpleMessenger.cc:1473
#3 0x00000000005c7047 in SimpleMessenger::Pipe::Reader::entry (this=0x214e498) at ./msg/SimpleMessenger.h:193
#4 0x00000000005ea116 in Thread::_entry_func (arg=0x214e498) at common/Thread.h:39
#5 0x00007f90328c38ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6 0x00007f902dbc202d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f90295d6710 (LWP 17237)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00000000005c8c17 in Cond::Wait (this=0x2150748, mutex=@0x2150640) at ./common/Cond.h:46
#2 0x00000000005e4508 in SimpleMessenger::Pipe::writer (this=0x2150570) at msg/SimpleMessenger.cc:1680
#3 0x00000000005c7067 in SimpleMessenger::Pipe::Writer::entry (this=0x21507c0) at ./msg/SimpleMessenger.h:201
#4 0x00000000005ea116 in Thread::_entry_func (arg=0x21507c0) at common/Thread.h:39
#5 0x00007f90328c38ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6 0x00007f902dbc202d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()
Current language: auto; currently c

Thread 2 (Thread 0x7f90296d7710 (LWP 17239)):
#0 0x00007f902dbb7113 in *GI_poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=900000) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00000000005e0d9e in tcp_read (sd=3, buf=0x7f90296d6da3 "?", len=1, timeout=900000) at msg/tcp.cc:18
#2 0x00000000005e6964 in SimpleMessenger::Pipe::reader (this=0x2150570) at msg/SimpleMessenger.cc:1473
#3 0x00000000005c7047 in SimpleMessenger::Pipe::Reader::entry (this=0x21507a8) at ./msg/SimpleMessenger.h:193
#4 0x00000000005ea116 in Thread::_entry_func (arg=0x21507a8) at common/Thread.h:39
#5 0x00007f90328c38ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#6 0x00007f902dbc202d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()
Current language: auto; currently asm

Thread 1 (Thread 0x7f9032ef37c0 (LWP 17181)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00000000005c8c17 in Cond::Wait (this=0x2148888, mutex=@0x21485b8) at ./common/Cond.h:46
#2 0x00000000005df7cf in SimpleMessenger::wait (this=0x2148430) at msg/SimpleMessenger.cc:2559
#3 0x0000000000596410 in main (argc=6, argv=0x2146ad0) at tools/ceph.cc:683
Current language: auto; currently c
162 in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
Current language: auto; currently asm
(gdb)

#9 Updated by Colin McCabe about 9 years ago

The process that is hung is 17181, cephtool.

#10 Updated by Greg Farnum about 9 years ago

  • Status changed from In Progress to Can't reproduce

Look, I know it's a pain, but work on this isn't going to progress unless we collect AT LEAST:
1) The state of each messenger Pipe,
2) All logs in the system, the more detailed the better.
My guess is we're not going to get anywhere on this unless we manage to observe it with full messenger logging on the ceph tool, but maybe we'll get lucky from just the daemon logs and the ceph tool state. Preferably I get the chance to look at it in gdb so I can follow up on any weirdness exposed by the logs and the state.

A backtrace of the threads is no use at all; that looks the same whether the program is behaving or not.

#11 Updated by Colin McCabe about 9 years ago

  • Source set to 0

Look, I know it's a pain, but work on this isn't going to progress unless
we collect AT LEAST:
1) The state of each messenger Pipe,
2) All logs in the system, the more detailed the better.

I can definitely get #2. Will running with debug_ms = 20 be sufficient to get #1?

My guess is we're not going to get anywhere on this unless we manage to
observe it with full messenger logging on the ceph tool, but maybe we'll
get lucky from just the daemon logs and the ceph tool state. Preferably I
get the chance to look at it in gdb so I can follow up on any weirdness
exposed by the logs and the state.

You'll get one chance with gdb. Then when you detach it will start working again!

That's the way this bug goes.

I think that if you were developing on flab, you would not think "can't reproduce" was the right state for this bug. I see it perhaps 3, 4 times a day and it's highly distracting, especially when it crops up during an automated test. Perhaps we should exchange dev machines or something? Just an idea.

#12 Updated by Colin McCabe about 9 years ago

messenger-bug.txt

#13 Updated by Colin McCabe about 9 years ago

cephtool-hang-at-966369aad07461f2610b4dd2a9cdc770155c5a89.txt

#14 Updated by Greg Farnum about 9 years ago

  • Status changed from Can't reproduce to In Progress

After checking the logs and conferring with Sage, I think I've found a possible cause. Designing and testing a fix now. Once I've made sure it doesn't break anything I'm going to need to have Colin run it before pushing, though, since he's the one who kept seeing this bug.

#15 Updated by Greg Farnum about 9 years ago

  • Status changed from In Progress to Testing

Pushed a potential fix to the msgr branch, waiting for Colin to report back on if it works or not. :)

#16 Updated by Colin McCabe about 9 years ago

It looks good so far.

#17 Updated by Greg Farnum about 9 years ago

  • Status changed from Testing to Resolved

Sage spent some time on the messenger too, and I suspect we're done now.

Also available in: Atom PDF