Bug #4357
osd: FAILED assert("join on thread that was never started" == 0)
Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
I found #1650, which seems related, but it is rather old and covers a different use case.
I got a message from my monitoring system that the health of a small cluster was not OK. It turns out all 12 OSDs went down with the same backtrace/message:
    -3> 2013-03-05 19:04:18.679948 7f5f37300780 10 -- [2a00:f10:113:0:d585:1138:64c6:be36]:6806/8564 wait: dispatch queue is stopped
    -2> 2013-03-05 19:04:18.679971 7f5f37300780 20 -- [2a00:f10:113:0:d585:1138:64c6:be36]:6806/8564 wait: stopping accepter thread
    -1> 2013-03-05 19:04:18.679984 7f5f37300780 10 accepter.stop accepter
     0> 2013-03-05 19:04:18.683892 7f5f37300780 -1 common/Thread.cc: In function 'int Thread::join(void**)' thread 7f5f37300780 time 2013-03-05 19:04:18.679999
    common/Thread.cc: 117: FAILED assert("join on thread that was never started" == 0)
    ceph version 0.56.3-19-g8c6f522 (8c6f52215240f48b5e4d5bb99a5f2f451e7ce70a)
    1: (Thread::join(void**)+0x41) [0x823ee1]
    2: (Accepter::stop()+0x7b) [0x8af5fb]
    3: (SimpleMessenger::wait()+0xa4a) [0x81d6ba]
    4: (main()+0x2282) [0x5733f2]
    5: (__libc_start_main()+0xed) [0x7f5f34f9c76d]
    6: /usr/bin/ceph-osd() [0x575909]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I attached the logs of two OSDs, but I want to mention again that ALL 12 OSDs went down with the same backtrace within about two minutes, rendering the cluster unable to do any I/O.