Bug #746
closed
core dump on radostool failure
% Done: 0%
Description
radostool failed because of a problem with some daemons. However, instead of failing gracefully, it dumped core.
Output:
cmccabe@flab:~/src/ceph/src$ ./rados -p data put obj001 /tmp/tmp.Jef7itlGo1/ver1
2011-01-26 03:29:52.935812 7fa304118720 -- :/22836 messenger.start
2011-01-26 03:29:52.936188 7fa304118720 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12c1820
2011-01-26 03:29:55.936187 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1a10
2011-01-26 03:29:55.936360 7fa3012d7710 -- :/22836 --> mon0 192.168.0.10:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:29:58.936600 7fa3012d7710 -- :/22836 mark_down 192.168.0.10:6789/0 -- 0x12d07c0
2011-01-26 03:29:58.936672 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d0390
2011-01-26 03:30:01.936857 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:01.936917 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:04.937074 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:04.937135 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d0390
2011-01-26 03:30:07.937294 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:07.937354 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:10.937508 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:10.937567 7fa3012d7710 -- :/22836 --> mon0 192.168.0.10:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d02c0
2011-01-26 03:30:13.936845 7fa300ad6710 -- :/22836 >> 192.168.0.10:6789/0 pipe(0x12c1820 sd=3 pgs=0 cs=0 l=0).fault first fault
2011-01-26 03:30:13.937751 7fa3012d7710 -- :/22836 mark_down 192.168.0.10:6789/0 -- 0x12c1820
2011-01-26 03:30:13.937814 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:16.937970 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:16.938031 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d02c0
2011-01-26 03:30:19.938193 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:19.938253 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d04b0
2011-01-26 03:30:22.936462 7fa304118720 monclient(hunting): authenticate timed out after 30
2011-01-26 03:30:22.936520 7fa304118720 librados: client.admin authentication error Connection timed out
couldn't initialize rados!
2011-01-26 03:30:22.936643 7fa3009d5710 -- :/22836 >> 192.168.0.12:6789/0 pipe(0x12d06a0 sd=4 pgs=0 cs=0 l=0).fault first fault
./common/Mutex.h: In function 'Mutex::~Mutex()', In thread 7fa304118720
./common/Mutex.h:97: FAILED assert(nlock == 0)
ceph version 0.25~rc (commit:5a15bca2d327aef73756209c7e1c18fa32f86767)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x34) [0x7fa303bfc4bc]
2: (Mutex::~Mutex()+0x30) [0x41ae14]
3: (()+0x34e43a) [0x7fa303c1c43a]
4: (__cxa_finalize()+0xa5) [0x7fa302bdb965]
5: (()+0x21aa33) [0x7fa303ae8a33]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) ***
in thread 7fa304118720
ceph version 0.25~rc (commit:5a15bca2d327aef73756209c7e1c18fa32f86767)
1: (ceph::BackTrace::BackTrace(int)+0x2a) [0x7fa303bfc7a6]
2: (()+0x34f300) [0x7fa303c1d300]
3: (()+0xef60) [0x7fa3036c0f60]
4: (gsignal()+0x35) [0x7fa302bd7165]
5: (abort()+0x180) [0x7fa302bd9f70]
6: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fa30346adc5]
7: (()+0xcb166) [0x7fa303469166]
8: (()+0xcb193) [0x7fa303469193]
9: (()+0xcb28e) [0x7fa30346928e]
a: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b3) [0x7fa303bfc63b]
b: (Mutex::~Mutex()+0x30) [0x41ae14]
c: (()+0x34e43a) [0x7fa303c1c43a]
d: (__cxa_finalize()+0xa5) [0x7fa302bdb965]
e: (()+0x21aa33) [0x7fa303ae8a33]
Aborted (core dumped)
The problem seems to be this assertion:
(gdb) list
92        // locked or a mutex which is unlocked, undefined behavior results.
93        pthread_mutex_init(&_m, NULL);
94      }
95    }
96    ~Mutex() {
97      assert(nlock == 0);
98      pthread_mutex_destroy(&_m);
99    }
100
101   bool is_locked() {
We need to ensure that we unlock all mutexes before calling exit. Alternatively, we could call something like _exit, which skips running destructors. The former solution is much more desirable!
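A minimal sketch of the first option, assuming the failure path holds a lock when it gives up. CountedMutex, ScopedLock and init_client are hypothetical names for illustration, not the actual Ceph code. The idea is to release the lock through a scoped guard and report the error by return value, so that no mutex is still held when static destructors run:

```cpp
#include <cassert>
#include <cerrno>
#include <mutex>

// Hypothetical stand-in for the Mutex in common/Mutex.h: it counts
// outstanding locks so the destructor can check that none remain.
class CountedMutex {
public:
    void lock()   { m_.lock(); ++nlock_; }
    void unlock() { --nlock_; m_.unlock(); }
    int  nlock() const { return nlock_; }
    ~CountedMutex() { assert(nlock_ == 0); }  // same check as Mutex.h:97
private:
    std::mutex m_;
    int nlock_ = 0;
};

// Scoped guard: locks on construction, unlocks on destruction, so every
// return path out of the enclosing scope releases the mutex.
class ScopedLock {
public:
    explicit ScopedLock(CountedMutex& m) : m_(m) { m_.lock(); }
    ~ScopedLock() { m_.unlock(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;
private:
    CountedMutex& m_;
};

// Hypothetical initialization routine. On failure it returns an error
// code instead of calling exit() while the lock is held; the guard's
// destructor has released the mutex by the time the caller sees the error.
int init_client(CountedMutex& m, bool auth_ok) {
    ScopedLock guard(m);
    if (!auth_ok)
        return -ETIMEDOUT;
    return 0;
}
```

Note that the guard only helps if the error propagates by return (or exception) up to main. exit() does not unwind the stack, so calling it directly inside init_client would skip the guard's destructor and leave the mutex locked for the static destructors to trip over.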