Bug #746
closed
core dump on radostool failure
% Done:
0%
Description
radostool failed because of a problem with some daemons. However, instead of failing gracefully, it dumped core.
Output:
cmccabe@flab:~/src/ceph/src$ ./rados -p data put obj001 /tmp/tmp.Jef7itlGo1/ver1
2011-01-26 03:29:52.935812 7fa304118720 -- :/22836 messenger.start
2011-01-26 03:29:52.936188 7fa304118720 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12c1820
2011-01-26 03:29:55.936187 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1a10
2011-01-26 03:29:55.936360 7fa3012d7710 -- :/22836 --> mon0 192.168.0.10:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:29:58.936600 7fa3012d7710 -- :/22836 mark_down 192.168.0.10:6789/0 -- 0x12d07c0
2011-01-26 03:29:58.936672 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d0390
2011-01-26 03:30:01.936857 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:01.936917 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:04.937074 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:04.937135 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d0390
2011-01-26 03:30:07.937294 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:07.937354 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:10.937508 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:10.937567 7fa3012d7710 -- :/22836 --> mon0 192.168.0.10:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d02c0
2011-01-26 03:30:13.936845 7fa300ad6710 -- :/22836 >> 192.168.0.10:6789/0 pipe(0x12c1820 sd=3 pgs=0 cs=0 l=0).fault first fault
2011-01-26 03:30:13.937751 7fa3012d7710 -- :/22836 mark_down 192.168.0.10:6789/0 -- 0x12c1820
2011-01-26 03:30:13.937814 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:16.937970 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:16.938031 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d02c0
2011-01-26 03:30:19.938193 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:19.938253 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d04b0
2011-01-26 03:30:22.936462 7fa304118720 monclient(hunting): authenticate timed out after 30
2011-01-26 03:30:22.936520 7fa304118720 librados: client.admin authentication error Connection timed out
couldn't initialize rados!
2011-01-26 03:30:22.936643 7fa3009d5710 -- :/22836 >> 192.168.0.12:6789/0 pipe(0x12d06a0 sd=4 pgs=0 cs=0 l=0).fault first fault
./common/Mutex.h: In function 'Mutex::~Mutex()', In thread 7fa304118720
./common/Mutex.h:97: FAILED assert(nlock == 0)
ceph version 0.25~rc (commit:5a15bca2d327aef73756209c7e1c18fa32f86767)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x34) [0x7fa303bfc4bc]
2: (Mutex::~Mutex()+0x30) [0x41ae14]
3: (()+0x34e43a) [0x7fa303c1c43a]
4: (__cxa_finalize()+0xa5) [0x7fa302bdb965]
5: (()+0x21aa33) [0x7fa303ae8a33]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) ***
in thread 7fa304118720
ceph version 0.25~rc (commit:5a15bca2d327aef73756209c7e1c18fa32f86767)
1: (ceph::BackTrace::BackTrace(int)+0x2a) [0x7fa303bfc7a6]
2: (()+0x34f300) [0x7fa303c1d300]
3: (()+0xef60) [0x7fa3036c0f60]
4: (gsignal()+0x35) [0x7fa302bd7165]
5: (abort()+0x180) [0x7fa302bd9f70]
6: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fa30346adc5]
7: (()+0xcb166) [0x7fa303469166]
8: (()+0xcb193) [0x7fa303469193]
9: (()+0xcb28e) [0x7fa30346928e]
a: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b3) [0x7fa303bfc63b]
b: (Mutex::~Mutex()+0x30) [0x41ae14]
c: (()+0x34e43a) [0x7fa303c1c43a]
d: (__cxa_finalize()+0xa5) [0x7fa302bdb965]
e: (()+0x21aa33) [0x7fa303ae8a33]
Aborted (core dumped)
The problem seems to be this assertion:
(gdb) list
92          // locked or a mutex which is unlocked, undefined behavior results.
93          pthread_mutex_init(&_m, NULL);
94        }
95      }
96      ~Mutex() {
97        assert(nlock == 0);
98        pthread_mutex_destroy(&_m);
99      }
100
101     bool is_locked() {
We need to ensure that we unlock all mutexes before calling exit. Either that, or we call something like _exit that skips running destructors. The former solution is much more desirable!
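As a rough illustration of the first option, here is a minimal sketch, not the actual fix. It assumes the failing path exits while a statically allocated Mutex is still held, which is what the ~Mutex frame under __cxa_finalize suggests. CountingMutex, ScopedLock and connect_with_lock are hypothetical names used only for this example; they are not Ceph classes or functions.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical stand-in for the Mutex in common/Mutex.h: it counts
// outstanding locks so the destructor can detect a lock that is still held.
class CountingMutex {
public:
  CountingMutex() : nlock_(0) { pthread_mutex_init(&m_, NULL); }
  ~CountingMutex() {
    assert(nlock_ == 0);  // same kind of check as the one that fired above
    pthread_mutex_destroy(&m_);
  }
  void lock()   { pthread_mutex_lock(&m_); ++nlock_; }
  void unlock() { --nlock_; pthread_mutex_unlock(&m_); }
  bool is_locked() const { return nlock_ > 0; }

private:
  CountingMutex(const CountingMutex&);             // non-copyable
  CountingMutex& operator=(const CountingMutex&);  // non-copyable
  pthread_mutex_t m_;
  int nlock_;
};

// Hypothetical scoped guard: the lock is released whenever the enclosing
// scope is left, including early returns on error.
class ScopedLock {
public:
  explicit ScopedLock(CountingMutex& m) : m_(m) { m_.lock(); }
  ~ScopedLock() { m_.unlock(); }

private:
  ScopedLock(const ScopedLock&);
  ScopedLock& operator=(const ScopedLock&);
  CountingMutex& m_;
};

// Hypothetical init routine: on failure it returns an error code instead
// of calling exit() with the lock held, so the caller can exit afterwards.
int connect_with_lock(CountingMutex& m, bool auth_ok) {
  ScopedLock guard(m);
  if (!auth_ok)
    return -ETIMEDOUT;  // guard's destructor unlocks before we return
  return 0;
}
```

With this shape, the caller can report the error and return from main (or call exit) only after connect_with_lock has returned, so no lock is outstanding when static destructors run.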
Updated by Sage Weil about 13 years ago
- Status changed from New to Resolved
This looks like it is fixed by 027335afe30127f841a5ea875e173ffc4cd7cf91.
Updated by Colin McCabe about 13 years ago
Although it wasn't apparent from my bug report, 027335afe30127f841a5ea875e173ffc4cd7cf91 was a parent revision of 5a15bca2d327aef73756209c7e1c18fa32f86767. Sorry for the confusion.
I fixed this one in eda48faf36e03156e0b6745c247244995989b1e1