Bug #746

closed

core dump on radostool failure

Added by Colin McCabe about 13 years ago. Updated about 13 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
-
% Done:
0%


Description

radostool failed because it could not authenticate with any of the monitors (the attempt timed out after 30 seconds). However, instead of failing gracefully, it dumped core.

Output:

cmccabe@flab:~/src/ceph/src$ ./rados -p data put obj001 /tmp/tmp.Jef7itlGo1/ver1
2011-01-26 03:29:52.935812 7fa304118720 -- :/22836 messenger.start
2011-01-26 03:29:52.936188 7fa304118720 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12c1820
2011-01-26 03:29:55.936187 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1a10
2011-01-26 03:29:55.936360 7fa3012d7710 -- :/22836 --> mon0 192.168.0.10:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:29:58.936600 7fa3012d7710 -- :/22836 mark_down 192.168.0.10:6789/0 -- 0x12d07c0
2011-01-26 03:29:58.936672 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d0390
2011-01-26 03:30:01.936857 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:01.936917 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:04.937074 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:04.937135 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d0390
2011-01-26 03:30:07.937294 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:07.937354 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:10.937508 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:10.937567 7fa3012d7710 -- :/22836 --> mon0 192.168.0.10:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d02c0
2011-01-26 03:30:13.936845 7fa300ad6710 -- :/22836 >> 192.168.0.10:6789/0 pipe(0x12c1820 sd=3 pgs=0 cs=0 l=0).fault first fault
2011-01-26 03:30:13.937751 7fa3012d7710 -- :/22836 mark_down 192.168.0.10:6789/0 -- 0x12c1820
2011-01-26 03:30:13.937814 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d05a0
2011-01-26 03:30:16.937970 7fa3012d7710 -- :/22836 mark_down 192.168.0.12:6789/0 -- 0x12d0790
2011-01-26 03:30:16.938031 7fa3012d7710 -- :/22836 --> mon1 192.168.0.11:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d02c0
2011-01-26 03:30:19.938193 7fa3012d7710 -- :/22836 mark_down 192.168.0.11:6789/0 -- 0x12c1820
2011-01-26 03:30:19.938253 7fa3012d7710 -- :/22836 --> mon2 192.168.0.12:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x12d04b0
2011-01-26 03:30:22.936462 7fa304118720 monclient(hunting): authenticate timed out after 30
2011-01-26 03:30:22.936520 7fa304118720 librados: client.admin authentication error Connection timed out
couldn't initialize rados!
2011-01-26 03:30:22.936643 7fa3009d5710 -- :/22836 >> 192.168.0.12:6789/0 pipe(0x12d06a0 sd=4 pgs=0 cs=0 l=0).fault first fault
./common/Mutex.h: In function 'Mutex::~Mutex()', In thread 7fa304118720
./common/Mutex.h:97: FAILED assert(nlock == 0)
 ceph version 0.25~rc (commit:5a15bca2d327aef73756209c7e1c18fa32f86767)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x34) [0x7fa303bfc4bc]
 2: (Mutex::~Mutex()+0x30) [0x41ae14]
 3: (()+0x34e43a) [0x7fa303c1c43a]
 4: (__cxa_finalize()+0xa5) [0x7fa302bdb965]
 5: (()+0x21aa33) [0x7fa303ae8a33]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) ***
in thread 7fa304118720
 ceph version 0.25~rc (commit:5a15bca2d327aef73756209c7e1c18fa32f86767)
 1: (ceph::BackTrace::BackTrace(int)+0x2a) [0x7fa303bfc7a6]
 2: (()+0x34f300) [0x7fa303c1d300]
 3: (()+0xef60) [0x7fa3036c0f60]
 4: (gsignal()+0x35) [0x7fa302bd7165]
 5: (abort()+0x180) [0x7fa302bd9f70]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fa30346adc5]
 7: (()+0xcb166) [0x7fa303469166]
 8: (()+0xcb193) [0x7fa303469193]
 9: (()+0xcb28e) [0x7fa30346928e]
 a: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b3) [0x7fa303bfc63b]
 b: (Mutex::~Mutex()+0x30) [0x41ae14]
 c: (()+0x34e43a) [0x7fa303c1c43a]
 d: (__cxa_finalize()+0xa5) [0x7fa302bdb965]
 e: (()+0x21aa33) [0x7fa303ae8a33]
Aborted (core dumped)

The problem seems to be this assertion:

(gdb) list
92            // locked or a mutex which is unlocked, undefined behavior results.
93            pthread_mutex_init(&_m, NULL);
94          }
95        }
96        ~Mutex() {
97          assert(nlock == 0);
98          pthread_mutex_destroy(&_m);
99        }
100
101       bool is_locked() {

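We need to ensure that all mutexes are unlocked before calling exit(). exit() runs the destructors of static objects (the __cxa_finalize frame in the backtrace), and ~Mutex() asserts that the mutex is not held. Alternatively, we could call something like _exit(), which skips running destructors. The former solution is much more desirable!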
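
As a rough illustration of the first option, the sketch below uses a scoped guard so that the mutex is released on every return path, including the error path that leads to exit(). It is not the change made in the fix commit. CountingMutex and init_client are hypothetical names invented for this example; CountingMutex only imitates the nlock bookkeeping of the Mutex in common/Mutex.h, and exposes the destructor's check as a method instead of asserting.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the Mutex in common/Mutex.h: it counts
// outstanding locks so that a still-held mutex can be detected.
class CountingMutex {
public:
  void lock()   { m_.lock(); ++nlock_; }
  void unlock() { --nlock_; m_.unlock(); }
  bool is_locked() const { return nlock_ > 0; }

  // The real ~Mutex() does assert(nlock == 0). This sketch exposes the
  // same condition as a query so it can be checked without aborting.
  bool safe_to_destroy() const { return nlock_ == 0; }

private:
  std::mutex m_;
  int nlock_ = 0;  // only modified while m_ is held
};

// Hypothetical initialization routine (an assumption, not Ceph code).
// Returns 0 on success and -1 on failure. The std::lock_guard unlocks
// the mutex on every return path, so a caller that reacts to the error
// by calling exit() finds the mutex already released.
int init_client(CountingMutex& mtx, bool monitors_reachable) {
  std::lock_guard<CountingMutex> guard(mtx);
  if (!monitors_reachable) {
    return -1;  // guard's destructor unlocks here
  }
  return 0;
}
```

A caller would check the return value of init_client, print the error, and only then call exit(). Because the guard has already gone out of scope by that point, a static mutex destroyed during exit() would pass a check like assert(nlock == 0).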

Actions #1

Updated by Sage Weil about 13 years ago

  • Assignee set to Yehuda Sadeh
Actions #2

Updated by Sage Weil about 13 years ago

  • Status changed from New to Resolved

This looks like it is fixed by 027335afe30127f841a5ea875e173ffc4cd7cf91.

Actions #3

Updated by Colin McCabe about 13 years ago

Although it wasn't apparent from my bug report, 027335afe30127f841a5ea875e173ffc4cd7cf91 was a parent revision of 5a15bca2d327aef73756209c7e1c18fa32f86767. Sorry for the confusion.

I fixed this one in eda48faf36e03156e0b6745c247244995989b1e1.
