Bug #3692
OSDs abort with "./common/Mutex.h: 89: FAILED assert(nlock == 0)"
Status: Closed
Description
I've seen this happen twice:
- Reboot a node running a number of OSDs
- Within a short period of time, seemingly random OSDs running on other nodes in the cluster terminate with the following message:
2012-12-28 16:06:05.919215 7f1f67368700 -1 ./common/Mutex.h: In function 'Mutex::~Mutex()' thread 7f1f67368700 time 2012-12-28 16:06:05.912066
./common/Mutex.h: 89: FAILED assert(nlock == 0)
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: /usr/bin/ceph-osd() [0x51546e]
2: (SimpleMessenger::Pipe::~Pipe()+0x2af) [0x7471ff]
3: (SimpleMessenger::reaper()+0x50f) [0x782c3f]
4: (SimpleMessenger::reaper_entry()+0x168) [0x783268]
5: (SimpleMessenger::ReaperThread::entry()+0xd) [0x74694d]
6: (()+0x7e9a) [0x7f1f6af48e9a]
7: (clone()+0x6d) [0x7f1f69be5cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Logs are attached. Core dumps are available if desired.
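
The backtrace suggests what is going on: the crash is in the SimpleMessenger reaper thread while destroying a Pipe. When a node reboots, its connections drop, the messenger on the surviving OSDs reaps the dead Pipes, and a Pipe is torn down while one of its embedded mutexes is still accounted as held. As a rough illustration of the pattern behind the assert (a simplified sketch, not the actual Ceph source), the wrapper counts acquisitions and the destructor insists the count is zero:

  #include <cassert>
  #include <pthread.h>

  class Mutex {
    pthread_mutex_t _m;
    int nlock;            // number of times this mutex is currently held
  public:
    Mutex() : nlock(0) { pthread_mutex_init(&_m, NULL); }
    void Lock()   { pthread_mutex_lock(&_m); nlock++; }
    void Unlock() { assert(nlock > 0); nlock--; pthread_mutex_unlock(&_m); }
    ~Mutex() {
      // "FAILED assert(nlock == 0)": the object is being destroyed
      // while something still holds (or never released) the lock.
      assert(nlock == 0);
      pthread_mutex_destroy(&_m);
    }
  };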
Updated by Justin Lott over 11 years ago
Chronology of events (UTC) in the latest example of this happening, in case it's relevant:
15:50:46 mon.b is stopped on hpbs-c01-s02
:??: hpbs-c01-s02 configured with 10.30.66.5 (m02's current IP)
:??: hpbs-c01-m02 configured with 10.30.66.13 (s02's current IP)
15:54:12 mon.b updated in ceph.conf on entire cluster (set hostname to hpbs-c01-m02)
15:55:07 dns records for s02 and m02 updated
15:56:22 mon.a restarted on hpbs-c01-s01
15:56:47 mon.c restarted on hpbs-c01-m03
16:04:57 hpbs-c01-s02 is rebooted
16:04:58 hpbs-c01-m02 is rebooted
16:06:03 (approximate time) OSDs on other storage chassis SIGABRT'd with the following (identical) messages:
hpbs-c01-s01 (5,13)
hpbs-c01-s05 (56,61,63)
hpbs-c01-s06 (77)
hpbs-c01-s09 (117)
hpbs-c01-s10 (127)
hpbs-c01-s11 (140)
hpbs-c01-s12 (154)
(all time stamps were within 5 seconds of each other)
<time stamp> <thread id> ./common/Mutex.h: In function 'Mutex::~Mutex()' thread <thread id> time <time stamp>
./common/Mutex.h: 89: FAILED assert(nlock == 0)
16:12:13 (approximate time) all SIGABRT'd OSDs are restarted
16:12:42 the cluster loses quorum
2012-12-28 16:12:42.771544 7f1d9cb53700 1 mon.c@2(probing) e1 discarding message auth(proto 0 30 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
2012-12-28 16:12:51.435783 7f1d9d354700 0 log [INF] : mon.c calling new monitor election
2012-12-28 16:12:51.440984 7feba4313700 0 log [INF] : mon.a calling new monitor election
2012-12-28 16:12:56.424264 7feba3b12700 1 mon.a@0(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
- At this point:
  - all ceph commands to the cluster would hang (since we have no quorum)
  - volume creates/destroys were failing (again, no quorum)
  - disk IO on each storage chassis was in the 300-350MB/s range (nominal is 30MB/s at most)
  - load average was approximately 35.00 on every storage node
  - IO operations on RBD volumes attached to VMs would hang indefinitely
16:22:24 cluster regains quorum
2012-12-28 16:22:24.151799 7feba3b12700 0 log [INF] : mon.a@0 won leader election with quorum 0,2
2012-12-28 16:22:25.642174 7f1d9d354700 1 mon.c@2(peon).osd e68019 e68019: 168 osds: 161 up, 162 in
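
One note in case it helps others hitting this: while quorum is lost, the cluster-wide ceph tool hangs as described above, but an individual monitor can usually still be queried directly through its local admin socket (assuming the default socket path; availability of this command depends on the release), e.g.:

  ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status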
Updated by Sage Weil over 11 years ago
- Status changed from New to Won't Fix
This is a known problem with argonaut, but the fix is a rewrite of the whole module and we've chosen not to backport it. It is resolved in the new code (~v0.51 and later).