Bug #3692
OSDs abort with "./common/Mutex.h: 89: FAILED assert(nlock == 0)"
Status: Closed
Description
I've seen this happen twice:
- Reboot a node running a number of OSDs
- Within a short period of time, seemingly random OSDs running on other nodes in the cluster terminate with the following message:
2012-12-28 16:06:05.919215 7f1f67368700 -1 ./common/Mutex.h: In function 'Mutex::~Mutex()' thread 7f1f67368700 time 2012-12-28 16:06:05.912066
./common/Mutex.h: 89: FAILED assert(nlock == 0)
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: /usr/bin/ceph-osd() [0x51546e]
2: (SimpleMessenger::Pipe::~Pipe()+0x2af) [0x7471ff]
3: (SimpleMessenger::reaper()+0x50f) [0x782c3f]
4: (SimpleMessenger::reaper_entry()+0x168) [0x783268]
5: (SimpleMessenger::ReaperThread::entry()+0xd) [0x74694d]
6: (()+0x7e9a) [0x7f1f6af48e9a]
7: (clone()+0x6d) [0x7f1f69be5cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Logs are attached. Core dumps are available if desired.
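
The backtrace suggests what is going on: the crash is in the SimpleMessenger reaper thread while destroying a Pipe. When a node reboots, its connections drop, the messenger on the surviving OSDs reaps the dead Pipes, and a Pipe is torn down while one of its embedded mutexes is still accounted as held. As a rough illustration of the pattern behind the assert (a simplified sketch, not the actual Ceph source), the wrapper counts acquisitions and the destructor insists the count is zero:

  #include <cassert>
  #include <pthread.h>

  class Mutex {
    pthread_mutex_t _m;
    int nlock;            // number of times this mutex is currently held
  public:
    Mutex() : nlock(0) { pthread_mutex_init(&_m, NULL); }
    void Lock()   { pthread_mutex_lock(&_m); nlock++; }
    void Unlock() { assert(nlock > 0); nlock--; pthread_mutex_unlock(&_m); }
    ~Mutex() {
      // "FAILED assert(nlock == 0)": the object is being destroyed
      // while something still holds (or never released) the lock.
      assert(nlock == 0);
      pthread_mutex_destroy(&_m);
    }
  };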
Updated by Justin Lott over 11 years ago
Chronology of events (UTC) in the latest example of this happening, in case it's relevant:
15:50:46 mon.b is stopped on hpbs-c01-s02
:??: hpbs-c01-s02 configured with 10.30.66.5 (m02's current IP)
:??: hpbs-c01-m02 configured with 10.30.66.13 (s02's current IP)
15:54:12 mon.b updated in ceph.conf on entire cluster (set hostname to hpbs-c01-m02)
15:55:07 dns records for s02 and m02 updated
15:56:22 mon.a restarted on hpbs-c01-s01
15:56:47 mon.c restarted on hpbs-c01-m03
16:04:57 hpbs-c01-s02 is rebooted
16:04:58 hpbs-c01-m02 is rebooted
16:06:03 (approximate time) OSDs on other storage chassis SIGABRT'd with the following (identical) messages:
hpbs-c01-s01 (5,13)
hpbs-c01-s05 (56,61,63)
hpbs-c01-s06 (77)
hpbs-c01-s09 (117)
hpbs-c01-s10 (127)
hpbs-c01-s11 (140)
hpbs-c01-s12 (154)
(all time stamps were within 5 seconds of each other)
<time stamp> <thread id> ./common/Mutex.h: In function 'Mutex::~Mutex()' thread <thread id> time <time stamp>
./common/Mutex.h: 89: FAILED assert(nlock == 0)
16:12:13 (approximate time) all SIGABRT'd OSDs are restarted
16:12:42 the cluster loses quorum
2012-12-28 16:12:42.771544 7f1d9cb53700 1 mon.c@2(probing) e1 discarding message auth(proto 0 30 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
2012-12-28 16:12:51.435783 7f1d9d354700 0 log [INF] : mon.c calling new monitor election
2012-12-28 16:12:51.440984 7feba4313700 0 log [INF] : mon.a calling new monitor election
2012-12-28 16:12:56.424264 7feba3b12700 1 mon.a@0(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
- At this point:
  - all ceph commands to the cluster would hang (since we have no quorum)
  - volume creates/destroys were failing (again, no quorum)
  - disk IO on each storage chassis was in the 300-350MB/s range (nominal is 30MB/s at most)
  - load average was approximately 35.00 on every storage node
  - IO operations on RBD volumes attached to VMs would hang indefinitely
16:22:24 cluster regains quorum
2012-12-28 16:22:24.151799 7feba3b12700 0 log [INF] : mon.a@0 won leader election with quorum 0,2
2012-12-28 16:22:25.642174 7f1d9d354700 1 mon.c@2(peon).osd e68019 e68019: 168 osds: 161 up, 162 in
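
One note in case it helps others hitting this: while quorum is lost, the cluster-wide ceph tool hangs as described above, but an individual monitor can usually still be queried directly through its local admin socket (assuming the default socket path; availability of this command depends on the release), e.g.:

  ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status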
Updated by Sage Weil over 11 years ago
- Status changed from New to Won't Fix
This is a known problem with argonaut, but the fix is a rewrite of the whole module and we've chosen not to backport it. It is resolved in the new code (~v0.51 and later).