Bug #3692

OSDs abort with "./common/Mutex.h: 89: FAILED assert(nlock == 0)"

Added by Justin Lott over 11 years ago. Updated over 11 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've seen this happen twice:

- Reboot a node running a number of OSDs
- Within a short period of time, seemingly random OSDs running on other nodes in the cluster will terminate with the following message:

2012-12-28 16:06:05.919215 7f1f67368700 -1 ./common/Mutex.h: In function 'Mutex::~Mutex()' thread 7f1f67368700 time 2012-12-28 16:06:05.912066
./common/Mutex.h: 89: FAILED assert(nlock == 0)

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: /usr/bin/ceph-osd() [0x51546e]
2: (SimpleMessenger::Pipe::~Pipe()+0x2af) [0x7471ff]
3: (SimpleMessenger::reaper()+0x50f) [0x782c3f]
4: (SimpleMessenger::reaper_entry()+0x168) [0x783268]
5: (SimpleMessenger::ReaperThread::entry()+0xd) [0x74694d]
6: (()+0x7e9a) [0x7f1f6af48e9a]
7: (clone()+0x6d) [0x7f1f69be5cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Logs are attached. Core dumps are available if desired.
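
For readers unfamiliar with the assertion, a minimal, hypothetical C++ sketch of what `assert(nlock == 0)` in a mutex destructor guards against follows (the class below is illustrative only, not Ceph's actual common/Mutex.h): the backtrace suggests a Pipe is being torn down by the messenger's reaper thread while its mutex is still held, so the destructor's sanity check fires.

#include <cassert>
#include <pthread.h>

// Illustrative sketch only, not Ceph's real common/Mutex.h: a simplified
// wrapper whose destructor asserts the lock is no longer held.
class Mutex {
  pthread_mutex_t m;
  int nlock;                       // how many times the mutex is currently held
public:
  Mutex() : nlock(0) { pthread_mutex_init(&m, NULL); }
  void Lock()   { pthread_mutex_lock(&m); nlock++; }
  void Unlock() { nlock--; pthread_mutex_unlock(&m); }
  ~Mutex() {
    assert(nlock == 0);            // the check that fails in the crash above
    pthread_mutex_destroy(&m);
  }
};

// Hypothetical failure mode: an object owning the mutex is destroyed
// (as the messenger's reaper does with a Pipe) while the mutex is still locked.
struct Pipe {
  Mutex lock;
};

int main() {
  Pipe *p = new Pipe();
  p->lock.Lock();
  delete p;                        // ~Mutex() runs with nlock == 1, assert fires
  return 0;
}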


Files

cephlogs.tar.gz (22.4 MB) Justin Lott, 12/28/2012 12:01 PM
#1

Updated by Justin Lott over 11 years ago

Chronology of events (UTC) in the latest example of this happening, in case it's relevant:

15:50:46 mon.b is stopped on hpbs-c01-s02

:??: hpbs-c01-s02 configured with 10.30.66.5 (m02's current IP)
:??: hpbs-c01-m02 configured with 10.30.66.13 (s02's current IP)

15:54:12 mon.b updated in ceph.conf on entire cluster (set hostname to hpbs-c01-m02)

15:55:07 dns records for s02 and m02 updated

15:56:22 mon.a restarted on hpbs-c01-s01

15:56:47 mon.c restarted on hpbs-c01-m03

16:04:57 hpbs-c01-s02 is rebooted

16:04:58 hpbs-c01-m02 is rebooted

16:06:03 (approximate time) OSDs on other storage chassis SIGABRT'd with the following (identical) messages:

hpbs-c01-s01 (5,13)
hpbs-c01-s05 (56,61,63)
hpbs-c01-s06 (77)
hpbs-c01-s09 (117)
hpbs-c01-s10 (127)
hpbs-c01-s11 (140)
hpbs-c01-s12 (154)
(all time stamps were within 5 seconds of each other)
<time stamp> <thread id> ./common/Mutex.h: In function 'Mutex::~Mutex()' thread <thread id> time <time stamp>
./common/Mutex.h: 89: FAILED assert(nlock == 0)

16:12:13 (approximate time) all SIGABRT'd OSDs are restarted

16:12:42 the cluster loses quorum
2012-12-28 16:12:42.771544 7f1d9cb53700 1 mon.c@2(probing) e1 discarding message auth(proto 0 30 bytes epoch 0) v1 and sending client elsewhere; we are not in quorum
2012-12-28 16:12:51.435783 7f1d9d354700 0 log [INF] : mon.c calling new monitor election
2012-12-28 16:12:51.440984 7feba4313700 0 log [INF] : mon.a calling new monitor election
2012-12-28 16:12:56.424264 7feba3b12700 1 mon.a@0(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum

At this point:
  • all ceph commands to the cluster would hang (since we have no quorum)
  • volume creates/destroys were failing (again, no quorum)
  • disk IO on each storage chassis was in the 300-350 MB/s range (nominal is 30 MB/s at most)
  • load avg was approx 35.00 on every storage node
  • IO operations on RBD volumes attached to VMs would hang indefinitely

16:22:24 cluster regains quorum
2012-12-28 16:22:24.151799 7feba3b12700 0 log [INF] : mon.a@0 won leader election with quorum 0,2
2012-12-28 16:22:25.642174 7f1d9d354700 1 mon.c@2(peon).osd e68019 e68019: 168 osds: 161 up, 162 in

#2

Updated by Sage Weil over 11 years ago

  • Status changed from New to Won't Fix

This is a known problem with argonaut, but the fix is a rewrite of the whole module and we've chosen not to backport it. It is resolved in the new code (~v0.51 and later).
