Bug #532

OSD: repop_queue.front() == repop

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%


Description

On two of my OSDs I had the following crash:

Core was generated by `/usr/bin/cosd -i 3 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00000000005d97c1 in sigabrt_handler (signum=6) at config.cc:238
#2  <signal handler called>
#3  0x00007fce0c446a75 in raise () from /lib/libc.so.6
#4  0x00007fce0c44a5c0 in abort () from /lib/libc.so.6
#5  0x00007fce0ccfc8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#6  0x00007fce0ccfad16 in ?? () from /usr/lib/libstdc++.so.6
#7  0x00007fce0ccfad43 in std::terminate() () from /usr/lib/libstdc++.so.6
#8  0x00007fce0ccfae3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#9  0x00000000005c7098 in ceph::__ceph_assert_fail (assertion=0x5f2e03 "repop_queue.front() == repop", 
    file=<value optimized out>, line=2024, func=<value optimized out>) at common/assert.cc:30
#10 0x0000000000479e72 in ReplicatedPG::eval_repop (this=0x2585700, repop=0x2e80d20) at osd/ReplicatedPG.cc:2024
#11 0x000000000047ccda in ReplicatedPG::op_applied (this=0x2585700, repop=0x2e80d20) at osd/ReplicatedPG.cc:1914
#12 0x00000000004b7a61 in C_OSD_OpApplied::finish(int) ()
#13 0x00000000005c60b8 in Finisher::finisher_thread_entry (this=0xe125f8) at common/Finisher.cc:54
#14 0x000000000046e73a in Thread::_entry_func (arg=0x5125) at ./common/Thread.h:39
#15 0x00007fce0d5419ca in start_thread () from /lib/libpthread.so.0
#16 0x00007fce0c4f96fd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()

osd5 (node06) also went down with a message about repop_queue.front() == repop.

I have no clue what could have triggered this; the cluster had just had a fresh mkcephfs, so I have no idea how to reproduce it.

I've used cdebugpack to gather the relevant information; both packs have been uploaded to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_repop_queue

Restarting the OSDs goes fine; they don't crash again.
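
For context, the failed assertion in ReplicatedPG::eval_repop (osd/ReplicatedPG.cc:2024) enforces that replicated writes are retired strictly in submission order: the op being removed is expected to be the oldest one still in the queue. A minimal sketch of that invariant, using hypothetical simplified types rather than the actual Ceph code:

#include <cassert>
#include <deque>

struct RepOp {
    unsigned rep_tid;   // replication transaction id, assigned in submission order
};

struct RepOpQueue {
    std::deque<RepOp*> repop_queue;   // FIFO of in-flight replicated writes

    void submit(RepOp* repop) {
        repop_queue.push_back(repop);   // ops enter in submission order
    }

    // Called once 'repop' has been applied locally and acked by all replicas.
    void retire(RepOp* repop) {
        // Ops are expected to complete in order, so the op being retired
        // must be the oldest outstanding one, i.e. the queue front.
        assert(repop_queue.front() == repop);
        repop_queue.pop_front();
    }
};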

#1

Updated by Wido den Hollander over 13 years ago

I think I was a bit too premature about that, since osd5 just crashed again with the same backtrace.

2010-11-01 20:48:03.971955 7f47067dc710 cephx: verify_authorizer ok nonce 4cca5a6223d6f34d reply_bl.length()=36
2010-11-01 20:48:04.059924 7f47097e2710 osd5 980 pg[1.74( v 980'130 (980'127,980'130] n=1 ec=2 les=980 976/979/979) [5,3,10] r=0 mlcod 980'128 active+clean]  removing repgather(0x2d7c0f0 applied 980'130 rep_tid=189 wfack= wfdisk= op=osd_op(mds0.2:160 200.00000001 [write 919796~3954] 1.f474) v1)
2010-11-01 20:48:04.059995 7f47097e2710 osd5 980 pg[1.74( v 980'130 (980'127,980'130] n=1 ec=2 les=980 976/979/979) [5,3,10] r=0 mlcod 980'128 active+clean]    q front is repgather(0x2d7c4b0 applied 980'129 rep_tid=188 wfack=3,10 wfdisk=3,10 op=osd_op(mds0.2:158 200.00000001 [write 910707~9089] 1.f474) v1)
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)':
osd/ReplicatedPG.cc:2024: FAILED assert(repop_queue.front() == repop)
 ceph version 0.22 (commit:8a7c95f60ad0d821443721abf9779b8e2656ace8)
 1: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x168) [0x47c0a8]
 2: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x13c) [0x47c3cc]
 3: (OSD::dequeue_op(PG*)+0x122) [0x4d9b92]
 4: (ThreadPool::worker()+0x28f) [0x5c775f]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x4fd26d]
 6: (Thread::_entry_func(void*)+0xa) [0x46e73a]
 7: (()+0x69ca) [0x7f4712e519ca]
 8: (clone()+0x6d) [0x7f4711e096fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.22 (commit:8a7c95f60ad0d821443721abf9779b8e2656ace8)
 1: (sigabrt_handler(int)+0x7d) [0x5d978d]
 2: (()+0x33af0) [0x7f4711d56af0]
 3: (gsignal()+0x35) [0x7f4711d56a75]
 4: (abort()+0x180) [0x7f4711d5a5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f471260c8e5]
 6: (()+0xcad16) [0x7f471260ad16]
 7: (()+0xcad43) [0x7f471260ad43]
 8: (()+0xcae3e) [0x7f471260ae3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5c7098]
 10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x862) [0x479e72]
 11: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x168) [0x47c0a8]
 12: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x13c) [0x47c3cc]
 13: (OSD::dequeue_op(PG*)+0x122) [0x4d9b92]
 14: (ThreadPool::worker()+0x28f) [0x5c775f]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x4fd26d]
 16: (Thread::_entry_func(void*)+0xa) [0x46e73a]
 17: (()+0x69ca) [0x7f4712e519ca]
 18: (clone()+0x6d) [0x7f4711e096fd]

Logging wasn't that high at the moment; I'll turn it up to see whether it crashes again with a bit more information.
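
The log lines above show exactly the state the assertion rejects: the repgather being removed (rep_tid=189) is newer than the queue front (rep_tid=188), which is still waiting for acks and disk commits from osds 3 and 10.

For reference, raising OSD logging is usually done in ceph.conf before restarting the daemon; a sketch assuming the usual option names (exact names and useful levels may differ in a release as old as v0.22):

[osd]
        debug osd = 20
        debug ms = 1

This should capture more of the repop bookkeeping leading up to the assert.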

#2

Updated by Sage Weil over 13 years ago

This problem was present in v0.22 but was fixed in v0.22.1. Can you try with the latest testing release (v0.22.2) or unstable?

#3

Updated by Wido den Hollander over 13 years ago

  • Status changed from New to Closed

Indeed, my build system was still building the rc branch. Oops!
