Bug #532
Closed
OSD: repop_queue.front() == repop
Description
On two of my OSDs I had the following crash:
Core was generated by `/usr/bin/cosd -i 3 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00000000005d97c1 in sigabrt_handler (signum=6) at config.cc:238
#2 <signal handler called>
#3 0x00007fce0c446a75 in raise () from /lib/libc.so.6
#4 0x00007fce0c44a5c0 in abort () from /lib/libc.so.6
#5 0x00007fce0ccfc8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#6 0x00007fce0ccfad16 in ?? () from /usr/lib/libstdc++.so.6
#7 0x00007fce0ccfad43 in std::terminate() () from /usr/lib/libstdc++.so.6
#8 0x00007fce0ccfae3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#9 0x00000000005c7098 in ceph::__ceph_assert_fail (assertion=0x5f2e03 "repop_queue.front() == repop", file=<value optimized out>, line=2024, func=<value optimized out>) at common/assert.cc:30
#10 0x0000000000479e72 in ReplicatedPG::eval_repop (this=0x2585700, repop=0x2e80d20) at osd/ReplicatedPG.cc:2024
#11 0x000000000047ccda in ReplicatedPG::op_applied (this=0x2585700, repop=0x2e80d20) at osd/ReplicatedPG.cc:1914
#12 0x00000000004b7a61 in C_OSD_OpApplied::finish(int) ()
#13 0x00000000005c60b8 in Finisher::finisher_thread_entry (this=0xe125f8) at common/Finisher.cc:54
#14 0x000000000046e73a in Thread::_entry_func (arg=0x5125) at ./common/Thread.h:39
#15 0x00007fce0d5419ca in start_thread () from /lib/libpthread.so.0
#16 0x00007fce0c4f96fd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()
osd5 (node06) also went down with a message about repop_queue.front() == repop.
I have no clue what could have triggered this. The cluster had just had a fresh mkcephfs, so I have no idea how to reproduce it.
I've used cdebugpack to gather the relevant information; both packs have been uploaded to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_repop_queue
Restarting the OSDs goes fine; they don't crash again.
Updated by Wido den Hollander over 13 years ago
I think I was a bit too premature about that, since osd5 just crashed again with the same backtrace.
2010-11-01 20:48:03.971955 7f47067dc710 cephx: verify_authorizer ok nonce 4cca5a6223d6f34d reply_bl.length()=36
2010-11-01 20:48:04.059924 7f47097e2710 osd5 980 pg[1.74( v 980'130 (980'127,980'130] n=1 ec=2 les=980 976/979/979) [5,3,10] r=0 mlcod 980'128 active+clean] removing repgather(0x2d7c0f0 applied 980'130 rep_tid=189 wfack= wfdisk= op=osd_op(mds0.2:160 200.00000001 [write 919796~3954] 1.f474) v1)
2010-11-01 20:48:04.059995 7f47097e2710 osd5 980 pg[1.74( v 980'130 (980'127,980'130] n=1 ec=2 les=980 976/979/979) [5,3,10] r=0 mlcod 980'128 active+clean] q front is repgather(0x2d7c4b0 applied 980'129 rep_tid=188 wfack=3,10 wfdisk=3,10 op=osd_op(mds0.2:158 200.00000001 [write 910707~9089] 1.f474) v1)
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)':
osd/ReplicatedPG.cc:2024: FAILED assert(repop_queue.front() == repop)
ceph version 0.22 (commit:8a7c95f60ad0d821443721abf9779b8e2656ace8)
1: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x168) [0x47c0a8]
2: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x13c) [0x47c3cc]
3: (OSD::dequeue_op(PG*)+0x122) [0x4d9b92]
4: (ThreadPool::worker()+0x28f) [0x5c775f]
5: (ThreadPool::WorkThread::entry()+0xd) [0x4fd26d]
6: (Thread::_entry_func(void*)+0xa) [0x46e73a]
7: (()+0x69ca) [0x7f4712e519ca]
8: (clone()+0x6d) [0x7f4711e096fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
ceph version 0.22 (commit:8a7c95f60ad0d821443721abf9779b8e2656ace8)
1: (sigabrt_handler(int)+0x7d) [0x5d978d]
2: (()+0x33af0) [0x7f4711d56af0]
3: (gsignal()+0x35) [0x7f4711d56a75]
4: (abort()+0x180) [0x7f4711d5a5c0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f471260c8e5]
6: (()+0xcad16) [0x7f471260ad16]
7: (()+0xcad43) [0x7f471260ad43]
8: (()+0xcae3e) [0x7f471260ae3e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5c7098]
10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x862) [0x479e72]
11: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x168) [0x47c0a8]
12: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x13c) [0x47c3cc]
13: (OSD::dequeue_op(PG*)+0x122) [0x4d9b92]
14: (ThreadPool::worker()+0x28f) [0x5c775f]
15: (ThreadPool::WorkThread::entry()+0xd) [0x4fd26d]
16: (Thread::_entry_func(void*)+0xa) [0x46e73a]
17: (()+0x69ca) [0x7f4712e519ca]
18: (clone()+0x6d) [0x7f4711e096fd]
Logging wasn't that high at the moment; I'll raise it to see whether it crashes again with a bit more information.
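For readers unfamiliar with the assert, the log above shows the repgather being removed (rep_tid=189, no outstanding acks) while the queue front is still rep_tid=188, which is waiting on acks from osds 3 and 10. The following is only a minimal sketch of that in-order invariant, not Ceph's code: the `Repop` struct and both helper functions are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <deque>
#include <set>

// Hypothetical stand-in for a replicated op; this is NOT Ceph's RepGather.
struct Repop {
    unsigned rep_tid;
    bool applied;
    std::set<int> waiting_for_ack;  // replica ids that have not acked yet

    bool done() const { return applied && waiting_for_ack.empty(); }
};

// True if `repop` is finished and sits at the front of the queue,
// i.e. removing it would satisfy an assert like
// `repop_queue.front() == repop`.
bool can_remove_in_order(const std::deque<Repop*>& queue, const Repop* repop) {
    return repop->done() && !queue.empty() && queue.front() == repop;
}

// True if `repop` is finished but something else is at the front,
// which is the state the two log lines above describe.
bool out_of_order_completion(const std::deque<Repop*>& queue, const Repop* repop) {
    return repop->done() && !queue.empty() && queue.front() != repop;
}
```

With a queue holding rep_tid=188 (still waiting on 3 and 10) followed by rep_tid=189 (done), `out_of_order_completion` is true for 189, which matches the "removing ... rep_tid=189" and "q front is ... rep_tid=188" lines in the log.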
Updated by Sage Weil over 13 years ago
This problem was in v0.22, but fixed in v0.22.1. Can you try with the latest testing (v0.22.2) or unstable?
Updated by Wido den Hollander over 13 years ago
- Status changed from New to Closed
Indeed, my build system was still building the rc branch, oops!