Bug #326: OSD crash PG::IndexedLog::unindex
Status: Closed
% Done: 0%
Description
I've just seen this crash on one of my OSDs running the latest unstable.
I have no idea what went wrong (I was just testing with the RADOS gateway, changing bucket ACLs); I simply saw that the OSD was down:
Core was generated by `/usr/bin/cosd -i 25 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f608c865a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f608c865a75 in raise () from /lib/libc.so.6
#1  0x00007f608c8695c0 in abort () from /lib/libc.so.6
#2  0x00007f608d11a8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f608d118d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f608d118d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f608d118e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005bdc08 in ceph::__ceph_assert_fail (assertion=0x5f18fd "caller_ops.count(e.reqid)", file=<value optimized out>, line=437, func=<value optimized out>) at common/assert.cc:30
#7  0x000000000052975e in PG::IndexedLog::unindex (this=0x2817bc0, t=<value optimized out>, s=...) at osd/PG.h:437
#8  PG::IndexedLog::trim (this=0x2817bc0, t=<value optimized out>, s=...) at osd/PG.cc:137
#9  0x000000000052e04d in PG::trim (this=0x2817970, t=..., trim_to=...) at osd/PG.cc:2049
#10 0x0000000000487f51 in ReplicatedPG::log_op (this=0x2817970, logv=..., trim_to=..., t=...) at osd/ReplicatedPG.cc:2028
#11 0x0000000000498462 in ReplicatedPG::do_op (this=0x2817970, op=0x31b9240) at osd/ReplicatedPG.cc:657
#12 0x00000000004d5a65 in OSD::dequeue_op (this=0x2616010, pg=0x2817970) at osd/OSD.cc:4653
#13 0x00000000005be2cf in ThreadPool::worker (this=0x26164d8) at common/WorkQueue.cc:44
#14 0x00000000004f7cfd in ThreadPool::WorkThread::entry() ()
#15 0x000000000046ccca in Thread::_entry_func (arg=0x4f06) at ./common/Thread.h:39
#16 0x00007f608d6f89ca in start_thread () from /lib/libpthread.so.0
#17 0x00007f608c9186cd in clone () from /lib/libc.so.6
#18 0x0000000000000000 in ?? ()
(gdb) quit
10.07.30_19:47:47.549862 7f607f7fe710 osd25 3792 pg[1.758( v 3792'2549 (2948'2546,3792'2549] n=58 ec=2 les=3790 3789/3789/3763) [25,28] r=0 mlcod 0'0 active+clean] CEPH_OSD_OP_READ
10.07.30_19:47:47.549940 7f607f7fe710 osd25 3792 pg[1.758( v 3792'2549 (2948'2546,3792'2549] n=58 ec=2 les=3790 3789/3789/3763) [25,28] r=0 mlcod 0'0 active+clean] read got 951 / 951 bytes from obj 100001174aa.00000000/head
10.07.30_19:47:47.550123 7f607f7fe710 osd25 3792 pg[1.3f6( v 3792'1946 (2950'1943,3792'1946] n=113 ec=2 les=3790 3789/3789/3763) [25,28] r=0 mlcod 2950'1943 active+clean] CEPH_OSD_OP_READ
osd/PG.h: In function 'void PG::IndexedLog::unindex(PG::Log::Entry&)':
osd/PG.h:437: FAILED assert(caller_ops.count(e.reqid))
 1: (PG::trim(ObjectStore::Transaction&, eversion_t)+0x5d) [0x52e04d]
 2: (ReplicatedPG::log_op(std::vector<PG::Log::Entry, std::allocator<PG::Log::Entry> >&, eversion_t, ObjectStore::Transaction&)+0x91) [0x487f51]
 3: (ReplicatedPG::do_op(MOSDOp*)+0xaa2) [0x498462]
 4: (OSD::dequeue_op(PG*)+0x405) [0x4d5a65]
 5: (ThreadPool::worker()+0x28f) [0x5be2cf]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4f7cfd]
 7: (Thread::_entry_func(void*)+0xa) [0x46ccca]
 8: (()+0x69ca) [0x7f608d6f89ca]
 9: (clone()+0x6d) [0x7f608c9186cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The logs, coredump and binary are available on logger.ceph.widodh.nl in the directory /srv/ceph/issues/osd_crash_pg_indexedlog_unindex
Note: watch the timestamp of the coredump; the logfile contains entries from after the crash, since I started the OSD again.
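For context, the failing invariant can be illustrated with a minimal sketch. This is a hypothetical simplification, not the actual PG::IndexedLog code: the real class keeps a side map (caller_ops) from each client request id (reqid) to its log entry, and unindex() asserts the entry being trimmed is still present in that map. The crash means an entry reached trim() without a matching caller_ops entry:

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>

// Hypothetical simplification of PG::IndexedLog (not the real Ceph code).
struct Entry {
    std::string reqid;   // client request id
    unsigned version;    // log version of this entry
};

struct IndexedLog {
    std::list<Entry> log;                       // ordered op log
    std::map<std::string, Entry*> caller_ops;   // reqid -> entry index

    void index(Entry &e) {
        // complementary assert, cf. osd/PG.h:429 in the osd3 trace below
        assert(caller_ops.count(e.reqid) == 0);
        caller_ops[e.reqid] = &e;
    }
    void unindex(Entry &e) {
        // the assert that fired here, cf. osd/PG.h:437
        assert(caller_ops.count(e.reqid));
        caller_ops.erase(e.reqid);
    }
    void add(const Entry &e) {
        log.push_back(e);
        index(log.back());
    }
    // trim() drops entries at or below the given version, unindexing each;
    // an entry that was never indexed trips the unindex() assert.
    void trim(unsigned to) {
        while (!log.empty() && log.front().version <= to) {
            unindex(log.front());
            log.pop_front();
        }
    }
};
```

In this model the abort in the backtrace corresponds to trim() encountering a log entry whose reqid is absent from caller_ops, i.e. the log and its index have diverged.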
Updated by Wido den Hollander over 13 years ago
Saw this crash again; I've added a new core file (core.node10.2629) to the logger machine, and also uploaded today's log to the same directory.
Updated by Sage Weil over 13 years ago
- Priority changed from Normal to High
- Target version set to v0.21.3
Reported again on the mailing list:
Date: Mon, 6 Sep 2010 17:18:04 +0800
From: Leander Yu <leander.yu@gmail.com>
To: ceph-devel@vger.kernel.org
Subject: OSD assert fail
----------------------------------------
Hi all,
I have set up a 10 osd + 2 mds + 3 mon ceph cluster. It ran OK at the beginning; however, after one day some of the OSDs crashed with the following assert failures. I am using the unstable trunk. ceph.conf is attached.
-------------- osd 3 -----------------
osd/PG.h: In function 'void PG::IndexedLog::index(PG::Log::Entry&)':
osd/PG.h:429: FAILED assert(caller_ops.count(e.reqid) == 0)
 1: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0xb06) [0x4cf426]
 2: (OSD::handle_pg_log(MOSDPGLog*)+0xa9) [0x4cf999]
 3: (OSD::_dispatch(Message*)+0x3ed) [0x4e7dfd]
 4: (OSD::ms_dispatch(Message*)+0x39) [0x4e86c9]
 5: (SimpleMessenger::dispatch_entry()+0x789) [0x46b5f9]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45849c]
 7: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
 8: (()+0x6a3a) [0x7f69fd39ea3a]
 9: (clone()+0x6d) [0x7f69fc5bc77d]
-------------- osd 7 --------------------
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_pull(MOSDSubOp*)':
osd/ReplicatedPG.cc:3021: FAILED assert(r == 0)
 1: (OSD::dequeue_op(PG*)+0x344) [0x4e6fd4]
 2: (ThreadPool::worker()+0x28f) [0x5b5a9f]
 3: (ThreadPool::WorkThread::entry()+0xd) [0x4f0acd]
 4: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
 5: (()+0x6a3a) [0x7efff4f12a3a]
 6: (clone()+0x6d) [0x7efff413077d]
Please let me know if you need more information. I am keeping the environment around to collect more debug data. Thanks.
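The osd3 failure above is the mirror image of the original crash: index() asserts the reqid is *not* yet present, and it fired while merging a peer's log in _process_pg_info / handle_pg_log. A hedged sketch of that precondition check, with hypothetical names (find_duplicate_reqids is not a Ceph function, just an illustration of what merging a peer log with an already-indexed reqid would violate):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns the peer reqids that would trip
// assert(caller_ops.count(e.reqid) == 0) if the peer's log entries
// were indexed into an already-populated caller_ops map.
std::vector<std::string> find_duplicate_reqids(
        const std::map<std::string, unsigned> &caller_ops,
        const std::vector<std::string> &peer_reqids) {
    std::vector<std::string> dups;
    for (const std::string &r : peer_reqids)
        if (caller_ops.count(r))   // already indexed locally -> assert would fire
            dups.push_back(r);
    return dups;
}
```

In other words, both asserts guard the same reqid-to-entry index from opposite sides: one fires when an expected entry is missing (trim path), the other when a duplicate appears (log-merge path), which is consistent with the index getting out of sync with the log.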
Updated by Sage Weil over 13 years ago
From the assert line numbers this looks like the unstable branch.
Updated by Sage Weil over 13 years ago
- Target version changed from v0.21.3 to v0.21.4
Updated by Sage Weil over 13 years ago
- Status changed from New to Resolved
Updated by Sage Weil over 13 years ago
- Target version changed from v0.21.4 to v0.22