Bug #326

OSD crash PG::IndexedLog::unindex

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%

Description

I've just seen this crash on one of my OSDs running the latest unstable.

I have no idea what went wrong (I was just testing with the RADOS gateway, changing bucket ACLs); I just saw that the OSD was down:

Core was generated by `/usr/bin/cosd -i 25 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f608c865a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f608c865a75 in raise () from /lib/libc.so.6
#1  0x00007f608c8695c0 in abort () from /lib/libc.so.6
#2  0x00007f608d11a8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f608d118d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f608d118d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f608d118e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005bdc08 in ceph::__ceph_assert_fail (assertion=0x5f18fd "caller_ops.count(e.reqid)", 
    file=<value optimized out>, line=437, func=<value optimized out>) at common/assert.cc:30
#7  0x000000000052975e in PG::IndexedLog::unindex (this=0x2817bc0, t=<value optimized out>, s=...) at osd/PG.h:437
#8  PG::IndexedLog::trim (this=0x2817bc0, t=<value optimized out>, s=...) at osd/PG.cc:137
#9  0x000000000052e04d in PG::trim (this=0x2817970, t=..., trim_to=...) at osd/PG.cc:2049
#10 0x0000000000487f51 in ReplicatedPG::log_op (this=0x2817970, logv=..., trim_to=..., t=...) at osd/ReplicatedPG.cc:2028
#11 0x0000000000498462 in ReplicatedPG::do_op (this=0x2817970, op=0x31b9240) at osd/ReplicatedPG.cc:657
#12 0x00000000004d5a65 in OSD::dequeue_op (this=0x2616010, pg=0x2817970) at osd/OSD.cc:4653
#13 0x00000000005be2cf in ThreadPool::worker (this=0x26164d8) at common/WorkQueue.cc:44
#14 0x00000000004f7cfd in ThreadPool::WorkThread::entry() ()
#15 0x000000000046ccca in Thread::_entry_func (arg=0x4f06) at ./common/Thread.h:39
#16 0x00007f608d6f89ca in start_thread () from /lib/libpthread.so.0
#17 0x00007f608c9186cd in clone () from /lib/libc.so.6
#18 0x0000000000000000 in ?? ()
(gdb) quit
10.07.30_19:47:47.549862 7f607f7fe710 osd25 3792 pg[1.758( v 3792'2549 (2948'2546,3792'2549] n=58 ec=2 les=3790 3789/3789/3763) [25,28] r=0 mlcod 0'0 active+clean] CEPH_OSD_OP_READ
10.07.30_19:47:47.549940 7f607f7fe710 osd25 3792 pg[1.758( v 3792'2549 (2948'2546,3792'2549] n=58 ec=2 les=3790 3789/3789/3763) [25,28] r=0 mlcod 0'0 active+clean]  read got 951 / 951 bytes from obj 100001174aa.00000000/head
10.07.30_19:47:47.550123 7f607f7fe710 osd25 3792 pg[1.3f6( v 3792'1946 (2950'1943,3792'1946] n=113 ec=2 les=3790 3789/3789/3763) [25,28] r=0 mlcod 2950'1943 active+clean] CEPH_OSD_OP_READ
osd/PG.h: In function 'void PG::IndexedLog::unindex(PG::Log::Entry&)':
osd/PG.h:437: FAILED assert(caller_ops.count(e.reqid))
 1: (PG::trim(ObjectStore::Transaction&, eversion_t)+0x5d) [0x52e04d]
 2: (ReplicatedPG::log_op(std::vector<PG::Log::Entry, std::allocator<PG::Log::Entry> >&, eversion_t, ObjectStore::Transaction&)+0x91) [0x487f51]
 3: (ReplicatedPG::do_op(MOSDOp*)+0xaa2) [0x498462]
 4: (OSD::dequeue_op(PG*)+0x405) [0x4d5a65]
 5: (ThreadPool::worker()+0x28f) [0x5be2cf]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4f7cfd]
 7: (Thread::_entry_func(void*)+0xa) [0x46ccca]
 8: (()+0x69ca) [0x7f608d6f89ca]
 9: (clone()+0x6d) [0x7f608c9186cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The logs, coredump and binary are available on logger.ceph.widodh.nl in the directory /srv/ceph/issues/osd_crash_pg_indexedlog_unindex

Note: watch the timestamp of the coredump; the logfile contains entries after the crash, since I started the OSD again.


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #327: OSD crash PG::IndexedLog::print (Closed, 07/30/2010)

Actions #1

Updated by Wido den Hollander over 13 years ago

Saw this crash again, just added a new core file (core.node10.2629) to the logger machine. Also uploaded the log from today in the same directory.

Actions #2

Updated by Sage Weil over 13 years ago

  • Priority changed from Normal to High
  • Target version set to v0.21.3

Reported again on ML:

Date: Mon, 6 Sep 2010 17:18:04 +0800
From: Leander Yu <leander.yu@gmail.com>
To: ceph-devel@vger.kernel.org
Subject: OSD assert fail
Parts/Attachments:
   1 Shown    ~37 lines  Text
   2          942 bytes  Application
----------------------------------------

Hi all,
I have set up a Ceph cluster with 10 OSDs, 2 MDSs, and 3 monitors. It ran fine at
the beginning; however, after one day, some of the OSDs crashed with the
following assert failures.
I am using the unstable trunk. ceph.conf is attached.

-------------- osd 3 -----------------
osd/PG.h: In function 'void PG::IndexedLog::index(PG::Log::Entry&)':
osd/PG.h:429: FAILED assert(caller_ops.count(e.reqid) == 0)
 1: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&,
PG::Missing&, std::map<int, MOSDPGInfo*, std::less<int>,
std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0xb06)
[0x4cf426]
 2: (OSD::handle_pg_log(MOSDPGLog*)+0xa9) [0x4cf999]
 3: (OSD::_dispatch(Message*)+0x3ed) [0x4e7dfd]
 4: (OSD::ms_dispatch(Message*)+0x39) [0x4e86c9]
 5: (SimpleMessenger::dispatch_entry()+0x789) [0x46b5f9]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45849c]
 7: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
 8: (()+0x6a3a) [0x7f69fd39ea3a]
 9: (clone()+0x6d) [0x7f69fc5bc77d]

-------------- osd 7 --------------------
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_pull(MOSDSubOp*)':
osd/ReplicatedPG.cc:3021: FAILED assert(r == 0)
 1: (OSD::dequeue_op(PG*)+0x344) [0x4e6fd4]
 2: (ThreadPool::worker()+0x28f) [0x5b5a9f]
 3: (ThreadPool::WorkThread::entry()+0xd) [0x4f0acd]
 4: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
 5: (()+0x6a3a) [0x7efff4f12a3a]
 6: (clone()+0x6d) [0x7efff413077d]

Please let me know if you need more information. I have kept the
environment around for collecting more debug data.

Thanks.

Actions #3

Updated by Sage Weil over 13 years ago

From the assert line numbers this looks like the unstable branch.

Actions #4

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.21.3 to v0.21.4

Actions #5

Updated by Sage Weil over 13 years ago

  • Status changed from New to Resolved

Actions #6

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.21.4 to v0.22