Project

General

Profile

Actions

Bug #7406

closed

Seg fault in find_object_context()in recent master rados run

Added by David Zafman about 10 years ago. Updated about 10 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On commit 7fe10f1271dd94d4c5014a35c6362a7772dc4508

/a/teuthology-2014-02-09_23:00:13-rados-master-testing-basic-plana/74133

osd.0 seg faulted

#0 0x00007f6c0ad37b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1 0x000000000091bf6e in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2 handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3 <signal handler called>
#4 ReplicatedPG::find_object_context (this=0x2b5d000, oid=..., pobc=0x7f6bf5e8b200, can_create=<optimized out>, pmissing=0x7f6bf5e8aea0) at osd/ReplicatedPG.cc:6775
#5 0x000000000088c724 in ReplicatedPG::do_op (this=0x2b5d000, op=...) at osd/ReplicatedPG.cc:1170
#6 0x0000000000833d22 in ReplicatedPG::do_request (this=0x2b5d000, op=..., handle=...) at osd/ReplicatedPG.cc:1027
#7 0x00000000006081ba in OSD::dequeue_op (this=0x2657000, pg=..., op=..., handle=...) at osd/OSD.cc:7239
#8 0x0000000000621608 in OSD::OpWQ::_process (this=0x2657e28, pg=..., handle=...) at osd/OSD.cc:7209
#9 0x00000000006634ec in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (
this=0x2657e28, handle=...) at ./common/WorkQueue.h:190
#10 0x00000000009e91b6 in ThreadPool::worker (this=0x2657470, wt=0x2692e40) at common/WorkQueue.cc:125
#11 0x00000000009eafc0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:317
#12 0x00007f6c0ad2fe9a in start_thread (arg=0x7f6bf5e8c700) at pthread_create.c:308
#13 0x00007f6c094f43fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()
(gdb) f 4
#4 ReplicatedPG::find_object_context (this=0x2b5d000, oid=..., pobc=0x7f6bf5e8b200, can_create=<optimized out>, pmissing=0x7f6bf5e8aea0) at osd/ReplicatedPG.cc:6775
6775 in osd/ReplicatedPG.cc
(gdb) p obc
$13 = {<std::tr1::__shared_ptr<ObjectContext, (_gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0x2c872c0, _M_refcount = {_M_pi = 0x2c86000}}, <No data fields>}
(gdb) p ((ObjectContext *)0x2c872c0)->obs.oi.snaps
$15 = {<std::_Vector_base<snapid_t, std::allocator<snapid_t> >> = {
_M_impl = {<std::allocator<snapid_t>> = {<
_gnu_cxx::new_allocator<snapid_t>> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0,
_M_end_of_storage = 0x0}}, <No data fields>}

Lines 6774 and 6775. Not sure why it didn't crash on 6774 with snaps.size() == 0 it didn't crash on snaps[-1] reference but did on snaps0 unless this was optimized in a different order:

snapid_t first = obc->obs.oi.snaps[obc->obs.oi.snaps.size()-1];
snapid_t last = obc->obs.oi.snaps[0];

Nothing in the last section of code in find_object_context() has changed recently:

c6b73eb4 src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-30 12:52:20 -0800  6763)   if (!obc->ssc) {
c6b73eb4 src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-30 12:52:20 -0800  6764)     obc->ssc = ssc;
c6b73eb4 src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-30 12:52:20 -0800  6765)   } else {
c6b73eb4 src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-30 12:52:20 -0800  6766)     assert(obc->ssc == ssc);
c6b73eb4 src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-30 12:52:20 -0800  6767)     put_snapset_context(ssc);
c6b73eb4 src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-30 12:52:20 -0800  6768)   }
c3c1541c src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-23 17:25:07 -0800  6769)   ssc = 0;
c2012f3c src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-22 11:43:38 -0700  6770)
c2012f3c src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-22 11:43:38 -0700  6771)   // clone
c3c1541c src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-23 17:25:07 -0800  6772)   dout(20) << "find_object_context  " << soid << " snaps " << obc->obs.oi.snaps
c3c1541c src/osd/ReplicatedPG.cc        (Sage Weil         2013-12-23 17:25:07 -0800  6773)        << dendl;
0ebda4a9 src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-27 21:43:41 -0700  6774)   snapid_t first = obc->obs.oi.snaps[obc->obs.oi.snaps.size()-1];
0ebda4a9 src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-27 21:43:41 -0700  6775)   snapid_t last = obc->obs.oi.snaps[0];
5492cbdf src/osd/ReplicatedPG.cc        (Samuel Just       2011-07-13 11:02:56 -0700  6776)   if (first <= oid.snap) {
bc1782a1 src/osd/ReplicatedPG.cc        (Sage Weil         2011-06-16 19:42:48 -0700  6777)     dout(20) << "find_object_context  " << soid << " [" << first << "," << last
5492cbdf src/osd/ReplicatedPG.cc        (Samuel Just       2011-07-13 11:02:56 -0700  6778)          << "] contains " << oid.snap << " -- HIT " << obc->obs << dendl;
c2012f3c src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-22 11:43:38 -0700  6779)     *pobc = obc;
c2012f3c src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-22 11:43:38 -0700  6780)     return 0;
c2012f3c src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-22 11:43:38 -0700  6781)   } else {
bc1782a1 src/osd/ReplicatedPG.cc        (Sage Weil         2011-06-16 19:42:48 -0700  6782)     dout(20) << "find_object_context  " << soid << " [" << first << "," << last
5492cbdf src/osd/ReplicatedPG.cc        (Samuel Just       2011-07-13 11:02:56 -0700  6783)          << "] does not contain " << oid.snap << " -- DNE" << dendl;
c2012f3c src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-22 11:43:38 -0700  6784)     return -ENOENT;
c2012f3c src/osd/ReplicatedPG.cc        (Sage Weil         2009-05-22 11:43:38 -0700  6785)   }
7a1eaa98 src/osd/ReplicatedPG.cc        (Sage Weil         2008-12-03 19:46:08 -0800  6786) }

Related issues 1 (0 open1 closed)

Is duplicate of Ceph - Bug #7634: osd/ReplicatedPG.cc: 7877: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))ResolvedSamuel Just03/06/2014

Actions
Actions #1

Updated by David Zafman about 10 years ago

  • Description updated (diff)
Actions #2

Updated by David Zafman about 10 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to David Zafman

Pushed wip-7406

Actions #3

Updated by Sage Weil about 10 years ago

I don't think snaps should ever be empty.. if it is, there is a bug somewhere that is causing it. On the one hand not crashing might be a good thing, but right now I think it's more important that we find the bug that allowed it to appear empty in the first place. If this was triggered in QA then we should work on reproducing it with logs.

Actions #4

Updated by David Zafman about 10 years ago

  • Status changed from Fix Under Review to 12
Actions #5

Updated by David Zafman about 10 years ago

  • Status changed from 12 to In Progress
Actions #6

Updated by Samuel Just about 10 years ago

  • Priority changed from Normal to Urgent
Actions #7

Updated by David Zafman about 10 years ago

  • Status changed from In Progress to New
Actions #8

Updated by David Zafman about 10 years ago

  • Assignee deleted (David Zafman)
Actions #9

Updated by Samuel Just about 10 years ago

  • Status changed from New to Duplicate
Actions

Also available in: Atom PDF