Bug #1028

closed

segfault in OSDMap::object_locator_to_pg

Added by ar Fred almost 13 years ago. Updated almost 13 years ago.

Status: Resolved
Priority: Normal
Category: OSD
% Done: 0%

Description

As reported yesterday on IRC, this is a crash I get when starting an OSD.

This is at v0.27.

From the logs:

2011-04-26 19:33:53.105243 7f7ff576c700 osd0 2055 pg[3.9( v 1956'286201 (1956'286196,1956'286201]+backlog n=5 ec=2 les=2047 2051/2054/2051) [0,1] r=0 lcod 0'0 mlcod 0'0 active] oi.user_version=0'0 is_modify=1
2011-04-26 19:33:53.105300 7f7ff576c700 osd0 2055 pg[3.9( v 1956'286201 (1956'286196,1956'286201]+backlog n=5 ec=2 les=2047 2051/2054/2051) [0,1] r=0 lcod 0'0 mlcod 0'0 active] watch: ctx->obc=0x1c0b480 cookie=1 oi.version=1 ctx->at_version=2055'286202
2011-04-26 19:33:53.105315 7f7ff576c700 osd0 2055 pg[3.9( v 1956'286201 (1956'286196,1956'286201]+backlog n=5 ec=2 les=2047 2051/2054/2051) [0,1] r=0 lcod 0'0 mlcod 0'0 active] watch: oi.user_version=0
*** Caught signal (Segmentation fault) **
 in thread 0x7f7ff576c700
 ceph version  (commit:)
 1: /usr/bin/cosd() [0x642279]
 2: (()+0xfc60) [0x7f800273fc60]
 3: (OSDMap::object_locator_to_pg(object_t const&, object_locator_t const&)+0x72) [0x4d6a52]
 4: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&, ceph::buffer::list&)+0x8207) [0x4c5637]
 5: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x68) [0x4c6278]
 6: (ReplicatedPG::do_op(MOSDOp*)+0x97f) [0x4c7c3f]
 7: (OSD::dequeue_op(PG*)+0x36d) [0x51050d]
 8: (ThreadPool::worker()+0x2a2) [0x626fa2]
 9: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 10: (()+0x6d8c) [0x7f8002736d8c]
 11: (clone()+0x6d) [0x7f800138404d]

What GDB has to say:

#0  0x00007f104e54db3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007f104e54db3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000641a12 in reraise_fatal (signum=11) at common/signal.cc:63
#2  0x000000000064248c in handle_fatal_signal (signum=11) at common/signal.cc:110
#3  <signal handler called>
#4  0x00000000004d6a52 in OSDMap::object_locator_to_pg (this=0x1cb7900, oid=..., loc=...) at osd/OSDMap.h:748
#5  0x00000000004c5637 in ReplicatedPG::do_osd_ops (this=0x2260000, ctx=0x467d678, ops=..., odata=...) at osd/ReplicatedPG.cc:1617
#6  0x00000000004c6278 in ReplicatedPG::prepare_transaction (this=0x2260000, ctx=0x3c81b00) at osd/ReplicatedPG.cc:2240
#7  0x00000000004c7c3f in ReplicatedPG::do_op (this=0x2260000, op=0x4711000) at osd/ReplicatedPG.cc:501
#8  0x000000000051050d in OSD::dequeue_op (this=0x1ca7000, pg=0x2260000) at osd/OSD.cc:5437
#9  0x0000000000626fa2 in ThreadPool::worker (this=0x1ca73f0) at common/WorkQueue.cc:44
#10 0x0000000000529f1d in ThreadPool::WorkThread::entry (this=<value optimized out>) at ./common/WorkQueue.h:113
#11 0x00007f104e544d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f104d19204d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

(gdb) f 4
#4  0x00000000004d6a52 in OSDMap::object_locator_to_pg (this=0x1cb7900, oid=..., loc=...) at osd/OSDMap.h:748
748     osd/OSDMap.h: No such file or directory.
        in osd/OSDMap.h

(gdb) p oid
$1 = (const object_t &) @0x2e0a248: {name = {static npos = <optimized out>, 
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x467d678 "munin.rbd"}}}

(gdb) p loc
$2 = (const object_locator_t &) @0x2e0a258: {pool = -1, preferred = -1, key = {static npos = <optimized out>, 
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x8e4778 ""}}}
(gdb) f 7
#7  0x00000000004c7c3f in ReplicatedPG::do_op (this=0x2260000, op=0x4711000) at osd/ReplicatedPG.cc:501
501     osd/ReplicatedPG.cc: No such file or directory.
        in osd/ReplicatedPG.cc

(gdb)  p op->oloc
$3 = {pool = 3, preferred = -1, key = {static npos = <optimized out>, 
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x8e4778 ""}}}


Files

osd.0.log.gz (8.17 KB), ar Fred, 05/08/2011 11:30 PM
osd.1.log.gz (2.73 KB), ar Fred, 05/08/2011 11:30 PM
osd.2.log.gz (1.96 KB), ar Fred, 05/08/2011 11:30 PM
Actions #1

Updated by Sage Weil almost 13 years ago

  • Category set to OSD
  • Target version set to v0.27.1

Added some debug checks to the code to track this one down: 85292b367b0e6e6d8963de32ad198482500c887f

Actions #2

Updated by Sage Weil almost 13 years ago

  • Status changed from New to In Progress

Actions #3

Updated by ar Fred almost 13 years ago

Cherry-picked 85292b367b0e6e6d8963de32ad198482500c887f into the stable branch; here are the logs. I kept the core files, so do not hesitate to ask if you need any data from gdb!

Thanks!

Actions #4

Updated by Sage Weil almost 13 years ago

  • Assignee set to Sage Weil
Actions #5

Updated by Sage Weil almost 13 years ago

The problem is that the locator stored in the object_info_t on disk is wrong. Can you say anything about when the objects were written? Is this a really old file system that got upgraded, by any chance?

This should get you up and running:


diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 95473f4..15dd2a5 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -2726,6 +2726,13 @@ ReplicatedPG::ObjectContext *ReplicatedPG::get_object_context(const sobject_t& s
     }
     else {
       object_info_t oi(bv);
+
+      // if the on-disk oloc is bad/undefined, set up the pool value
+      if (oi.oloc.get_pool() < 0) {
+       oi.oloc.pool = info.pgid.pool();
+       oi.oloc.preferred = info.pgid.preferred();
+      }
+
       SnapSetContext *ssc = NULL;
       if (can_create)
        ssc = get_snapset_context(soid.oid, true);
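
The fallback in the patch above can be illustrated in isolation. The following is only a minimal sketch, not Ceph code: `ObjectLocator`, `PgId` and `repair_locator` are hypothetical, simplified stand-ins for `object_locator_t`, the PG id and the added `if` block. It assumes, as the gdb output in the description suggests, that a negative pool value marks an undefined locator.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for object_locator_t (not the real type).
struct ObjectLocator {
  std::int64_t pool = -1;       // -1: undefined, as seen in the gdb dump of `loc`
  std::int64_t preferred = -1;
};

// Hypothetical stand-in for the PG id that owns the object.
struct PgId {
  std::int64_t pool;
  std::int64_t preferred;
};

// If the locator read from disk has no valid pool, fill it in from the PG
// the object lives in. Returns true when a repair was applied.
inline bool repair_locator(ObjectLocator& oloc, const PgId& pgid) {
  if (oloc.pool >= 0) {
    return false;               // locator already usable, leave it untouched
  }
  oloc.pool = pgid.pool;
  oloc.preferred = pgid.preferred;
  return true;
}
```

With a PG in pool 3 and a locator whose pool is -1, the call rewrites the locator to pool 3, which matches the value gdb printed for `op->oloc`. A locator that already carries a non-negative pool is left unchanged.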

Actions #6

Updated by ar Fred almost 13 years ago

Thank you for the patch, compiling right now.

This is indeed an old FS: it was created approximately a year ago and has been upgraded regularly since then!

Actions #7

Updated by ar Fred almost 13 years ago

OK, it seems fixed. Now back to #1022.

Actions #8

Updated by Sage Weil almost 13 years ago

  • Status changed from In Progress to Resolved