Bug #339

closed

OSD crash: ReplicatedPG::sub_op_modify

Added by Wido den Hollander over 13 years ago. Updated about 13 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%


Description

Two OSDs (osd4 and osd5) were killed by the OOM killer. After I restarted both, one of them crashed with the following message:

Core was generated by `/usr/bin/cosd -i 5 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f9bdc464a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f9bdc464a75 in raise () from /lib/libc.so.6
#1  0x00007f9bdc4685c0 in abort () from /lib/libc.so.6
#2  0x00007f9bdcd198e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f9bdcd17d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f9bdcd17d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f9bdcd17e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005c02b8 in ceph::__ceph_assert_fail (assertion=0x5eb240 "!missing.is_missing(soid)", 
    file=<value optimized out>, line=2776, func=<value optimized out>) at common/assert.cc:30
#7  0x00000000004905e9 in ReplicatedPG::sub_op_modify (this=<value optimized out>, op=0x7f9bc402a810)
    at osd/ReplicatedPG.cc:2776
#8  0x00000000004d94a4 in OSD::dequeue_op (this=0xe64120, pg=0xfc8ba0) at osd/OSD.cc:4740
#9  0x00000000005c097f in ThreadPool::worker (this=0xe64600) at common/WorkQueue.cc:44
#10 0x00000000004f89ad in ThreadPool::WorkThread::entry() ()
#11 0x000000000046d32a in Thread::_entry_func (arg=0x4cc0) at ./common/Thread.h:39
#12 0x00007f9bdd2f79ca in start_thread () from /lib/libpthread.so.0
#13 0x00007f9bdc5176cd in clone () from /lib/libc.so.6
#14 0x0000000000000000 in ?? ()
(gdb) 

At the time of the crash the cluster was degraded, since the OSDs had been down for some time.

I've uploaded the logs, core and binary to logger.ceph.widodh.nl, under /srv/ceph/issues/osd_crash_ReplicatedPG_sub_op_modify

After this crash I tried to start the OSD again with a higher log level (20), but it didn't crash again.

Actions #1

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce

The missing map on the replica apparently showed the object as missing.

I audited the primary code, and it should recover the object before proceeding. That suggests the replica's missing map is somehow out of sync, but the push code appears to cover that case as well.

So I basically can't see how this one happened without more log detail! :(
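
For readers unfamiliar with the assertion, here is a minimal sketch of what `!missing.is_missing(soid)` checks on the replica. The types below (`EVersion`, `ObjectId`, `MissingItem`, `Missing`) and the helper `can_apply_sub_op` are simplified, hypothetical stand-ins and not the real Ceph classes; only the name `is_missing` and the `need`/`have` fields are taken from the gdb output in this report.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Simplified stand-ins for illustration only; NOT the real Ceph types.
struct EVersion {
    uint64_t version;
    uint32_t epoch;
};

struct ObjectId {
    std::string name;
    uint64_t snap;
    // Order by (name, snap) so ObjectId can be used as a map key.
    bool operator<(const ObjectId& o) const {
        return std::tie(name, snap) < std::tie(o.name, o.snap);
    }
};

struct MissingItem {
    EVersion need;  // version the replica still has to recover
    EVersion have;  // version currently held (0/0 if none)
};

struct Missing {
    std::map<ObjectId, MissingItem> missing;

    // True while the object is still listed as needing recovery.
    bool is_missing(const ObjectId& oid) const {
        return missing.count(oid) != 0;
    }
    void add(const ObjectId& oid, EVersion need, EVersion have) {
        missing[oid] = MissingItem{need, have};
    }
    // Called once the object has been recovered (e.g. after a push).
    void got(const ObjectId& oid) { missing.erase(oid); }
};

// Hypothetical guard mirroring the assertion text from the backtrace:
// a replicated write may only be applied to an object that is not missing.
bool can_apply_sub_op(const Missing& m, const ObjectId& soid) {
    return !m.is_missing(soid);
}
```

In this sketch the guard returns false for as long as an entry for the object remains in the map, which corresponds to the state in which the assertion fired; it returns true again once `got()` removes the entry.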

Actions #2

Updated by Henry Chang about 13 years ago

Hit this bug yesterday. The gdb output:

#3  0x00007fd772bfd6c5 in raise () from /lib64/libc.so.6
#4  0x00007fd772bfeea5 in abort () from /lib64/libc.so.6
#5  0x00007fd7734a37b5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#6  0x00007fd7734a1886 in ?? () from /usr/lib64/libstdc++.so.6
#7  0x00007fd7734a18b3 in std::terminate() () from /usr/lib64/libstdc++.so.6
#8  0x00007fd7734a19ae in __cxa_throw () from /usr/lib64/libstdc++.so.6
#9  0x00000000005db978 in ceph::__ceph_assert_fail (assertion=0x608171 "!missing.is_missing(soid)", file=<value optimized out>, line=2579, func=<value optimized out>)
    at common/assert.cc:30
#10 0x0000000000491c09 in ReplicatedPG::sub_op_modify (this=0x1968000, op=0x2a1c600) at osd/ReplicatedPG.cc:2579
#11 0x00000000004d5684 in OSD::dequeue_op (this=0x1567000, pg=0x1968000) at osd/OSD.cc:5146
#12 0x00000000005dc5b3 in ThreadPool::worker (this=0x15673e8) at common/WorkQueue.cc:44
#13 0x00000000005065ad in ThreadPool::WorkThread::entry() ()
#14 0x000000000047a42a in Thread::_entry_func (arg=<value optimized out>) at ./common/Thread.h:41
#15 0x00007fd77408ca3a in start_thread () from /lib64/libpthread.so.0
#16 0x00007fd772ca977d in clone () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()
(gdb) frame 10
#10 0x0000000000491c09 in ReplicatedPG::sub_op_modify (this=0x1968000, op=0x2a1c600) at osd/ReplicatedPG.cc:2579
2579    osd/ReplicatedPG.cc: No such file or directory.
        in osd/ReplicatedPG.cc
(gdb) print soid
$1 = (const sobject_t &) @0x2a1c798: {oid = {name = "10000004f47.000001b0"}, snap = {val = 18446744073709551614}}
(gdb) print acting
$2 = std::vector of length 2, capacity 2 = {0, 3}
(gdb) print info
$3 = {pgid = {v = {preferred = {v = 65535}, ps = {v = 99}, pool = {v = 0}}}, last_update = {version = 216, epoch = 611, __pad = 0}, last_complete = {version = 137,
    epoch = 569, __pad = 0}, log_tail = {version = 167, epoch = 600, __pad = 0}, log_backlog = true, purged_snaps = {_size = 0, m = std::map with 0 elements},
  stats = {version = {version = 215, epoch = 611, __pad = 0}, reported = {version = 934, epoch = 297, __pad = 0}, state = 0, log_start = {version = 0, epoch = 0,
      __pad = 0}, ondisk_log_start = {version = 0, epoch = 0, __pad = 0}, created = 2, parent = {v = {preferred = {v = 0}, ps = {v = 0}, pool = {v = 0}}},
    parent_split_bits = 0, last_scrub = {version = 96, epoch = 506, __pad = 0}, last_scrub_stamp = {tv = {tv_sec = 1298959219, tv_nsec = 605245000}},
    num_bytes = 75497472, num_kb = 73728, num_objects = 18, num_object_clones = 0, num_object_copies = 0, num_objects_missing_on_primary = 0,
    num_objects_degraded = 0, log_size = 0, ondisk_log_size = 0, num_rd = 843, num_rd_kb = 195984, num_wr = 216, num_wr_kb = 313336, num_objects_unfound = 0,
    up = std::vector of length 0, capacity 0, acting = std::vector of length 0, capacity 0}, history = {epoch_created = 2, last_epoch_started = 611,
    last_epoch_clean = 3, last_epoch_split = 609, same_up_since = 601, same_acting_since = 607, same_primary_since = 297, last_scrub = {version = 96, epoch = 506,
      __pad = 0}, last_scrub_stamp = {tv = {tv_sec = 1298959219, tv_nsec = 605245000}}}}
(gdb) print history
No symbol "history" in current context.
(gdb) print missing
$4 = {missing = std::map with 1 elements = {[{oid = {name = "10000004f47.000001b0"}, snap = {val = 18446744073709551614}}] = {need = {version = 117, epoch = 569,
        __pad = 0}, have = {version = 0, epoch = 0, __pad = 0}}}, rmissing = std::map with 1 elements = {[{version = 117, epoch = 569, __pad = 0}] = {oid = {name =
    "10000004f47.000001b0"}, snap = {val = 18446744073709551614}}}}

I put the log files, cosd binary and core dump on http://www.megaupload.com/?d=GRORERXY
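
A note on reading the `print soid` output above: the `snap` value 18446744073709551614 equals 2^64 - 2, i.e. `(uint64_t)-2`. The sketch below decodes it under the assumption that this matches the sentinel Ceph calls `CEPH_NOSNAP` (the head object) and that `(uint64_t)-1` is `CEPH_SNAPDIR`; the constant names `kNoSnap` and `kSnapDir` and the function `describe_snap` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed sentinel values (believed to correspond to CEPH_NOSNAP and
// CEPH_SNAPDIR); the names here are hypothetical.
constexpr uint64_t kNoSnap  = static_cast<uint64_t>(-2);
constexpr uint64_t kSnapDir = static_cast<uint64_t>(-1);

// Turn a raw snap id, as printed by gdb, into a readable label.
std::string describe_snap(uint64_t val) {
    if (val == kNoSnap)  return "head (no snapshot)";
    if (val == kSnapDir) return "snapdir";
    return "snapshot " + std::to_string(val);
}
```

Under that assumption, the object in the missing map is the head version of `10000004f47.000001b0` rather than a snapshot clone.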
