Bug #10042 (closed): OSD crash doing object recovery with EC pool

Added by Guang Yang over 9 years ago. Updated over 9 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed one OSD crash with the following assertion failure:

 0> 2014-11-07 22:17:55.349141 7f59a6b0d700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f59a6b0d700 time 2014-11-07 22:17:55.305306
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
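
For context, this assert compares the number of chunk bytes actually read back for the shard (pop.data.length()) with the number implied by the advance in logical recovery progress. A minimal sketch of that comparison follows, assuming the usual stripe_info semantics (stripe_width = k * chunk_size); the k=4 profile and 64 KiB chunk size are illustrative assumptions, not values from this cluster:

#include <cassert>
#include <cstdint>
#include <iostream>

using u64 = std::uint64_t;

// Illustrative stand-in for ECUtil::stripe_info_t (osd/ECUtil.h).
struct stripe_info {
  u64 chunk_size;    // bytes each shard contributes per stripe
  u64 stripe_width;  // chunk_size * k, counting data chunks only

  // Map a stripe-aligned logical offset to an offset within one chunk.
  u64 aligned_logical_offset_to_chunk_offset(u64 logical) const {
    assert(logical % stripe_width == 0);  // callers pass aligned offsets
    return (logical / stripe_width) * chunk_size;
  }
};

int main() {
  stripe_info sinfo{65536, 4 * 65536};  // hypothetical k=4, 64 KiB chunks

  // Values taken from the gdb session below:
  u64 recovered_before = 0;        // op.recovery_progress.data_recovered_to
  u64 recovered_after = 3407872;   // after_progress.data_recovered_to
  u64 actual_chunk_bytes = 0;      // pop.data.length() of the empty bufferlist

  u64 expected = sinfo.aligned_logical_offset_to_chunk_offset(
      recovered_after - recovered_before);
  std::cout << "expected " << expected << " chunk bytes, got "
            << actual_chunk_bytes << "\n";
  // The comparison made at ECBackend.cc:529: with an empty read reply it
  // fails, and the OSD aborts just as in the log above.
  assert(actual_chunk_bytes == expected);
}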

Sadly there is no verbose log from the time of the crash; the following is the stack trace:

#11 0x000000000098621a in ECBackend::continue_recovery_op (this=0x834ab40, op=..., m=0x7fe761d06160) at osd/ECBackend.cc:529
    in osd/ECBackend.cc
(gdb) p pop.data
$16 = {_buffers = empty std::list, _len = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0x18e131c0, ls = 0x18e131c0, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}}
(gdb) p after_progress.data_recovered_to
$17 = 3407872
(gdb) p op.recovery_progress.data_recovered_to
$18 = 0
(gdb) f 12
#12 0x0000000000987261 in ECBackend::handle_recovery_read_complete (this=0x834ab40, hoid=..., to_read=..., attrs=..., m=0x7fe761d06160) at osd/ECBackend.cc:381
381    in osd/ECBackend.cc
(gdb) p hoid
$21 = (const hobject_t &) @0xf3b40f0: {oid = {name = "default.12615.360_14117903159_5cfc884613_o.jpg"}, snap = {val = 18446744073709551614}, hash = 1258525522, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", 
  key = ""}
(gdb) p to_read
$22 = (
    boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> &) @0x2fbc3c80: {<boost::tuples::cons<unsigned long, boost::tuples::cons<unsigned long, boost::tuples::cons<std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type> > >> = {head = 0, tail = {head = 8454144, 
      tail = {head = std::map with 1 elements = {[{osd = 8, shard = 5 '\005'}] = {_buffers = empty std::list, _len = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0xc62d128, ls = 0xc62d128, off = 0, p = {
                _raw = , _off = 0, _len = 0}, p_off = 0}}}}}}, <No data fields>}

However, shard 5 on OSD 8 has data corruption: the object file on disk is empty, which explains the empty bufferlist in the read reply above. (The corruption itself is a separate issue, possibly due to a lost filestore transaction or a mis-configured RAID controller.)

-bash-4.1$ ll "obj.jpg__head_4B039352__3_ffffffffffffffff_5"
-rw-r--r-- 1 root root 0 Oct 29 21:50 obj.jpg__head_4B039352__3_ffffffffffffffff_5
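
Since this empty shard was found by hand, it may be worth sweeping the rest of the OSD's data directory for other zero-length object files. A minimal C++17 sketch; the filestore path is an assumption for this deployment:

#include <filesystem>
#include <iostream>

int main(int argc, char** argv) {
  namespace fs = std::filesystem;
  // Hypothetical filestore data dir for osd.8; pass a different path as argv[1].
  fs::path root = argc > 1 ? argv[1] : "/var/lib/ceph/osd/ceph-8/current";
  for (const auto& entry : fs::recursive_directory_iterator(
           root, fs::directory_options::skip_permission_denied)) {
    if (entry.is_regular_file() && entry.file_size() == 0)
      std::cout << entry.path() << '\n';  // candidate truncated shard file
  }
}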

I suspect this might be related to http://tracker.ceph.com/issues/8588, hit via a different code path?
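
If so, the mechanism would be the one #8588 describes: decode cannot proceed when one chunk's size differs from the others'. A toy illustration using plain XOR parity (Ceph actually uses jerasure/ISA erasure-code plugins, and the OSD asserts rather than throwing; all sizes here are made up):

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

using chunk = std::vector<unsigned char>;

// Rebuild a lost data chunk from the surviving data chunk and the parity.
chunk xor_decode(const chunk& surviving, const chunk& parity) {
  if (surviving.size() != parity.size())
    throw std::runtime_error("chunk size mismatch: cannot decode");
  chunk out(parity.size());
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = static_cast<unsigned char>(surviving[i] ^ parity[i]);
  return out;
}

int main() {
  chunk parity(4096, 0x11);
  chunk truncated;  // like the empty shard 5 file: 0 bytes instead of 4096
  try {
    xor_decode(truncated, parity);  // a real OSD asserts instead of throwing
  } catch (const std::exception& e) {
    std::cout << e.what() << '\n';
  }
}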

Ceph version: 0.80.4


Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #8588: In the erasure-coded pool, primary OSD will crash at decoding if any data chunk's size is changed (Duplicate, 06/11/2014)

#1 Updated by Loïc Dachary over 9 years ago

  • Category set to OSD
  • Status changed from New to 12
  • Assignee set to Loïc Dachary
  • Priority changed from Normal to Urgent
#2 Updated by Guang Yang over 9 years ago

Hi Loic,
I am still a little confused about what happened behind the crash (and how it relates to issue 8588); could you explain a bit further? Thanks!

#3 Updated by Loïc Dachary over 9 years ago

I'm not sure either, investigating.

#4 Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Duplicate
