Bug #10042 (closed): OSD crash doing object recovery with EC pool

Added by Guang Yang over 9 years ago. Updated over 9 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed one OSD crash with the following assertion failure:

 0> 2014-11-07 22:17:55.349141 7f59a6b0d700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f59a6b0d700 time 2014-11-07 22:17:55.305306
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
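
For context, this assert compares the number of chunk bytes actually read back for the shard (pop.data.length()) with the number implied by the advance in logical recovery progress. A minimal sketch of that comparison follows, assuming the usual stripe_info semantics (stripe_width = k * chunk_size); the k=4 profile and 64 KiB chunk size are illustrative assumptions, not values from this cluster:

#include <cassert>
#include <cstdint>
#include <iostream>

using u64 = std::uint64_t;

// Illustrative stand-in for ECUtil::stripe_info_t (osd/ECUtil.h).
struct stripe_info {
  u64 chunk_size;    // bytes each shard contributes per stripe
  u64 stripe_width;  // chunk_size * k, counting data chunks only

  // Map a stripe-aligned logical offset to an offset within one chunk.
  u64 aligned_logical_offset_to_chunk_offset(u64 logical) const {
    assert(logical % stripe_width == 0);  // callers pass aligned offsets
    return (logical / stripe_width) * chunk_size;
  }
};

int main() {
  stripe_info sinfo{65536, 4 * 65536};  // hypothetical k=4, 64 KiB chunks

  // Values taken from the gdb session below:
  u64 recovered_before = 0;        // op.recovery_progress.data_recovered_to
  u64 recovered_after = 3407872;   // after_progress.data_recovered_to
  u64 actual_chunk_bytes = 0;      // pop.data.length() of the empty bufferlist

  u64 expected = sinfo.aligned_logical_offset_to_chunk_offset(
      recovered_after - recovered_before);
  std::cout << "expected " << expected << " chunk bytes, got "
            << actual_chunk_bytes << "\n";
  // The comparison made at ECBackend.cc:529: with an empty read reply it
  // fails, and the OSD aborts just as in the log above.
  assert(actual_chunk_bytes == expected);
}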

Sadly there is no verbose log from the time of the crash; the following is the stack trace:

#11 0x000000000098621a in ECBackend::continue_recovery_op (this=0x834ab40, op=..., m=0x7fe761d06160) at osd/ECBackend.cc:529
    in osd/ECBackend.cc
(gdb) p pop.data
$16 = {_buffers = empty std::list, _len = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0x18e131c0, ls = 0x18e131c0, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}}
(gdb) p after_progress.data_recovered_to
$17 = 3407872
(gdb) p op.recovery_progress.data_recovered_to
$18 = 0
(gdb) f 12
#12 0x0000000000987261 in ECBackend::handle_recovery_read_complete (this=0x834ab40, hoid=..., to_read=..., attrs=..., m=0x7fe761d06160) at osd/ECBackend.cc:381
381    in osd/ECBackend.cc
(gdb) p hoid
$21 = (const hobject_t &) @0xf3b40f0: {oid = {name = "default.12615.360_14117903159_5cfc884613_o.jpg"}, snap = {val = 18446744073709551614}, hash = 1258525522, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", 
  key = ""}
(gdb) p to_read
$22 = (
    boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> &) @0x2fbc3c80: {<boost::tuples::cons<unsigned long, boost::tuples::cons<unsigned long, boost::tuples::cons<std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type> > >> = {head = 0, tail = {head = 8454144, 
      tail = {head = std::map with 1 elements = {[{osd = 8, shard = 5 '\005'}] = {_buffers = empty std::list, _len = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0xc62d128, ls = 0xc62d128, off = 0, p = {
                _raw = , _off = 0, _len = 0}, p_off = 0}}}}}}, <No data fields>}

However, shard 5 on OSD 8 has data corruption: the object file on disk is empty, which explains the empty bufferlist in the read reply above. (The corruption itself is a separate issue, possibly due to a lost filestore transaction or a mis-configured RAID controller.)

-bash-4.1$ ll "obj.jpg__head_4B039352__3_ffffffffffffffff_5"
-rw-r--r-- 1 root root 0 Oct 29 21:50 obj.jpg__head_4B039352__3_ffffffffffffffff_5
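
Since this empty shard was found by hand, it may be worth sweeping the rest of the OSD's data directory for other zero-length object files. A minimal C++17 sketch; the filestore path is an assumption for this deployment:

#include <filesystem>
#include <iostream>

int main(int argc, char** argv) {
  namespace fs = std::filesystem;
  // Hypothetical filestore data dir for osd.8; pass a different path as argv[1].
  fs::path root = argc > 1 ? argv[1] : "/var/lib/ceph/osd/ceph-8/current";
  for (const auto& entry : fs::recursive_directory_iterator(
           root, fs::directory_options::skip_permission_denied)) {
    if (entry.is_regular_file() && entry.file_size() == 0)
      std::cout << entry.path() << '\n';  // candidate truncated shard file
  }
}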

I suspect this might be related to http://tracker.ceph.com/issues/8588, hit via a different code path?
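
If so, the mechanism would be the one #8588 describes: decode cannot proceed when one chunk's size differs from the others'. A toy illustration using plain XOR parity (Ceph actually uses jerasure/ISA erasure-code plugins, and the OSD asserts rather than throwing; all sizes here are made up):

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

using chunk = std::vector<unsigned char>;

// Rebuild a lost data chunk from the surviving data chunk and the parity.
chunk xor_decode(const chunk& surviving, const chunk& parity) {
  if (surviving.size() != parity.size())
    throw std::runtime_error("chunk size mismatch: cannot decode");
  chunk out(parity.size());
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = static_cast<unsigned char>(surviving[i] ^ parity[i]);
  return out;
}

int main() {
  chunk parity(4096, 0x11);
  chunk truncated;  // like the empty shard 5 file: 0 bytes instead of 4096
  try {
    xor_decode(truncated, parity);  // a real OSD asserts instead of throwing
  } catch (const std::exception& e) {
    std::cout << e.what() << '\n';
  }
}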

Ceph version: 0.80.4


Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #8588: In the erasure-coded pool, primary OSD will crash at decoding if any data chunk's size is changed (Duplicate, 06/11/2014)

#1 Updated by Loïc Dachary over 9 years ago

  • Category set to OSD
  • Status changed from New to 12
  • Assignee set to Loïc Dachary
  • Priority changed from Normal to Urgent
#2 Updated by Guang Yang over 9 years ago

Hi Loic,
I am still a little confused about what happened behind the crash (and how it relates to issue 8588); could you explain a bit further? Thanks!

#3 Updated by Loïc Dachary over 9 years ago

I'm not sure either, investigating.

#4 Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Duplicate
