Bug #10042
OSD crash doing object recovery with EC pool
Status: Closed
% Done: 0%
Source: Community (dev)
Severity: 3 - minor
Description
We observed one OSD crash with the following assertion failure:
0> 2014-11-07 22:17:55.349141 7f59a6b0d700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f59a6b0d700 time 2014-11-07 22:17:55.305306
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
Unfortunately there is no verbose log from the crash; the following is the stack trace:
#11 0x000000000098621a in ECBackend::continue_recovery_op (this=0x834ab40, op=..., m=0x7fe761d06160) at osd/ECBackend.cc:529
(gdb) p pop.data
$16 = {_buffers = empty std::list, _len = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0x18e131c0, ls = 0x18e131c0, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}}
(gdb) p after_progress.data_recovered_to
$17 = 3407872
(gdb) p op.recovery_progress.data_recovered_to
$18 = 0
(gdb) f 12
#12 0x0000000000987261 in ECBackend::handle_recovery_read_complete (this=0x834ab40, hoid=..., to_read=..., attrs=..., m=0x7fe761d06160) at osd/ECBackend.cc:381
(gdb) p hoid
$21 = (const hobject_t &) @0xf3b40f0: {oid = {name = "default.12615.360_14117903159_5cfc884613_o.jpg"}, snap = {val = 18446744073709551614}, hash = 1258525522, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}
(gdb) p to_read
$22 = (boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> &) @0x2fbc3c80:
{<boost::tuples::cons<unsigned long, boost::tuples::cons<unsigned long, boost::tuples::cons<std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type> > >> = {head = 0, tail = {head = 8454144, tail = {head = std::map with 1 elements = {[{osd = 8, shard = 5 '\005'}] = {_buffers = empty std::list, _len = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0xc62d128, ls = 0xc62d128, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}}}}}}, <No data fields>}
However, shard 5 on OSD 8 has data corruption (the file is empty; this is a separate issue, possibly due to a lost filestore transaction or a mis-configured RAID controller).
-bash-4.1$ ll "obj.jpg__head_4B039352__3_ffffffffffffffff_5"
-rw-r--r-- 1 root root 0 Oct 29 21:50 obj.jpg__head_4B039352__3_ffffffffffffffff_5
I think it might be related to http://tracker.ceph.com/issues/8588, reached through a different code path?
Ceph version: 0.80.4
Updated by Loïc Dachary over 9 years ago
- Category set to OSD
- Status changed from New to 12
- Assignee set to Loïc Dachary
- Priority changed from Normal to Urgent
Updated by Guang Yang over 9 years ago
Hi Loic,
I am still a little confused about what happened behind the crash (and what the relation is between this crash and issue 8588); could you explain a bit further? Thanks!
Updated by Loïc Dachary over 9 years ago
- Status changed from 12 to Duplicate