Bug #10018

OSD assertion failure if the hinfo_key xattr is not there (corrupted?) during scrubbing

Added by Guang Yang over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

80%

Source:
other
Tags:
Backport:
firefly, giant
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed an OSD crash during scrubbing on an EC pool; the crash happens when the hinfo_key xattr of the file is absent.

Is this by design? Why not treat it as a read error, as with other corruption (e.g. EIO when reading the file), so that the user can trigger a repair for the PG?

Ceph version: 0.80.4
Platform: RHEL6


Related issues 2 (1 open, 1 closed)

Related to Ceph - Bug #10017: OSD wrongly marks object as unfound if only the primary is corrupted for EC pool (Resolved, Loïc Dachary, 11/05/2014)

Related to Ceph - Feature #10044: ECUtil::HashInfoRef should have a NONE value (New, 11/10/2014)
Actions #1

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Need More Info
  • Assignee set to Loïc Dachary

Could you please attach the logs of the crashed OSD (the last 20,000 lines would be enough)?

Actions #2

Updated by Loïc Dachary over 9 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Guang Yang over 9 years ago

Sorry, we don't have a verbose log from the crash, but the following is the code that leads to it:

ECUtil::HashInfoRef ECBackend::get_hash_info(
  const hobject_t &hoid)
{
  dout(10) << __func__ << ": Getting attr on " << hoid << dendl;
  ECUtil::HashInfoRef ref = unstable_hashinfo_registry.lookup(hoid);
  if (!ref) {
    dout(10) << __func__ << ": not in cache " << hoid << dendl;
    struct stat st;
    int r = store->stat(
      hoid.is_temp() ? temp_coll : coll,
      ghobject_t(hoid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
      &st);
    ECUtil::HashInfo hinfo(ec_impl->get_chunk_count());
    if (r >= 0 && st.st_size > 0) {
      dout(10) << __func__ << ": found on disk, size " << st.st_size << dendl;
      bufferlist bl;
      r = store->getattr(
        hoid.is_temp() ? temp_coll : coll,
        ghobject_t(hoid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
        ECUtil::get_hinfo_key(),
        bl);
      if (r >= 0) {
        bufferlist::iterator bp = bl.begin();
        ::decode(hinfo, bp);
        assert(hinfo.get_total_chunk_size() == (unsigned)st.st_size);
      } else {
        assert(0 == "missing hash attr");
      }
    }
    ref = unstable_hashinfo_registry.lookup_or_create(hoid, hinfo);
  }
  return ref;
}

Here, if it fails to fetch the hinfo_key xattr, the result is an assertion failure.

Actions #4

Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to 12

Test case that reproduces the problem: https://github.com/dachary/ceph/commit/5639303646418913ba0929ce73e8a5c611901919

     0> 2014-11-06 14:21:07.217016 7f8c277c9700 -1 osd/ECBackend.cc: In function 'ECUtil::HashInfoRef ECBackend::get_hash_info(const hobject_t&)' thread 7f8c277c9700 time 2014-11-06 14:21:07.202149
osd/ECBackend.cc: 1457: FAILED assert(0 == "missing hash attr")

 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x1890195]
 2: (ECBackend::get_hash_info(hobject_t const&)+0x6ab) [0x16c411b]
 3: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2d8) [0x16c5ef6]
 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x423) [0x15eaa29]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x458) [0x1477ac6]
 6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x147beca]
 7: (PG::scrub(ThreadPool::TPHandle&)+0x71e) [0x147a0d6]
 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x12d683e]
 9: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x135f925]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x188267c]
 11: (ThreadPool::WorkThread::entry()+0x23) [0x1884485]
 12: (Thread::_entry_func(void*)+0x23) [0x187af2f]
 13: (()+0x8182) [0x7f8c3d124182]
 14: (clone()+0x6d) [0x7f8c3b4c530d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #5

Updated by Loïc Dachary over 9 years ago

Also on 0.80.7

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x164e305]
 2: ceph-osd() [0x1770b14]
 3: (()+0x10340) [0x7f8dafeaf340]
 4: (gsignal()+0x39) [0x7f8dae183f79]
 5: (abort()+0x148) [0x7f8dae187388]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8daea8f6b5]
 7: (()+0x5e836) [0x7f8daea8d836]
 8: (()+0x5e863) [0x7f8daea8d863]
 9: (()+0x5eaa2) [0x7f8daea8daa2]
 10: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x342) [0x18b989c]
 11: (ECBackend::get_hash_info(hobject_t const&)+0x6ab) [0x16ebdd9]
 12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2d8) [0x16edbce]
 13: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x423) [0x160565f]
 14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x471) [0x14a06e9]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x14a4e34]
 16: (PG::scrub(ThreadPool::TPHandle&)+0x9e9) [0x14a3007]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x12eeaf6]
 18: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x137963d]
 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x18ab672]
 20: (ThreadPool::WorkThread::entry()+0x23) [0x18ad8df]
 21: (Thread::entry_wrapper()+0x79) [0x18a3eab]
 22: (Thread::_entry_func(void*)+0x18) [0x18a3e28]
 23: (()+0x8182) [0x7f8dafea7182]
 24: (clone()+0x6d) [0x7f8dae24830d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

but not on master.

Actions #6

Updated by Loïc Dachary over 9 years ago

Actually it happens on master as well; my test was incorrect and is now fixed: https://github.com/dachary/ceph/commit/312cdad05fc2cafa3d4f6c3f6ae49f53dcc13bf4

osd/ECBackend.cc: 1461: FAILED assert(0 == "missing hash attr")

 ceph version 0.87-610-gefe243d (efe243ddd2fa62c19e455d77cdedcc5dda6bafd6)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x16bc14b]
 2: ceph-osd() [0x17e6775]
 3: (()+0x10340) [0x7f9c73376340]
 4: (gsignal()+0x39) [0x7f9c7168ff79]
 5: (abort()+0x148) [0x7f9c71693388]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f9c71f9b6b5]
 7: (()+0x5e836) [0x7f9c71f99836]
 8: (()+0x5e863) [0x7f9c71f99863]
 9: (()+0x5eaa2) [0x7f9c71f99aa2]
 10: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x342) [0x1939a5a]
 11: (ECBackend::get_hash_info(hobject_t const&)+0x68c) [0x175bf70]
 12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2dc) [0x175ddd4]
 13: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x425) [0x167010b]
 14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x471) [0x150ba41]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x150efd8]
 16: (PG::scrub(ThreadPool::TPHandle&)+0x968) [0x150ded0]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x13451ac]
 18: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x13d8f35]
 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x1929636]
 20: (ThreadPool::WorkThread::entry()+0x23) [0x192d7cd]
 21: (Thread::entry_wrapper()+0x84) [0x1921d80]
 22: (Thread::_entry_func(void*)+0x18) [0x1921cf2]
 23: (()+0x8182) [0x7f9c7336e182]
 24: (clone()+0x6d) [0x7f9c7175430d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #7

Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Fix Under Review
Actions #8

Updated by Loïc Dachary over 9 years ago

  • % Done changed from 0 to 50
  • Backport set to firefly, giant
Actions #9

Updated by Loïc Dachary over 9 years ago

The tests should use the same approach as #9887, which requires https://github.com/ceph/ceph-qa-suite/compare/wip-dzaddscrub

Actions #10

Updated by Loïc Dachary over 9 years ago

Actions #11

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to 7
Actions #12

Updated by Loïc Dachary over 9 years ago

  • Status changed from 7 to In Progress
Actions #13

Updated by Loïc Dachary over 9 years ago

  • % Done changed from 50 to 80
Actions #14

Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #15

Updated by Yuri Weinstein over 9 years ago

The same issue occurred on next in run http://pulpito.ceph.com/teuthology-2014-11-22_17:18:02-upgrade:firefly-x-next-distro-basic-vps/
Job ['615228']

Assertion: osd/ECBackend.cc: 1461: FAILED assert(0 == "missing hash attr")
ceph version 0.88-230-g9ba17a3 (9ba17a321db06d3d76c9295e411c76842194b25c)
 1: (ECBackend::get_hash_info(hobject_t const&)+0xa0a) [0x9dff7a]
 2: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::a
 3: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x6fe) [0x8e988e]
 4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1d1b) [0x92bccb]
 5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2b7b) [0x93306b]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x4da) [0x8b8f6a]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x178) [0x6649c8]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59f) [0x66535f]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x742) [0xa44fd2]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa488b0]
 11: (()+0x79d1) [0x7f0cf67009d1]
 12: (clone()+0x6d) [0x7f0cf5488b6d]
Actions #16

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Loïc Dachary over 9 years ago

  • Status changed from Pending Backport to Fix Under Review
Actions #19

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Resolved
Actions #20

Updated by Samuel Just over 9 years ago

  • Status changed from Resolved to In Progress

Loic: are the tests for this in the regression suite yet?

Actions #22

Updated by Loïc Dachary about 9 years ago

  • Status changed from In Progress to Pending Backport
Actions #23

Updated by Loïc Dachary about 9 years ago

  • giant: e7faed5 osd: deep scrub must not abort if hinfo is missing
  • firefly: 73b47db osd: deep scrub must not abort if hinfo is missing

Actions #24

Updated by Loïc Dachary about 9 years ago

  • Status changed from Pending Backport to Resolved