Bug #10018

OSD assertion failure if the hinfo_key xattr is not there (corrupted?) during scrubbing

Added by Guang Yang over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

80%

Source:
other
Tags:
Backport:
firefly, giant
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed an OSD crash during scrubbing on an EC pool; the crash happens when the hinfo_key xattr of the file is absent.

Is this by design? Why not treat it as a read error, as with other corruption (e.g. EIO when reading the file), so that the user can trigger a repair for the PG?

Ceph version: 0.80.4
Platform: RHEL6


Related issues 2 (1 open, 1 closed)

Related to Ceph - Bug #10017: OSD wrongly marks object as unfound if only the primary is corrupted for EC pool (Resolved, Loïc Dachary, 11/05/2014)

Related to Ceph - Feature #10044: ECUtil::HashInfoRef should have a NONE value (New, 11/10/2014)
Actions #1

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Need More Info
  • Assignee set to Loïc Dachary

Could you please attach the logs of the crashed OSD (the last 20,000 lines would be enough)?

Actions #2

Updated by Loïc Dachary over 9 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Guang Yang over 9 years ago

Sorry, we don't have a verbose log from the crash, but the following is the code that leads to it:

ECUtil::HashInfoRef ECBackend::get_hash_info(
  const hobject_t &hoid)
{
  dout(10) << __func__ << ": Getting attr on " << hoid << dendl;
  ECUtil::HashInfoRef ref = unstable_hashinfo_registry.lookup(hoid);
  if (!ref) {
    dout(10) << __func__ << ": not in cache " << hoid << dendl;
    struct stat st;
    int r = store->stat(
      hoid.is_temp() ? temp_coll : coll,
      ghobject_t(hoid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
      &st);
    ECUtil::HashInfo hinfo(ec_impl->get_chunk_count());
    if (r >= 0 && st.st_size > 0) {
      dout(10) << __func__ << ": found on disk, size " << st.st_size << dendl;
      bufferlist bl;
      r = store->getattr(
        hoid.is_temp() ? temp_coll : coll,
        ghobject_t(hoid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
        ECUtil::get_hinfo_key(),
        bl);
      if (r >= 0) {
        bufferlist::iterator bp = bl.begin();
        ::decode(hinfo, bp);
        assert(hinfo.get_total_chunk_size() == (unsigned)st.st_size);
      } else {
        assert(0 == "missing hash attr");
      }
    }
    ref = unstable_hashinfo_registry.lookup_or_create(hoid, hinfo);
  }
  return ref;
}

Here, if it fails to fetch the hinfo_key xattr, the result is an assertion failure.

Actions #4

Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to 12

Test case that reproduces the problem: https://github.com/dachary/ceph/commit/5639303646418913ba0929ce73e8a5c611901919

     0> 2014-11-06 14:21:07.217016 7f8c277c9700 -1 osd/ECBackend.cc: In function 'ECUtil::HashInfoRef ECBackend::get_hash_info(const hobject_t&)' thread 7f8c277c9700 time 2014-11-06 14:21:07.202149
osd/ECBackend.cc: 1457: FAILED assert(0 == "missing hash attr")

 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x1890195]
 2: (ECBackend::get_hash_info(hobject_t const&)+0x6ab) [0x16c411b]
 3: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2d8) [0x16c5ef6]
 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x423) [0x15eaa29]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x458) [0x1477ac6]
 6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x147beca]
 7: (PG::scrub(ThreadPool::TPHandle&)+0x71e) [0x147a0d6]
 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x12d683e]
 9: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x135f925]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x188267c]
 11: (ThreadPool::WorkThread::entry()+0x23) [0x1884485]
 12: (Thread::_entry_func(void*)+0x23) [0x187af2f]
 13: (()+0x8182) [0x7f8c3d124182]
 14: (clone()+0x6d) [0x7f8c3b4c530d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #5

Updated by Loïc Dachary over 9 years ago

Also on 0.80.7

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x164e305]
 2: ceph-osd() [0x1770b14]
 3: (()+0x10340) [0x7f8dafeaf340]
 4: (gsignal()+0x39) [0x7f8dae183f79]
 5: (abort()+0x148) [0x7f8dae187388]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8daea8f6b5]
 7: (()+0x5e836) [0x7f8daea8d836]
 8: (()+0x5e863) [0x7f8daea8d863]
 9: (()+0x5eaa2) [0x7f8daea8daa2]
 10: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x342) [0x18b989c]
 11: (ECBackend::get_hash_info(hobject_t const&)+0x6ab) [0x16ebdd9]
 12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2d8) [0x16edbce]
 13: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x423) [0x160565f]
 14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x471) [0x14a06e9]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x14a4e34]
 16: (PG::scrub(ThreadPool::TPHandle&)+0x9e9) [0x14a3007]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x12eeaf6]
 18: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x137963d]
 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x18ab672]
 20: (ThreadPool::WorkThread::entry()+0x23) [0x18ad8df]
 21: (Thread::entry_wrapper()+0x79) [0x18a3eab]
 22: (Thread::_entry_func(void*)+0x18) [0x18a3e28]
 23: (()+0x8182) [0x7f8dafea7182]
 24: (clone()+0x6d) [0x7f8dae24830d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

but not on master.

Actions #6

Updated by Loïc Dachary over 9 years ago

Actually it happens on master as well; my test was incorrect and is now fixed: https://github.com/dachary/ceph/commit/312cdad05fc2cafa3d4f6c3f6ae49f53dcc13bf4

osd/ECBackend.cc: 1461: FAILED assert(0 == "missing hash attr")

 ceph version 0.87-610-gefe243d (efe243ddd2fa62c19e455d77cdedcc5dda6bafd6)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x16bc14b]
 2: ceph-osd() [0x17e6775]
 3: (()+0x10340) [0x7f9c73376340]
 4: (gsignal()+0x39) [0x7f9c7168ff79]
 5: (abort()+0x148) [0x7f9c71693388]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f9c71f9b6b5]
 7: (()+0x5e836) [0x7f9c71f99836]
 8: (()+0x5e863) [0x7f9c71f99863]
 9: (()+0x5eaa2) [0x7f9c71f99aa2]
 10: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x342) [0x1939a5a]
 11: (ECBackend::get_hash_info(hobject_t const&)+0x68c) [0x175bf70]
 12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2dc) [0x175ddd4]
 13: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x425) [0x167010b]
 14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x471) [0x150ba41]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x150efd8]
 16: (PG::scrub(ThreadPool::TPHandle&)+0x968) [0x150ded0]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x13451ac]
 18: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x13d8f35]
 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x1929636]
 20: (ThreadPool::WorkThread::entry()+0x23) [0x192d7cd]
 21: (Thread::entry_wrapper()+0x84) [0x1921d80]
 22: (Thread::_entry_func(void*)+0x18) [0x1921cf2]
 23: (()+0x8182) [0x7f9c7336e182]
 24: (clone()+0x6d) [0x7f9c7175430d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #7

Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Fix Under Review
Actions #8

Updated by Loïc Dachary over 9 years ago

  • % Done changed from 0 to 50
  • Backport set to firefly, giant
Actions #9

Updated by Loïc Dachary over 9 years ago

The tests should use the same approach as #9887, which requires https://github.com/ceph/ceph-qa-suite/compare/wip-dzaddscrub

Actions #10

Updated by Loïc Dachary over 9 years ago

Actions #11

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to 7
Actions #12

Updated by Loïc Dachary over 9 years ago

  • Status changed from 7 to In Progress
Actions #13

Updated by Loïc Dachary over 9 years ago

  • % Done changed from 50 to 80
Actions #14

Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Fix Under Review
Actions #15

Updated by Yuri Weinstein over 9 years ago

The same issue occurred on next in run http://pulpito.ceph.com/teuthology-2014-11-22_17:18:02-upgrade:firefly-x-next-distro-basic-vps/
Job ['615228']

Assertion: osd/ECBackend.cc: 1461: FAILED assert(0 == "missing hash attr")
ceph version 0.88-230-g9ba17a3 (9ba17a321db06d3d76c9295e411c76842194b25c)
 1: (ECBackend::get_hash_info(hobject_t const&)+0xa0a) [0x9dff7a]
 2: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::a
 3: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x6fe) [0x8e988e]
 4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1d1b) [0x92bccb]
 5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2b7b) [0x93306b]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x4da) [0x8b8f6a]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x178) [0x6649c8]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59f) [0x66535f]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x742) [0xa44fd2]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa488b0]
 11: (()+0x79d1) [0x7f0cf67009d1]
 12: (clone()+0x6d) [0x7f0cf5488b6d]
Actions #16

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Loïc Dachary over 9 years ago

  • Status changed from Pending Backport to Fix Under Review
Actions #19

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Resolved
Actions #20

Updated by Samuel Just over 9 years ago

  • Status changed from Resolved to In Progress

Loic: are the tests for this in the regression suite yet?

Actions #22

Updated by Loïc Dachary about 9 years ago

  • Status changed from In Progress to Pending Backport
Actions #23

Updated by Loïc Dachary about 9 years ago

  • giant: e7faed5 osd: deep scrub must not abort if hinfo is missing
  • firefly: 73b47db osd: deep scrub must not abort if hinfo is missing

Actions #24

Updated by Loïc Dachary about 9 years ago

  • Status changed from Pending Backport to Resolved