Bug #10018 (closed)
OSD assertion failure if the hinfo_key xattr is not there (corrupted?) during scrubbing
Added by Guang Yang over 9 years ago. Updated about 9 years ago.
% Done: 80%
Description
We observed an OSD crash during scrubbing on an EC pool; the crash happens when the hinfo_key xattr of the object's file is absent.
Is this by design? Why not treat it as a read error, like other corruption (e.g. EIO when reading the file), so that the user can trigger a repair of the PG?
Ceph version: 0.80.4
Platform: RHEL6
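For illustration only, a hypothetical sketch of how the condition can be simulated in a development environment, assuming direct access to the OSD's ObjectStore handle (store), the PG collection (coll), the object (hoid) and the shard id (shard); none of these names come from this report, and the test commit linked further down takes its own approach:

// Hypothetical reproduction sketch: strip the hinfo_key xattr from
// one shard's object via an ObjectStore transaction. The names
// store, coll, hoid and shard are placeholders for handles a test
// harness would already hold.
ObjectStore::Transaction t;
t.rmattr(
  coll,
  ghobject_t(hoid, ghobject_t::NO_GEN, shard),
  ECUtil::get_hinfo_key());
store->apply_transaction(t);
// A subsequent deep scrub of the PG then hits
// FAILED assert(0 == "missing hash attr").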
Updated by Loïc Dachary over 9 years ago
- Status changed from New to Need More Info
- Assignee set to Loïc Dachary
Could you please attach the logs of the crashed OSD (the last 20,000 lines would be enough)?
Updated by Guang Yang over 9 years ago
Sorry, we don't have a verbose log from the crash, but the following is the code leading to it:
ECUtil::HashInfoRef ECBackend::get_hash_info(
  const hobject_t &hoid)
{
  dout(10) << __func__ << ": Getting attr on " << hoid << dendl;
  ECUtil::HashInfoRef ref = unstable_hashinfo_registry.lookup(hoid);
  if (!ref) {
    dout(10) << __func__ << ": not in cache " << hoid << dendl;
    struct stat st;
    int r = store->stat(
      hoid.is_temp() ? temp_coll : coll,
      ghobject_t(hoid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
      &st);
    ECUtil::HashInfo hinfo(ec_impl->get_chunk_count());
    if (r >= 0 && st.st_size > 0) {
      dout(10) << __func__ << ": found on disk, size " << st.st_size << dendl;
      bufferlist bl;
      r = store->getattr(
        hoid.is_temp() ? temp_coll : coll,
        ghobject_t(hoid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
        ECUtil::get_hinfo_key(),
        bl);
      if (r >= 0) {
        bufferlist::iterator bp = bl.begin();
        ::decode(hinfo, bp);
        assert(hinfo.get_total_chunk_size() == (unsigned)st.st_size);
      } else {
        assert(0 == "missing hash attr");
      }
    }
    ref = unstable_hashinfo_registry.lookup_or_create(hoid, hinfo);
  }
  return ref;
}
Here, if reading the hinfo_key xattr fails, the result is an assertion failure.
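For what it's worth, a minimal sketch of the behavior the description asks for, returning an error instead of asserting, so the object is reported as a scrub read error and can be repaired. This is only an illustration of the idea, not the patch under review; the caller-side handling in be_deep_scrub is assumed:

// Sketch of a non-fatal error path in get_hash_info(), replacing the
// asserts shown above -- illustrative only, not the merged fix:
if (r >= 0) {
  bufferlist::iterator bp = bl.begin();
  ::decode(hinfo, bp);
  if (hinfo.get_total_chunk_size() != (unsigned)st.st_size)
    return ECUtil::HashInfoRef(); // size mismatch: treat as corrupt
} else {
  dout(0) << __func__ << ": missing hinfo_key xattr on " << hoid << dendl;
  return ECUtil::HashInfoRef();   // missing attr: treat as corrupt
}

// Assumed caller-side handling in be_deep_scrub(): a null ref becomes
// a scrub read error instead of an OSD crash, so the user can run a
// repair on the PG.
ECUtil::HashInfoRef hinfo = get_hash_info(poid);
if (!hinfo) {
  o.read_error = true;
  return;
}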
Updated by Loïc Dachary over 9 years ago
- Status changed from Need More Info to 12
Test case that reproduces the problem: https://github.com/dachary/ceph/commit/5639303646418913ba0929ce73e8a5c611901919
0> 2014-11-06 14:21:07.217016 7f8c277c9700 -1 osd/ECBackend.cc: In function 'ECUtil::HashInfoRef ECBackend::get_hash_info(const hobject_t&)' thread 7f8c277c9700 time 2014-11-06 14:21:07.202149
osd/ECBackend.cc: 1457: FAILED assert(0 == "missing hash attr")
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x1890195]
2: (ECBackend::get_hash_info(hobject_t const&)+0x6ab) [0x16c411b]
3: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2d8) [0x16c5ef6]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x423) [0x15eaa29]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x458) [0x1477ac6]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x147beca]
7: (PG::scrub(ThreadPool::TPHandle&)+0x71e) [0x147a0d6]
8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x12d683e]
9: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x135f925]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x188267c]
11: (ThreadPool::WorkThread::entry()+0x23) [0x1884485]
12: (Thread::_entry_func(void*)+0x23) [0x187af2f]
13: (()+0x8182) [0x7f8c3d124182]
14: (clone()+0x6d) [0x7f8c3b4c530d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Loïc Dachary over 9 years ago
Also on 0.80.7
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x164e305]
2: ceph-osd() [0x1770b14]
3: (()+0x10340) [0x7f8dafeaf340]
4: (gsignal()+0x39) [0x7f8dae183f79]
5: (abort()+0x148) [0x7f8dae187388]
6: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8daea8f6b5]
7: (()+0x5e836) [0x7f8daea8d836]
8: (()+0x5e863) [0x7f8daea8d863]
9: (()+0x5eaa2) [0x7f8daea8daa2]
10: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x342) [0x18b989c]
11: (ECBackend::get_hash_info(hobject_t const&)+0x6ab) [0x16ebdd9]
12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2d8) [0x16edbce]
13: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x423) [0x160565f]
14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x471) [0x14a06e9]
15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x14a4e34]
16: (PG::scrub(ThreadPool::TPHandle&)+0x9e9) [0x14a3007]
17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x12eeaf6]
18: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x137963d]
19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x18ab672]
20: (ThreadPool::WorkThread::entry()+0x23) [0x18ad8df]
21: (Thread::entry_wrapper()+0x79) [0x18a3eab]
22: (Thread::_entry_func(void*)+0x18) [0x18a3e28]
23: (()+0x8182) [0x7f8dafea7182]
24: (clone()+0x6d) [0x7f8dae24830d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
but not on master.
Updated by Loïc Dachary over 9 years ago
Actually it happens on master; my test was incorrect and is now fixed: https://github.com/dachary/ceph/commit/312cdad05fc2cafa3d4f6c3f6ae49f53dcc13bf4
osd/ECBackend.cc: 1461: FAILED assert(0 == "missing hash attr")
ceph version 0.87-610-gefe243d (efe243ddd2fa62c19e455d77cdedcc5dda6bafd6)
1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x16bc14b]
2: ceph-osd() [0x17e6775]
3: (()+0x10340) [0x7f9c73376340]
4: (gsignal()+0x39) [0x7f9c7168ff79]
5: (abort()+0x148) [0x7f9c71693388]
6: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f9c71f9b6b5]
7: (()+0x5e836) [0x7f9c71f99836]
8: (()+0x5e863) [0x7f9c71f99863]
9: (()+0x5eaa2) [0x7f9c71f99aa2]
10: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x342) [0x1939a5a]
11: (ECBackend::get_hash_info(hobject_t const&)+0x68c) [0x175bf70]
12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x2dc) [0x175ddd4]
13: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x425) [0x167010b]
14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x471) [0x150ba41]
15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1012) [0x150efd8]
16: (PG::scrub(ThreadPool::TPHandle&)+0x968) [0x150ded0]
17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x28) [0x13451ac]
18: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x13d8f35]
19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x1929636]
20: (ThreadPool::WorkThread::entry()+0x23) [0x192d7cd]
21: (Thread::entry_wrapper()+0x84) [0x1921d80]
22: (Thread::_entry_func(void*)+0x18) [0x1921cf2]
23: (()+0x8182) [0x7f9c7336e182]
24: (clone()+0x6d) [0x7f9c7175430d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Loïc Dachary over 9 years ago
- Status changed from 12 to Fix Under Review
Updated by Loïc Dachary over 9 years ago
- % Done changed from 0 to 50
- Backport set to firefly, giant
Updated by Loïc Dachary over 9 years ago
The tests should use the same approach as #9887, which requires https://github.com/ceph/ceph-qa-suite/compare/wip-dzaddscrub
Updated by Loïc Dachary over 9 years ago
- how to use ceph_objectstore_tool: https://github.com/ceph/ceph-qa-suite/blob/giant/tasks/ceph_objectstore_tool.py
- https://github.com/ceph/ceph-qa-suite/blob/giant/tasks/ceph_manager.py should have high-level functions to inject errors using ceph_objectstore_tool
Updated by Loïc Dachary over 9 years ago
- Status changed from Fix Under Review to In Progress
Updated by Loïc Dachary over 9 years ago
- % Done changed from 50 to 80
teuthology tests: https://github.com/ceph/ceph-qa-suite/pull/250
Updated by Loïc Dachary over 9 years ago
- Status changed from In Progress to Fix Under Review
Updated by Yuri Weinstein over 9 years ago
Same issue on the next branch, in run http://pulpito.ceph.com/teuthology-2014-11-22_17:18:02-upgrade:firefly-x-next-distro-basic-vps/
Job 615228
Assertion: osd/ECBackend.cc: 1461: FAILED assert(0 == "missing hash attr")
ceph version 0.88-230-g9ba17a3 (9ba17a321db06d3d76c9295e411c76842194b25c)
1: (ECBackend::get_hash_info(hobject_t const&)+0xa0a) [0x9dff7a]
2: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::a
3: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x6fe) [0x8e988e]
4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1d1b) [0x92bccb]
5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2b7b) [0x93306b]
6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x4da) [0x8b8f6a]
7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x178) [0x6649c8]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59f) [0x66535f]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x742) [0xa44fd2]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa488b0]
11: (()+0x79d1) [0x7f0cf67009d1]
12: (clone()+0x6d) [0x7f0cf5488b6d]
Updated by Loïc Dachary over 9 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Loïc Dachary over 9 years ago
- firefly backport https://github.com/ceph/ceph/pull/3009
- giant backport https://github.com/ceph/ceph/pull/3010
Updated by Loïc Dachary over 9 years ago
- Status changed from Pending Backport to Fix Under Review
Updated by Samuel Just over 9 years ago
- Status changed from Fix Under Review to Resolved
Updated by Samuel Just over 9 years ago
- Status changed from Resolved to In Progress
Loic: are the tests for this in the regression suite yet?
Updated by Loïc Dachary over 9 years ago
It's waiting on https://github.com/ceph/ceph-qa-suite/pull/250
Updated by Loïc Dachary over 9 years ago
- Status changed from In Progress to Pending Backport
Updated by Loïc Dachary about 9 years ago
- Status changed from Pending Backport to Resolved