Bug #38784

closed

osd: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT)) in PrimaryLogPG::get_object_context()

Added by Neha Ojha about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous,mimic,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-03-14T01:34:27.455 INFO:tasks.ceph.osd.3.smithi131.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3842-g08f436c/rpm/el7/BUILD/ceph-14.0.1-3842-g08f436c/src/osd/PrimaryLogPG.cc: In function 'ObjectContextRef PrimaryLogPG::get_object_context(const hobject_t&, bool, const std::map<std::basic_string<char>, ceph::buffer::list>*)' thread 7fe774509700 time 2019-03-14 01:34:27.471330
2019-03-14T01:34:27.456 INFO:tasks.ceph.osd.3.smithi131.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-3842-g08f436c/rpm/el7/BUILD/ceph-14.0.1-3842-g08f436c/src/osd/PrimaryLogPG.cc: 10998: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT))
2019-03-14T01:34:27.456 INFO:tasks.ceph.osd.3.smithi131.stderr: ceph version 14.0.1-3842-g08f436c (08f436c23590d7ac8ad260a72c6b942440ace5a6) nautilus (dev)
2019-03-14T01:34:27.456 INFO:tasks.ceph.osd.3.smithi131.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55a3af940acc]
2019-03-14T01:34:27.456 INFO:tasks.ceph.osd.3.smithi131.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55a3af940c9a]
2019-03-14T01:34:27.457 INFO:tasks.ceph.osd.3.smithi131.stderr: 3: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > const*)+0x62c) [0x55a3afbe427c]
2019-03-14T01:34:27.457 INFO:tasks.ceph.osd.3.smithi131.stderr: 4: (PrimaryLogPG::prep_object_replica_deletes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*, bool*)+0x84) [0x55a3afbf6934]
2019-03-14T01:34:27.457 INFO:tasks.ceph.osd.3.smithi131.stderr: 5: (PrimaryLogPG::recover_replicas(unsigned long, ThreadPool::TPHandle&, bool*)+0xb74) [0x55a3afbf8e94]
2019-03-14T01:34:27.457 INFO:tasks.ceph.osd.3.smithi131.stderr: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x1bd) [0x55a3afc44ebd]
2019-03-14T01:34:27.457 INFO:tasks.ceph.osd.3.smithi131.stderr: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x363) [0x55a3afa7cae3]
2019-03-14T01:34:27.457 INFO:tasks.ceph.osd.3.smithi131.stderr: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55a3afd138f9]
2019-03-14T01:34:27.457 INFO:tasks.ceph.osd.3.smithi131.stderr: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa0c) [0x55a3afa9a1ac]
2019-03-14T01:34:27.458 INFO:tasks.ceph.osd.3.smithi131.stderr: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55a3b008c803]
2019-03-14T01:34:27.458 INFO:tasks.ceph.osd.3.smithi131.stderr: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a3b008f8a0]
2019-03-14T01:34:27.458 INFO:tasks.ceph.osd.3.smithi131.stderr: 12: (()+0x7dd5) [0x7fe79dd2cdd5]
2019-03-14T01:34:27.458 INFO:tasks.ceph.osd.3.smithi131.stderr: 13: (clone()+0x6d) [0x7fe79cbf2ead]

/a/nojha-2019-03-14_00:36:19-rados:thrash-erasure-code-wip-2-36739-2019-03-13-distro-basic-smithi/3718898/

The problem seems to be that on the primary, osd.3, we have 1 missing object and 1 unfound object, and they are not the same object.
In start_recovery_ops(), since num_missing == num_unfound, we end up calling recover_replicas() before recovering the missing object on the primary. As a result, the object is still in the primary's missing set when we try to recover it on a replica, so !pg_log.get_missing().is_missing(soid) is false and the assert in get_object_context() fires.
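
To make the failing condition concrete, here is a minimal sketch of the three alternatives in the assert. The struct and function names are hypothetical and only model the boolean logic; they are not Ceph types. The crash case assumes what the failed assert implies: no attrs were passed, the object was still missing on the primary, and its latest log entry was not a LOST_REVERT.

```cpp
// Hypothetical model of the inputs to the assert in get_object_context().
struct ContextLookup {
  bool has_attrs;            // caller passed a non-null attrs map
  bool missing_on_primary;   // pg_log.get_missing().is_missing(soid)
  bool last_op_lost_revert;  // latest log entry for soid is LOST_REVERT
};

// The assert passes if any one of the three alternatives is true.
constexpr bool assertion_holds(const ContextLookup& q) {
  return q.has_attrs || !q.missing_on_primary || q.last_op_lost_revert;
}

// State implied by the crash: all three alternatives are false.
constexpr ContextLookup crash_case() {
  return {false, true, false};
}
```

Had the object been recovered on the primary first, missing_on_primary would be false and the second alternative alone would satisfy the assert.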

We should account for the unfound object in the missing set of the primary. That way we would have had num_missing=2, and we would have recovered the object that is only missing (not unfound) on the primary before recovering it on replicas.
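
The effect of this accounting change on the ordering decision can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the actual patch: PrimaryView, choose_next_step(), count_unfound_as_missing() and the object names are all hypothetical, and the real code works with the PG's missing set and unfound count rather than plain std::set values.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified view of the primary's recovery bookkeeping.
struct PrimaryView {
  std::set<std::string> missing;  // objects in the primary's missing set
  std::set<std::string> unfound;  // objects with no known recovery source
};

enum class NextStep { RecoverPrimary, RecoverReplicas };

// Count-based decision as described in this report: equal counts are taken
// to mean "everything missing is unfound", so replicas are recovered first.
NextStep choose_next_step(const PrimaryView& v) {
  return v.missing.size() == v.unfound.size() ? NextStep::RecoverReplicas
                                              : NextStep::RecoverPrimary;
}

// Proposed accounting: every unfound object also counts as missing.
PrimaryView count_unfound_as_missing(PrimaryView v) {
  v.missing.insert(v.unfound.begin(), v.unfound.end());
  // After the merge, missing can never be smaller than unfound.
  assert(v.missing.size() >= v.unfound.size());
  return v;
}

// Scenario from this report: one missing and one unfound object, distinct.
PrimaryView reported_scenario() {
  return PrimaryView{{"obj_missing"}, {"obj_unfound"}};
}
```

In this model, choose_next_step(reported_scenario()) picks RecoverReplicas because both counts are 1, which matches the crash. After count_unfound_as_missing() the missing set has 2 entries against 1 unfound, so the primary is recovered first.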


Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #39218: luminous: osd: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT)) in PrimaryLogPG::get_object_context() (Resolved, Prashant D)
Copied to RADOS - Backport #39219: nautilus: osd: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT)) in PrimaryLogPG::get_object_context() (Resolved, Prashant D)
Copied to RADOS - Backport #39220: mimic: osd: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT)) in PrimaryLogPG::get_object_context() (Resolved, Prashant D)
Actions #1

Updated by Neha Ojha about 5 years ago

  • Status changed from New to Fix Under Review
  • Backport set to luminous,mimic,nautilus
  • Pull request ID set to 27205
Actions #2

Updated by xie xingguo about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #3

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39218: luminous: osd: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT)) in PrimaryLogPG::get_object_context() added
Actions #4

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39219: nautilus: osd: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT)) in PrimaryLogPG::get_object_context() added
Actions #5

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39220: mimic: osd: FAILED ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) || (it_objects != pg_log.get_log().objects.end() && it_objects->second->op == pg_log_entry_t::LOST_REVERT)) in PrimaryLogPG::get_object_context() added
Actions #6

Updated by Prashant D almost 5 years ago

  • Status changed from Pending Backport to In Progress
  • Assignee set to Prashant D
Actions #7

Updated by Prashant D almost 5 years ago

  • Status changed from In Progress to Pending Backport
  • Assignee changed from Prashant D to Neha Ojha
Actions #8

Updated by Nathan Cutler almost 5 years ago

  • Status changed from Pending Backport to Resolved