Bug #18365

failed_push does not update missing set

Added by Sage Weil about 7 years ago. Updated about 7 years ago.

Status: Duplicate
Priority: Immediate
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The symptom is an unfound object, caused by this:

2016-12-29 08:17:06.482842 7fa9bcbc2700 10 osd.2 pg_epoch: 766683 pg[0.340( v 766680'1102981 lc 0'0 (762338'1040321,766680'1102981] local-les=766683 n=11774 ec=1 les/c/f 766683/762355/735559 766681/766682/766682) [2,4,65]/[2,4,41] r=0 lpr=766682 pi=762350-766681/81 rops=1 bft=65 crt=766680'1102981 mlcod 0'0 active+recovering+degraded+remapped m=2] handle_pull_response ObjectRecoveryInfo(MIN@0'0, size: 0, copy_subset: [], clone_subset: {})ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false) data.size() is 0 data_included: []
2016-12-29 08:17:06.482882 7fa9bcbc2700 20 osd.2 pg_epoch: 766683 pg[0.340( v 766680'1102981 lc 0'0 (762338'1040321,766680'1102981] local-les=766683 n=11774 ec=1 les/c/f 766683/762355/735559 766681/766682/766682) [2,4,65]/[2,4,41] r=0 lpr=766682 pi=762350-766681/81 rops=1 bft=65 crt=766680'1102981 mlcod 0'0 active+recovering+degraded+remapped m=2] failed_push: 0:c8cdbbe2:::10002a406b9.00000000:head
2016-12-29 08:17:06.482924 7fa9bcbc2700  0 osd.2 pg_epoch: 766683 pg[0.340( v 766680'1102981 lc 0'0 (762338'1040321,766680'1102981] local-les=766683 n=11774 ec=1 les/c/f 766683/762355/735559 766681/766682/766682) [2,4,65]/[2,4,41] r=0 lpr=766682 pi=762350-766681/81 rops=1 bft=65 crt=766680'1102981 mlcod 0'0 active+recovering+degraded+remapped m=2 u=1] failed_push 0:c8cdbbe2:::10002a406b9.00000000:head from shard 41, reps on  unfound? 1

and then later, if you try to mark_unfound_lost revert, you get an assert:
  -567> 2016-12-29 08:04:28.031323 7fe0e4524700 10 osd.2 pg_epoch: 766678 pg[0.340( v 766673'1102980 lc 0'0 (762338'1040321,766673'1102980] local-les=766678 n=11773 ec=1 les/c/f 766678/762355/735559 766676/766677/766677) [2,4,65]/[2,4,41] r=0 lpr=766677 pi=762350-766676/78 bft=65 crt=766673'1102980 mlcod 0'0 active+recovering+degraded+remapped m=1 u=1] all_unfound_are_queried_or_lost all of might_have_unfound 4,8,41,65 have been queried or are marked lost
  -566> 2016-12-29 08:04:28.031366 7fe0e4524700  3 osd.2 pg_epoch: 766678 pg[0.340( v 766673'1102980 lc 0'0 (762338'1040321,766673'1102980] local-les=766678 n=11773 ec=1 les/c/f 766678/762355/735559 766676/766677/766677) [2,4,65]/[2,4,41] r=0 lpr=766677 pi=762350-766676/78 bft=65 crt=766673'1102980 mlcod 0'0 active+recovering+degraded+remapped m=1 u=1] mark_all_unfound_lost l_revert
  -565> 2016-12-29 08:04:28.031393 7fe0e4524700 10 osd.2 pg_epoch: 766678 pg[0.340( v 766673'1102980 lc 0'0 (762338'1040321,766673'1102980] local-les=766678 n=11773 ec=1 les/c/f 766678/762355/735559 766676/766677/766677) [2,4,65]/[2,4,41] r=0 lpr=766677 pi=762350-766676/78 bft=65 crt=766673'1102980 mlcod 0'0 active+recovering+degraded+remapped m=1 u=1] pick_newest_available 0:c8cdbbe2:::10002a406b9.00000000:head 0'0 on osd.2 (local)
  -564> 2016-12-29 08:04:28.031420 7fe0e4524700 10 osd.2 pg_epoch: 766678 pg[0.340( v 766673'1102980 lc 0'0 (762338'1040321,766673'1102980] local-les=766678 n=11773 ec=1 les/c/f 766678/762355/735559 766676/766677/766677) [2,4,65]/[2,4,41] r=0 lpr=766677 pi=762350-766676/78 bft=65 crt=766673'1102980 mlcod 0'0 active+recovering+degraded+remapped m=1 u=1] pick_newest_available 0:c8cdbbe2:::10002a406b9.00000000:head 353630'446930 on osd.4
   -15> 2016-12-29 08:04:28.061301 7fe0e4524700 -1 /build/ceph-11.1.0-6147-g12706d7/src/osd/PrimaryLogPG.cc: In function 'eversion_t PrimaryLogPG::pick_newest_available(const hobject_t&)' thread 7fe0e4524700 time 2016-12-29 08:04:28.031456
/build/ceph-11.1.0-6147-g12706d7/src/osd/PrimaryLogPG.cc: 9815: FAILED assert(is_backfill_targets(peer))

 ceph version 11.1.0-6147-g12706d7 (12706d76225fa1491d00362d3bc04e0541dead73)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fe10f892b7b]
 2: (PrimaryLogPG::pick_newest_available(hobject_t const&)+0x265) [0x7fe10f35ed75]
 3: (PrimaryLogPG::mark_all_unfound_lost(int, boost::intrusive_ptr<Connection>, unsigned long)+0x83b) [0x7fe10f35fb7b]
 4: (PrimaryLogPG::do_command(std::map<std::string, boost::variant<std::string<bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> > > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::string<bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> > > > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list, boost::intrusive_ptr<Connection>, unsigned long)+0xbf3) [0x7fe10f3c37c3]
 5: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x2947) [0x7fe10f259e07]
 6: (OSD::CommandWQ::_process(OSD::Command*, ThreadPool::TPHandle&)+0x49) [0x7fe10f29abb9]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0x7fe10f899655]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7fe10f89a620]
 9: (()+0x8182) [0x7fe10de8a182]

because pick_newest_available assumes that if an unfound object isn't missing on a peer, then that peer must be a backfill target.

I think the fix is for the failed_push() helper to add the object to the peer's missing set...

(this is pg 0.340 on the lab cluster)
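
To make the failure mode concrete, here is a toy Python model of the assumption and of the proposed fix. It is only a sketch: the function signatures, the dict-based missing sets and the (0, 0) placeholder version are hypothetical stand-ins, not Ceph's code. The peer ids and versions are borrowed from the log lines above purely for illustration.

```python
# Toy model of the assumption described in this report. All names are
# hypothetical stand-ins; this is not Ceph's implementation.


def pick_newest_available(oid, local_have, peer_missing, backfill_targets):
    """Return the newest version of `oid` that any shard still holds.

    peer_missing maps peer id -> {oid: version that peer still has}.
    Versions are (epoch, version) tuples, so (0, 0) stands for 0'0.
    """
    newest = local_have
    for peer, missing in sorted(peer_missing.items()):
        if oid not in missing:
            # The assumption from the report: an unfound object that is not
            # in a peer's missing set can only belong to a backfill target.
            assert peer in backfill_targets, f"is_backfill_targets({peer})"
            continue
        newest = max(newest, missing[oid])
    return newest


def failed_push(oid, peer, peer_missing, record_missing):
    """Handle a failed pull from `peer`.

    With record_missing=True this applies the proposed fix: the primary
    also records `oid` in its copy of the peer's missing set.
    """
    if record_missing:
        # (0, 0) is a placeholder meaning "peer has no usable copy".
        peer_missing.setdefault(peer, {})[oid] = (0, 0)


def revert_after_failed_pull(record_missing):
    """Replay the sequence from the report and return the chosen version."""
    oid = "10002a406b9.00000000:head"
    # Primary's view before the pull: osd.4 is missing the object,
    # osd.41 is believed to have it, osd.65 is the backfill target.
    peer_missing = {4: {oid: (353630, 446930)}, 41: {}, 65: {}}
    failed_push(oid, 41, peer_missing, record_missing)
    return pick_newest_available(
        oid, (0, 0), peer_missing, backfill_targets={65}
    )
```

Calling revert_after_failed_pull(record_missing=False) trips the assertion on peer 41, which mirrors the crash; with record_missing=True the loop completes and falls back to the version held by osd.4.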


Related issues

Duplicates RADOS - Bug #18165: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) Resolved 12/07/2016

History

#1 Updated by Samuel Just about 7 years ago

This is kind of weird. Really, missing_loc is what's supposed to be the location-of-record for the ephemeral state related to who has what objects. We do update that in failed_push. The problem is that once it becomes empty, pick_newest_available goes back to using the missing sets directly. I suppose we can just update the missing sets as well in failed_push, but I'm a bit worried about letting the primary's copy of the replica's missing set diverge from the replica's. I guess nothing for it though.
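
The divergence concern in this comment can be illustrated with a small Python sketch. Everything here is hypothetical (the PrimaryView class, the set-based structures, the diverged helper), and it assumes the replica's own missing set does not list the object; it is a toy model, not how Ceph stores this state.

```python
# Toy model only; class and method names are hypothetical, not Ceph's.


class PrimaryView:
    """The primary's bookkeeping for one PG."""

    def __init__(self, missing_loc, peer_missing):
        # oid -> set of peers believed to hold a usable copy
        self.missing_loc = missing_loc
        # peer -> set of oids the primary believes that peer lacks
        self.peer_missing = peer_missing

    def is_unfound(self, oid):
        # In this model an object is unfound once no location remains.
        return not self.missing_loc.get(oid)

    def failed_push(self, oid, peer, update_missing_set=False):
        # Per the comment, missing_loc is already updated on a failed push.
        self.missing_loc.get(oid, set()).discard(peer)
        if update_missing_set:
            # The extra step under discussion: also touch the primary's
            # copy of the peer's missing set.
            self.peer_missing.setdefault(peer, set()).add(oid)


def diverged(primary, replica_missing):
    """Return peers whose own missing set differs from the primary's copy."""
    return {
        peer
        for peer, own in replica_missing.items()
        if primary.peer_missing.get(peer, set()) != own
    }
```

In this model, a failed push that only updates missing_loc leaves the object unfound with both views of the missing set in agreement, while passing update_missing_set=True makes diverged() report the peer, which is the trade-off being accepted.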

#2 Updated by Samuel Just about 7 years ago

  • Status changed from 12 to 7

#3 Updated by Samuel Just about 7 years ago

  • Duplicates Bug #18165: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) added

#4 Updated by Samuel Just about 7 years ago

  • Status changed from 7 to Duplicate
