Bug #18165 (closed): OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

Added by Pawel Sadowski over 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Zafman
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: kraken,jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-12-02 08:50:07.309361 7f10cd434700  0 -- 10.99.186.66:6800/169872 >> 10.99.41.129:0/439883285 pipe(0x4ffdc000 sd=9923 :6800 s=0 pgs=0 cs=0 l=1 c=0x24941340).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:50:44.862346 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:00.769950 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:02.721326 7f108e3b4700  0 -- 10.99.186.66:6800/169872 >> 10.99.43.96:0/1289572096 pipe(0x2cd00000 sd=100 :6800 s=0 pgs=0 cs=0 l=1 c=0x4fb8b600).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:03.289494 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:06.628331 7f10c329a700  0 -- 10.99.186.66:6800/169872 >> 10.99.54.194:0/3692079157 pipe(0x5864f000 sd=5569 :6800 s=0 pgs=0 cs=0 l=1 c=0x1a96eb00).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:17.397357 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:22.214365 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:26.824599 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:38.211395 7f101fc8a700  0 -- 10.99.245.64:6800/169872 >> 10.99.243.116:6800/1154798 pipe(0x38aef000 sd=5409 :44998 s=2 pgs=2481 cs=21 l=0 c=0x30e0ba20).fault with nothing to send, going to standby
2016-12-02 08:51:51.859851 7f114ca20700  0 osd.194 1401127 do_command r=0 
2016-12-02 08:51:51.925621 7f114ca20700 -1 osd/ReplicatedPG.cc: In function 'eversion_t ReplicatedPG::pick_newest_available(const hobject_t&)' thread 7f114ca20700 time 2016-12-02 08:51:51.915215
osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba8b8b]
 2: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x83bc0f]
 3: (ReplicatedPG::mark_all_unfound_lost(int)+0x101d) [0x85f45d]
 4: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xc3c) [0x82c34c]
 5: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x2417) [0x69eed7]
 6: (OSD::CommandWQ::_process(OSD::Command*)+0x59) [0x6fead9]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xb995ee]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xb9a690]
 9: (()+0x8182) [0x7f117f40d182]
 10: (clone()+0x6d) [0x7f117d97830d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I just hit this assert on Hammer 0.94.6. It was reproducible on every mark_unfound_lost revert. Since revert was not working I tried delete instead, and another OSD crashed and restarted. After removing the second unfound object, yet another OSD crashed and did not start again.
Complete timeline of events:

  • cluster grown by 50% of OSDs
  • some problems during rebalance
  • during troubleshooting all OSDs were restarted; two OSDs died with FAILED assert(oi.version == i->first) (#17916)
  • a few hours later another OSD crashed with #17916 symptoms -- killed by the OOM killer
  • cluster almost finished rebalance except for two unfound objects <scope of this bug begins>
  • tried ceph pg 3.1568 mark_unfound_lost revert -- OSD 194 crashes every time; logs attached (from all three OSDs in this PG: 194, 301, 202)
  • tried ceph pg 3.1568 mark_unfound_lost delete -- OSD 413 (part of the acting set) crashed (FAILED assert(info.last_complete == info.last_update)) and restarted
  • cluster recovered quickly
  • tried ceph pg 3.3e66 mark_unfound_lost delete -- OSD 206 crashed (FAILED assert(head_obc)) and didn't start again

mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)
Logs uploaded with ceph-post-file:

ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316

Related issues (4: 0 open, 4 closed)

  • Has duplicate: Ceph - Bug #18365: failed_push does not update missing set (Duplicate, Samuel Just, 12/29/2016)
  • Has duplicate: Ceph - Bug #16259: ceph osd processes crashes when doing a revert on unfound object (Duplicate, 06/13/2016)
  • Copied to: Ceph - Backport #18567: kraken: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) (Resolved, Samuel Just)
  • Copied to: Ceph - Backport #18568: jewel: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) (Rejected, David Zafman)

#1 - Updated by Samuel Just over 7 years ago

  • Status changed from New to Won't Fix

This machinery was rewritten in jewel to make it work properly in this case. The fix can't really be backported to hammer, so I'm marking this one won't fix. Please reopen if reproducible on jewel.

#2 - Updated by Samuel Just over 7 years ago

  • Status changed from Won't Fix to 12
  • Priority changed from Normal to High

The bug is still present and fairly straightforward: it's possible that the newest version isn't on an osd in the current up or acting sets. It might be as simple as removing this assert.
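
For anyone triaging a cluster in this state, one way to confirm that the surviving copy sits on an OSD outside the current up/acting set is to compare the PG's peering information with its unfound-object listing. A rough sketch using the standard CLI (pg id taken from this ticket; adjust for your cluster):

# current up and acting sets for the affected PG
ceph pg map 3.1568
# unfound/missing objects for the PG, with any known locations
ceph pg 3.1568 list_missing
# full peering state; the recovery_state section includes might_have_unfound,
# i.e. which OSDs were probed or might still hold a copy
ceph pg 3.1568 query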

#3 - Updated by Samuel Just over 7 years ago

If anyone else hits this, you can work around it by extracting the object from the osd which has it, using mark_unfound_lost delete, and using the rados tool to replace it.
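
In concrete terms that workaround looks roughly like the sketch below. The OSD id, pool name, object name and paths are placeholders, ceph-objectstore-tool must be run with the OSD stopped, and its option syntax varies between releases (check ceph-objectstore-tool --help on your version):

# 1. find the unfound object and which OSD still holds a copy
ceph pg 3.1568 list_missing

# 2. on that OSD: stop the daemon, then export the object's data
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    '<object-name>' get-bytes /tmp/<object-name>.bin
# ...start the OSD daemon again afterwards

# 3. clear the unfound entry, then re-inject the object via librados
ceph pg 3.1568 mark_unfound_lost delete
rados -p <pool-name> put <object-name> /tmp/<object-name>.bin

Note that this sketch only restores the object's data payload; xattrs, omap entries and snapshot clones are not covered.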

#4 - Updated by Samuel Just over 7 years ago

  • Priority changed from High to Immediate

I looked at it more closely. This is kind of weird. Really, missing_loc is what's supposed to be the location-of-record for the ephemeral state related to who has what objects. We do update that in failed_push. The problem is that once it becomes empty, pick_newest_available goes back to using the missing sets directly. I suppose we can just update the missing sets as well in failed_push, but I'm a bit worried about letting the primary's copy of the replica's missing set diverge from the replica's. I guess nothing for it though.

#5 - Updated by Samuel Just over 7 years ago

  • Has duplicate Bug #18365: failed_push does not update missing set added

#6 - Updated by Samuel Just over 7 years ago

  • Status changed from 12 to 7

#7 - Updated by Samuel Just over 7 years ago

  • Assignee set to Samuel Just

#8 - Updated by Sage Weil over 7 years ago

  • Status changed from 7 to Pending Backport
  • Priority changed from Immediate to Urgent
  • Backport set to kraken,jewel

#10 - Updated by Samuel Just over 7 years ago

  • Status changed from Pending Backport to Resolved

#11 - Updated by Nathan Cutler over 7 years ago

  • Status changed from Resolved to Pending Backport

Sam, this issue has "Backport: kraken, jewel" set. Have the backports been done already?

#12 - Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #18567: kraken: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) added

#13 - Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #18568: jewel: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) added

#14 - Updated by Samuel Just about 7 years ago

  • Status changed from Pending Backport to 12

Nope, that fix didn't work. Backfill doesn't put objects into the needs_recovery_map. Reverting.

#15 - Updated by Samuel Just about 7 years ago

  • Assignee changed from Samuel Just to David Zafman

#16 - Updated by David Zafman almost 7 years ago

  • Has duplicate Bug #16259: ceph osd processes crashes when doing a revert on unfound object added

#17 - Updated by David Zafman almost 7 years ago

  • Status changed from 12 to In Progress

#18 - Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Component(RADOS) OSD added

#19 - Updated by Greg Farnum almost 7 years ago

David, anything up with this? Is it an urgent bug?

#21 - Updated by David Zafman almost 7 years ago

  • Status changed from In Progress to Resolved

#22 - Updated by Nathan Cutler about 6 years ago

  • Status changed from Resolved to Pending Backport

This should not have been marked Resolved when one of the backports was still open.

#23 - Updated by David Zafman about 6 years ago

  • Status changed from Pending Backport to Resolved