Bug #18165

OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

Added by Pawel Sadowski 7 months ago. Updated about 12 hours ago.

Status:
In Progress
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
12/07/2016
Due date:
% Done:

0%

Source:
Tags:
Backport:
kraken,jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
hammer
Needs Doc:
No
Component(RADOS):
OSD

Description

2016-12-02 08:50:07.309361 7f10cd434700  0 -- 10.99.186.66:6800/169872 >> 10.99.41.129:0/439883285 pipe(0x4ffdc000 sd=9923 :6800 s=0 pgs=0 cs=0 l=1 c=0x24941340).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:50:44.862346 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:00.769950 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:02.721326 7f108e3b4700  0 -- 10.99.186.66:6800/169872 >> 10.99.43.96:0/1289572096 pipe(0x2cd00000 sd=100 :6800 s=0 pgs=0 cs=0 l=1 c=0x4fb8b600).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:03.289494 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:06.628331 7f10c329a700  0 -- 10.99.186.66:6800/169872 >> 10.99.54.194:0/3692079157 pipe(0x5864f000 sd=5569 :6800 s=0 pgs=0 cs=0 l=1 c=0x1a96eb00).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:17.397357 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:22.214365 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:26.824599 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:38.211395 7f101fc8a700  0 -- 10.99.245.64:6800/169872 >> 10.99.243.116:6800/1154798 pipe(0x38aef000 sd=5409 :44998 s=2 pgs=2481 cs=21 l=0 c=0x30e0ba20).fault with nothing to send, going to standby
2016-12-02 08:51:51.859851 7f114ca20700  0 osd.194 1401127 do_command r=0 
2016-12-02 08:51:51.925621 7f114ca20700 -1 osd/ReplicatedPG.cc: In function 'eversion_t ReplicatedPG::pick_newest_available(const hobject_t&)' thread 7f114ca20700 time 2016-12-02 08:51:51.915215
osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba8b8b]
 2: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x83bc0f]
 3: (ReplicatedPG::mark_all_unfound_lost(int)+0x101d) [0x85f45d]
 4: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xc3c) [0x82c34c]
 5: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x2417) [0x69eed7]
 6: (OSD::CommandWQ::_process(OSD::Command*)+0x59) [0x6fead9]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xb995ee]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xb9a690]
 9: (()+0x8182) [0x7f117f40d182]
 10: (clone()+0x6d) [0x7f117d97830d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I just hit this assert on Hammer, 0.94.6. It was reproducible on every mark_unfound_lost revert. Since revert wasn't working I tried delete instead, and another OSD restarted. After removing the second unfound object, yet another OSD crashed and didn't start again.
Complete timeline of events:

  • cluster grown by 50% (new OSDs added)
  • some problems during rebalance
  • during troubleshooting all OSDs were restarted; two OSDs died with FAILED assert(oi.version == i->first) (#17916)
  • a few hours later another OSD crashed with #17916 symptoms -- killed by the OOM killer
  • cluster almost finished rebalancing, except for two unfound objects <scope of this bug begins>
  • tried ceph pg 3.1568 mark_unfound_lost revert -- OSD 194 is crashing, logs attached (from all three OSDs in this PG: 194, 301, 202)
  • tried ceph pg 3.1568 mark_unfound_lost delete -- OSD 413 crashed (FAILED assert(info.last_complete == info.last_update)) and restarted; it was part of the acting set
  • cluster recovered quickly
  • tried ceph pg 3.3e66 mark_unfound_lost delete -- OSD 206 crashed (FAILED assert(head_obc)) and didn't start again

mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)
Logs uploaded with ceph-post-file:

ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316

Related issues

Duplicated by Ceph - Bug #18365: failed_push does not update missing set Duplicate 12/29/2016
Duplicated by Ceph - Bug #16259: ceph osd processes crashes when doing a revert on unfound object Duplicate 06/13/2016
Copied to Ceph - Backport #18567: kraken: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) Resolved
Copied to Ceph - Backport #18568: jewel: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) New

History

#1 Updated by Samuel Just 6 months ago

  • Status changed from New to Won't Fix

This machinery was rewritten in jewel to make it work properly in this case. The fix can't really be backported to hammer, so I'm marking this one won't fix. Please reopen if reproducible on jewel.

#2 Updated by Samuel Just 6 months ago

  • Status changed from Won't Fix to Verified
  • Priority changed from Normal to High

The bug is still present and fairly straightforward: it's possible that the newest version isn't on an osd in the current up or acting sets. It might be as simple as removing this assert.

#3 Updated by Samuel Just 6 months ago

If anyone else hits this, you can work around it by extracting the object from the osd which has it, using mark_unfound_lost delete, and using the rados tool to replace it.
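The workaround above can be sketched as a three-step command sequence. The sketch below only builds the command lines; the data path, PG id, object name, and pool are hypothetical placeholders, and `ceph-objectstore-tool ... get-bytes` is one way to extract an object from a stopped OSD -- verify the exact syntax against your Ceph version before running anything.

```python
# Hypothetical sketch of the workaround from comment #3: extract the object
# from the OSD that still has it, mark the unfound object lost (delete), then
# re-inject the saved copy with the rados tool. All paths/names below are
# placeholders; running the commands (with the source OSD stopped for the
# extraction step) is left to the operator.

def workaround_commands(osd_data_path, pgid, obj_name, pool, dump_file):
    """Return the three command lines as argv lists, in order."""
    extract = ["ceph-objectstore-tool", "--data-path", osd_data_path,
               "--pgid", pgid, obj_name, "get-bytes", dump_file]
    mark_lost = ["ceph", "pg", pgid, "mark_unfound_lost", "delete"]
    reinject = ["rados", "-p", pool, "put", obj_name, dump_file]
    return [extract, mark_lost, reinject]

for cmd in workaround_commands("/var/lib/ceph/osd/ceph-194", "3.1568",
                               "rbd_data.deadbeef", "rbd", "/tmp/obj.bin"):
    print(" ".join(cmd))
```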

#4 Updated by Samuel Just 6 months ago

  • Priority changed from High to Immediate

I looked at it more closely. This is kind of weird. Really, missing_loc is what's supposed to be the location-of-record for the ephemeral state about who has which objects, and we do update it in failed_push. The problem is that once it becomes empty, pick_newest_available goes back to using the missing sets directly. I suppose we can just update the missing sets as well in failed_push, but I'm a bit worried about letting the primary's copy of the replica's missing set diverge from the replica's own. I guess there's nothing for it, though.
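The fallback described above can be modeled in a few lines. This is a toy Python sketch, not the actual C++ in osd/ReplicatedPG.cc; the names (`missing_loc`, `peer_have`, `backfill_targets`) are illustrative simplifications. It shows why emptying missing_loc is fatal: the fallback path assumes every peer it consults is a backfill target, and a peer outside the up/acting sets that still holds the newest version trips exactly the assert in this report.

```python
# Toy model of pick_newest_available's two paths, per comment #4. Hypothetical
# data shapes: missing_loc maps object -> {osd: version} (preferred source);
# peer_have maps osd -> {object: version} standing in for the per-peer missing
# sets the real code falls back to.

def pick_newest_available(obj, missing_loc, peer_have, backfill_targets):
    locs = missing_loc.get(obj, {})
    if locs:
        # Preferred path: missing_loc records which OSDs are believed to
        # have obj, and at what version.
        return max(locs.values())
    # Fallback path: walk the primary's view of each peer directly.
    newest = None
    for peer, objs in peer_have.items():
        if obj not in objs:
            continue
        # The hammer code assumed any such peer must be a backfill target:
        assert peer in backfill_targets, \
            "FAILED assert(is_backfill_targets(peer))"  # the crash in this bug
        if newest is None or objs[obj] > newest:
            newest = objs[obj]
    return newest
```

With missing_loc populated the assert is never reached; once failed pushes have emptied it, a stray peer (e.g. one holding the object at the newest version but not in the acting set or backfill targets) raises the AssertionError, mirroring the OSD abort.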

#5 Updated by Samuel Just 6 months ago

  • Duplicated by Bug #18365: failed_push does not update missing set added

#6 Updated by Samuel Just 6 months ago

  • Status changed from Verified to Testing

#7 Updated by Samuel Just 6 months ago

  • Assignee set to Samuel Just

#8 Updated by Sage Weil 6 months ago

  • Status changed from Testing to Pending Backport
  • Priority changed from Immediate to Urgent
  • Backport set to kraken,jewel

#10 Updated by Samuel Just 6 months ago

  • Status changed from Pending Backport to Resolved

#11 Updated by Nathan Cutler 6 months ago

  • Status changed from Resolved to Pending Backport

Sam, this issue has "Backport: kraken, jewel" set. Have the backports been done already?

#12 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #18567: kraken: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) added

#13 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #18568: jewel: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) added

#14 Updated by Samuel Just 5 months ago

  • Status changed from Pending Backport to Verified

Nope, that fix didn't work. Backfill doesn't put objects into the needs_recovery_map. Reverting.

#15 Updated by Samuel Just 5 months ago

  • Assignee changed from Samuel Just to David Zafman

#16 Updated by David Zafman about 1 month ago

  • Duplicated by Bug #16259: ceph osd processes crashes when doing a revert on unfound object added

#17 Updated by David Zafman about 1 month ago

  • Status changed from Verified to In Progress

#18 Updated by Greg Farnum 15 days ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Component(RADOS) OSD added

#19 Updated by Greg Farnum about 12 hours ago

David, anything up with this? Is it an urgent bug?
