Bug #18165

OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

Added by Pawel Sadowski 7 months ago. Updated about 12 hours ago.

Status:
In Progress
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
12/07/2016
Due date:
% Done:

0%

Source:
Tags:
Backport:
kraken,jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
hammer
Needs Doc:
No
Component(RADOS):
OSD

Description

2016-12-02 08:50:07.309361 7f10cd434700  0 -- 10.99.186.66:6800/169872 >> 10.99.41.129:0/439883285 pipe(0x4ffdc000 sd=9923 :6800 s=0 pgs=0 cs=0 l=1 c=0x24941340).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:50:44.862346 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:00.769950 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:02.721326 7f108e3b4700  0 -- 10.99.186.66:6800/169872 >> 10.99.43.96:0/1289572096 pipe(0x2cd00000 sd=100 :6800 s=0 pgs=0 cs=0 l=1 c=0x4fb8b600).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:03.289494 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:06.628331 7f10c329a700  0 -- 10.99.186.66:6800/169872 >> 10.99.54.194:0/3692079157 pipe(0x5864f000 sd=5569 :6800 s=0 pgs=0 cs=0 l=1 c=0x1a96eb00).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:17.397357 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:22.214365 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:26.824599 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:38.211395 7f101fc8a700  0 -- 10.99.245.64:6800/169872 >> 10.99.243.116:6800/1154798 pipe(0x38aef000 sd=5409 :44998 s=2 pgs=2481 cs=21 l=0 c=0x30e0ba20).fault with nothing to send, going to standby
2016-12-02 08:51:51.859851 7f114ca20700  0 osd.194 1401127 do_command r=0 
2016-12-02 08:51:51.925621 7f114ca20700 -1 osd/ReplicatedPG.cc: In function 'eversion_t ReplicatedPG::pick_newest_available(const hobject_t&)' thread 7f114ca20700 time 2016-12-02 08:51:51.915215
osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba8b8b]
 2: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x83bc0f]
 3: (ReplicatedPG::mark_all_unfound_lost(int)+0x101d) [0x85f45d]
 4: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xc3c) [0x82c34c]
 5: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x2417) [0x69eed7]
 6: (OSD::CommandWQ::_process(OSD::Command*)+0x59) [0x6fead9]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xb995ee]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xb9a690]
 9: (()+0x8182) [0x7f117f40d182]
 10: (clone()+0x6d) [0x7f117d97830d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I just hit this assert on Hammer, 0.94.6. It was reproducible on every mark_unfound_lost revert. Since revert wasn't working I tried delete instead, and another OSD restarted. After removing the second unfound object, yet another OSD crashed and didn't start again.
Complete timeline of events:

  • cluster grown by 50% (new OSDs added)
  • some problems during rebalance
  • during troubleshooting all OSDs were restarted; two OSDs died with FAILED assert(oi.version == i->first) (#17916)
  • a few hours later another OSD crashed with #17916 symptoms -- killed by the OOM killer
  • cluster almost finished rebalancing, except for two unfound objects <scope of this bug begins>
  • tried ceph pg 3.1568 mark_unfound_lost revert -- OSD 194 is crashing, logs attached (from all three OSDs in this PG: 194, 301, 202)
  • tried ceph pg 3.1568 mark_unfound_lost delete -- OSD 413 crashed (FAILED assert(info.last_complete == info.last_update)) and restarted; it was part of the acting set
  • cluster recovered quickly
  • tried ceph pg 3.3e66 mark_unfound_lost delete -- OSD 206 crashed (FAILED assert(head_obc)) and didn't start again

mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)
Logs uploaded with ceph-post-file:

ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316

Related issues

Duplicated by Ceph - Bug #18365: failed_push does not update missing set Duplicate 12/29/2016
Duplicated by Ceph - Bug #16259: ceph osd processes crashes when doing a revert on unfound object Duplicate 06/13/2016
Copied to Ceph - Backport #18567: kraken: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) Resolved
Copied to Ceph - Backport #18568: jewel: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) New

History

#1 Updated by Samuel Just 6 months ago

  • Status changed from New to Won't Fix

This machinery was rewritten in jewel to make it work properly in this case. The fix can't really be backported to hammer, so I'm marking this one won't fix. Please reopen if reproducible on jewel.

#2 Updated by Samuel Just 6 months ago

  • Status changed from Won't Fix to Verified
  • Priority changed from Normal to High

The bug is still present and fairly straightforward: it's possible that the newest version isn't on an osd in the current up or acting sets. It might be as simple as removing this assert.

#3 Updated by Samuel Just 6 months ago

If anyone else hits this, you can work around it by extracting the object from the osd which has it, using mark_unfound_lost delete, and using the rados tool to replace it.
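The workaround above can be sketched as a three-step command sequence. The sketch below only builds the command lines; the data path, PG id, object name, and pool are hypothetical placeholders, and `ceph-objectstore-tool ... get-bytes` is one way to extract an object from a stopped OSD -- verify the exact syntax against your Ceph version before running anything.

```python
# Hypothetical sketch of the workaround from comment #3: extract the object
# from the OSD that still has it, mark the unfound object lost (delete), then
# re-inject the saved copy with the rados tool. All paths/names below are
# placeholders; running the commands (with the source OSD stopped for the
# extraction step) is left to the operator.

def workaround_commands(osd_data_path, pgid, obj_name, pool, dump_file):
    """Return the three command lines as argv lists, in order."""
    extract = ["ceph-objectstore-tool", "--data-path", osd_data_path,
               "--pgid", pgid, obj_name, "get-bytes", dump_file]
    mark_lost = ["ceph", "pg", pgid, "mark_unfound_lost", "delete"]
    reinject = ["rados", "-p", pool, "put", obj_name, dump_file]
    return [extract, mark_lost, reinject]

for cmd in workaround_commands("/var/lib/ceph/osd/ceph-194", "3.1568",
                               "rbd_data.deadbeef", "rbd", "/tmp/obj.bin"):
    print(" ".join(cmd))
```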

#4 Updated by Samuel Just 6 months ago

  • Priority changed from High to Immediate

I looked at it more closely. This is kind of weird. Really, missing_loc is what's supposed to be the location-of-record for the ephemeral state about who has which objects, and we do update it in failed_push. The problem is that once it becomes empty, pick_newest_available goes back to using the missing sets directly. I suppose we can just update the missing sets as well in failed_push, but I'm a bit worried about letting the primary's copy of the replica's missing set diverge from the replica's own. I guess there's nothing for it, though.
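The fallback described above can be modeled in a few lines. This is a toy Python sketch, not the actual C++ in osd/ReplicatedPG.cc; the names (`missing_loc`, `peer_have`, `backfill_targets`) are illustrative simplifications. It shows why emptying missing_loc is fatal: the fallback path assumes every peer it consults is a backfill target, and a peer outside the up/acting sets that still holds the newest version trips exactly the assert in this report.

```python
# Toy model of pick_newest_available's two paths, per comment #4. Hypothetical
# data shapes: missing_loc maps object -> {osd: version} (preferred source);
# peer_have maps osd -> {object: version} standing in for the per-peer missing
# sets the real code falls back to.

def pick_newest_available(obj, missing_loc, peer_have, backfill_targets):
    locs = missing_loc.get(obj, {})
    if locs:
        # Preferred path: missing_loc records which OSDs are believed to
        # have obj, and at what version.
        return max(locs.values())
    # Fallback path: walk the primary's view of each peer directly.
    newest = None
    for peer, objs in peer_have.items():
        if obj not in objs:
            continue
        # The hammer code assumed any such peer must be a backfill target:
        assert peer in backfill_targets, \
            "FAILED assert(is_backfill_targets(peer))"  # the crash in this bug
        if newest is None or objs[obj] > newest:
            newest = objs[obj]
    return newest
```

With missing_loc populated the assert is never reached; once failed pushes have emptied it, a stray peer (e.g. one holding the object at the newest version but not in the acting set or backfill targets) raises the AssertionError, mirroring the OSD abort.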

#5 Updated by Samuel Just 6 months ago

  • Duplicated by Bug #18365: failed_push does not update missing set added

#6 Updated by Samuel Just 6 months ago

  • Status changed from Verified to Testing

#7 Updated by Samuel Just 6 months ago

  • Assignee set to Samuel Just

#8 Updated by Sage Weil 6 months ago

  • Status changed from Testing to Pending Backport
  • Priority changed from Immediate to Urgent
  • Backport set to kraken,jewel

#10 Updated by Samuel Just 6 months ago

  • Status changed from Pending Backport to Resolved

#11 Updated by Nathan Cutler 6 months ago

  • Status changed from Resolved to Pending Backport

Sam, this issue has "Backport: kraken, jewel" set. Have the backports been done already?

#12 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #18567: kraken: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) added

#13 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #18568: jewel: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) added

#14 Updated by Samuel Just 5 months ago

  • Status changed from Pending Backport to Verified

Nope, that fix didn't work. Backfill doesn't put objects into the needs_recovery_map. Reverting.

#15 Updated by Samuel Just 5 months ago

  • Assignee changed from Samuel Just to David Zafman

#16 Updated by David Zafman about 1 month ago

  • Duplicated by Bug #16259: ceph osd processes crashes when doing a revert on unfound object added

#17 Updated by David Zafman about 1 month ago

  • Status changed from Verified to In Progress

#18 Updated by Greg Farnum 15 days ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Component(RADOS) OSD added

#19 Updated by Greg Farnum about 12 hours ago

David, anything up with this? Is it an urgent bug?
