Bug #18165 (closed)

OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

Added by Pawel Sadowski over 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Zafman
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: kraken, jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-12-02 08:50:07.309361 7f10cd434700  0 -- 10.99.186.66:6800/169872 >> 10.99.41.129:0/439883285 pipe(0x4ffdc000 sd=9923 :6800 s=0 pgs=0 cs=0 l=1 c=0x24941340).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:50:44.862346 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:00.769950 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:02.721326 7f108e3b4700  0 -- 10.99.186.66:6800/169872 >> 10.99.43.96:0/1289572096 pipe(0x2cd00000 sd=100 :6800 s=0 pgs=0 cs=0 l=1 c=0x4fb8b600).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:03.289494 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:06.628331 7f10c329a700  0 -- 10.99.186.66:6800/169872 >> 10.99.54.194:0/3692079157 pipe(0x5864f000 sd=5569 :6800 s=0 pgs=0 cs=0 l=1 c=0x1a96eb00).accept replacing existing (lossy) channel (new one lossy=1)
2016-12-02 08:51:17.397357 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:22.214365 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:26.824599 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost
2016-12-02 08:51:38.211395 7f101fc8a700  0 -- 10.99.245.64:6800/169872 >> 10.99.243.116:6800/1154798 pipe(0x38aef000 sd=5409 :44998 s=2 pgs=2481 cs=21 l=0 c=0x30e0ba20).fault with nothing to send, going to standby
2016-12-02 08:51:51.859851 7f114ca20700  0 osd.194 1401127 do_command r=0 
2016-12-02 08:51:51.925621 7f114ca20700 -1 osd/ReplicatedPG.cc: In function 'eversion_t ReplicatedPG::pick_newest_available(const hobject_t&)' thread 7f114ca20700 time 2016-12-02 08:51:51.915215
osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba8b8b]
 2: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x83bc0f]
 3: (ReplicatedPG::mark_all_unfound_lost(int)+0x101d) [0x85f45d]
 4: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xc3c) [0x82c34c]
 5: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x2417) [0x69eed7]
 6: (OSD::CommandWQ::_process(OSD::Command*)+0x59) [0x6fead9]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xb995ee]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xb9a690]
 9: (()+0x8182) [0x7f117f40d182]
 10: (clone()+0x6d) [0x7f117d97830d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
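For context, the backtrace shows the command handler calling ReplicatedPG::mark_all_unfound_lost(), which for a revert walks each unfound object through ReplicatedPG::pick_newest_available(); that is where the assert fires. The toy program below is not Ceph code, only a minimal sketch of the invariant the assert seems to encode (an acting-set peer that is not missing the object is only expected if it is a backfill target) and of how a stale per-peer missing set would violate it. The object name and OSD ids are made up.

// Minimal sketch, NOT the Hammer source: models the invariant behind
// FAILED assert(is_backfill_targets(peer)).  Assumption: during
// "mark_unfound_lost revert" the primary scans each peer's missing set
// looking for the newest available version, and a peer that claims to
// have the object while not being a backfill target is unexpected.
#include <cassert>
#include <map>
#include <set>
#include <string>

int main() {
    const std::string oid = "rbd_data.example_object";   // hypothetical object

    // peer osd id -> objects that peer is known to be missing
    std::map<int, std::set<std::string>> peer_missing;
    std::set<int> backfill_targets;                       // empty: no backfill peers

    peer_missing[301].insert(oid);   // consistent: peer 301 is missing the object
    peer_missing[202];               // stale entry: peer 202 appears to have the
                                     // object (e.g. a failed push never updated
                                     // its missing set, cf. duplicate bug #18365)

    // Roughly the per-peer check performed while picking the newest version
    // still held somewhere:
    for (const auto& [peer, missing] : peer_missing) {
        if (missing.count(oid) == 0) {
            // The peer is not missing the object, so it should only appear
            // here as a backfill target.  Fires for peer 202 in this toy
            // setup, mirroring the crash at osd/ReplicatedPG.cc:8485.
            assert(backfill_targets.count(peer) != 0);
        }
    }
    return 0;
}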

I just hit this assert on Hammer 0.94.6. It was reproducible on every mark_unfound_lost revert. Since revert did not work I tried delete instead, and another OSD restarted. After removing the second unfound object, yet another OSD crashed and did not come back up.
Complete timeline of events:

cluster was grown by 50% more OSDs
some problems during rebalance
during troubleshooting all OSDs were restarted; two OSDs died with FAILED assert(oi.version == i->first) (#17916)
a few hours later another OSD crashed with #17916 symptoms (killed by the OOM killer)
cluster almost finished the rebalance, except for two unfound objects <scope of this bug begins>
tried ceph pg 3.1568 mark_unfound_lost revert: OSD 194 crashes every time; logs attached (from all three OSDs in this PG: 194, 301, 202)
tried ceph pg 3.1568 mark_unfound_lost delete: OSD 413, part of the acting set, crashed (FAILED assert(info.last_complete == info.last_update)) and restarted
cluster recovered quickly
tried ceph pg 3.3e66 mark_unfound_lost delete: OSD 206 crashed (FAILED assert(head_obc)) and did not start again

mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)
Logs uploaded with ceph-post-file:

ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316

Related issues (4: 0 open, 4 closed)

Has duplicate: Ceph - Bug #18365: failed_push does not update missing set (Duplicate, Samuel Just, 12/29/2016)
Has duplicate: Ceph - Bug #16259: ceph osd processes crashes when doing a revert on unfound object (Duplicate, 06/13/2016)
Copied to: Ceph - Backport #18567: kraken: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) (Resolved, Samuel Just)
Copied to: Ceph - Backport #18568: jewel: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) (Rejected, David Zafman)
