Bug #18165
closedOSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))
0%
Description
2016-12-02 08:50:07.309361 7f10cd434700 0 -- 10.99.186.66:6800/169872 >> 10.99.41.129:0/439883285 pipe(0x4ffdc000 sd=9923 :6800 s=0 pgs=0 cs=0 l=1 c=0x24941340).accept replacing existing (lossy) channel (new one lossy=1) 2016-12-02 08:50:44.862346 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost 2016-12-02 08:51:00.769950 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost 2016-12-02 08:51:02.721326 7f108e3b4700 0 -- 10.99.186.66:6800/169872 >> 10.99.43.96:0/1289572096 pipe(0x2cd00000 sd=100 :6800 s=0 pgs=0 cs=0 l=1 c=0x4fb8b600).accept replacing existing (lossy) channel (new one lossy=1) 2016-12-02 08:51:03.289494 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost 2016-12-02 08:51:06.628331 7f10c329a700 0 -- 10.99.186.66:6800/169872 >> 10.99.54.194:0/3692079157 pipe(0x5864f000 sd=5569 :6800 s=0 pgs=0 cs=0 l=1 c=0x1a96eb00).accept replacing existing (lossy) channel (new one lossy=1) 2016-12-02 08:51:17.397357 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost 2016-12-02 08:51:22.214365 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost 2016-12-02 08:51:26.824599 7f115322d700 -1 log_channel(cluster) log [ERR] : 3.1568 has 1 objects unfound and apparently lost 2016-12-02 08:51:38.211395 7f101fc8a700 0 -- 10.99.245.64:6800/169872 >> 10.99.243.116:6800/1154798 pipe(0x38aef000 sd=5409 :44998 s=2 pgs=2481 cs=21 l=0 c=0x30e0ba20).fault with nothing to send, going to standby 2016-12-02 08:51:51.859851 7f114ca20700 0 osd.194 1401127 do_command r=0 2016-12-02 08:51:51.925621 7f114ca20700 -1 osd/ReplicatedPG.cc: In function 'eversion_t ReplicatedPG::pick_newest_available(const hobject_t&)' thread 7f114ca20700 time 2016-12-02 08:51:51.915215 osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer)) ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba8b8b] 2: (ReplicatedPG::pick_newest_available(hobject_t const&)+0x4bf) [0x83bc0f] 3: (ReplicatedPG::mark_all_unfound_lost(int)+0x101d) [0x85f45d] 4: (ReplicatedPG::do_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, std::ostream&, ceph::buffer::list&, ceph::buffer::list&)+0xc3c) [0x82c34c] 5: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x2417) [0x69eed7] 6: (OSD::CommandWQ::_process(OSD::Command*)+0x59) [0x6fead9] 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xb995ee] 8: (ThreadPool::WorkThread::entry()+0x10) [0xb9a690] 9: (()+0x8182) [0x7f117f40d182] 10: (clone()+0x6d) [0x7f117d97830d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I just had this assert on Hammer, 0.94.6. It was reproducible on every mark_unfound_lost revert. Due to revert not working I tried to delete and some other OSD restarted. After removing second unfound another one OSD crashed and didn't start.
Complete timeline of events:
cluster grown by 50% of OSDs
some problems during rebalance
during troubleshooting all OSDs were restarted, two OSD died with FAILED assert(oi.version i->first)' #17916
few hours later another OSD crashed with #17916 symptoms -- killed by OOM killer
cluster almost finished rebalance except two unfound objects <scope of this bug begins>
tried ceph pg 3.1568 mark_unfound_lost revert -- OSD 194 is crashing, logs attached (from all three in this PG: 194,301,202)
tried ceph pg 3.1568 mark_unfound_lost delete -- OSD 413 crashed (FAILED assert(info.last_complete info.last_update)) and restarted - it was part of acting set
cluster recovered quickly
tried ceph pg 3.3e66 mark_unfound_lost delete -- OSD 206 crashed (FAILED assert(head_obc)) and didn't start again
mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)
Logs uploaded with ceph-post-file:
ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316