Bug #10405
closed
osd crashes on osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
Added by YingHsun Kao over 9 years ago.
Updated over 7 years ago.
Description
version: firefly 0.80.5
An unfound object was reported and osd.118 crashed due to an unrelated error. After restarting osd.118 and executing
ceph pg 2.4f3 mark_unfound_lost revert
osd.118 crashes again with: osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
$ zgrep FAIL ceph-osd.118.log*
ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
ceph-osd.118.log-20141222.gz:osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
ceph-osd.118.log-20141222.gz:osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
ceph-osd.118.log-20141222.gz:osd/PG.h: 382: FAILED assert(i->second.need == j->second.need)
ceph-osd.118.log-20141222.gz:osd/PG.h: 382: FAILED assert(i->second.need == j->second.need)
ceph-osd.118.log-20141222.gz:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
ceph-osd.118.log-20141222.gz:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
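For context on what the ReplicatedPG.cc:9025 assert is enforcing: while mark_unfound_lost walks the PG's peers, every peer in the acting/backfill set that does not list the object as missing is expected to be a backfill target, i.e. a peer that legitimately never had the object yet. The snippet below is not the real Ceph source -- it is a self-contained toy model of that invariant, with made-up peer ids and a std::string standing in for the object id:

// Toy model of the invariant behind ReplicatedPG.cc:9025 (illustrative only).
#include <cassert>
#include <map>
#include <set>
#include <string>

typedef int pg_shard_t; // stand-in for the real shard id type

int main() {
  std::string oid = "unfound_object"; // the object being reverted/deleted
  pg_shard_t primary = 118;
  std::set<pg_shard_t> actingbackfill = {118, 42, 7};
  std::set<pg_shard_t> backfill_targets = {7};
  std::map<pg_shard_t, std::set<std::string> > peer_missing;
  peer_missing[42].insert(oid); // peer 42 knows the object is missing

  for (pg_shard_t peer : actingbackfill) {
    if (peer == primary)
      continue; // the primary handles its own missing set separately
    if (!peer_missing[peer].count(oid)) {
      // A peer that is not missing the object is only acceptable if we are
      // still backfilling it; any other peer in this state means the
      // recovery bookkeeping is inconsistent, and the OSD aborts -- this is
      // the assert the reporter keeps hitting.
      assert(backfill_targets.count(peer));
      continue;
    }
    // ... otherwise the revert/delete is recorded in the peer's missing
    // set and propagated ...
  }
  return 0;
}

The two earlier asserts in the same log are different invariants: PGLog.cc:512 requires that two pg logs being merged have overlapping version ranges (each log's head at or past the other's tail), and PG.h:382 requires that two records of the same missing object agree on which version is needed.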
If the OSD is restarted while mark_unfound_lost is running, the following message is displayed, but there are still unfound objects after recovery:
$ ceph pg 2.4f3 mark_unfound_lost revert
2014-12-22 11:50:07.981414 7ff5d64f9700 0 -- 10.137.36.30:0/1052769 >> 10.137.36.35:6806/61523 pipe(0x7ff5cc034360 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff5cc0345d0).fault
pg has no unfound objects
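One way to cross-check that message is to ask the PG directly which objects it still considers missing, e.g. with list_missing (the pg id below is the reporter's; adjust for your cluster):
$ ceph pg 2.4f3 list_missing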
- Category set to OSD
- Priority changed from Normal to Urgent
- Status changed from New to Can't reproduce
Firefly is EOL - if this is happening on hammer please reopen.
I just hit this assert on Hammer 0.94.6. It was reproducible on every mark_unfound_lost revert. Since revert was not working, I tried delete instead, and some other OSD restarted. After removing the second unfound object, yet another OSD crashed and did not start.
Complete timeline of events:
- cluster grown by 50% more OSDs
- some problems during rebalance
- during troubleshooting all OSDs were restarted; two OSDs died with FAILED assert(oi.version == i->first) (#17916)
- a few hours later another OSD crashed with #17916 symptoms -- killed by the OOM killer
- cluster almost finished rebalance except for two unfound objects <scope of this bug begins>
- tried ceph pg 3.1568 mark_unfound_lost revert -- OSD 194 is crashing, logs attached (from all three OSDs in this PG: 194, 301, 202)
- tried ceph pg 3.1568 mark_unfound_lost delete -- OSD 413 crashed (FAILED assert(info.last_complete <= info.last_update)) and restarted; it was part of the acting set
- cluster recovered quickly
- tried ceph pg 3.3e66 mark_unfound_lost delete -- OSD 206 crashed (FAILED assert(head_obc)) and didn't start again
mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)
Logs uploaded with ceph-post-file:
- ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
- ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316
- Status changed from Can't reproduce to 12
- Release set to hammer
Yeah, this has all been rewritten in Jewel. I may be able to look into this at some point if I have time, but there's some other more urgent stuff at the moment.
- Status changed from 12 to Can't reproduce
Actually, neither of those crashes is related to this bug, please open a new one.