Bug #10405

closed

osd crashes on osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))

Added by YingHsun Kao over 9 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

version: firefly 0.80.5

An unfound object was reported after osd.118 crashed due to an unrelated error. After restarting osd.118 and trying to execute
ceph pg 2.4f3 mark_unfound_lost revert

osd.118 crashes again with the error osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))

  # zgrep FAIL ceph-osd.118.log*
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log-20141222.gz:osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
    ceph-osd.118.log-20141222.gz:osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
    ceph-osd.118.log-20141222.gz:osd/PG.h: 382: FAILED assert(i->second.need == j->second.need)
    ceph-osd.118.log-20141222.gz:osd/PG.h: 382: FAILED assert(i->second.need == j->second.need)
    ceph-osd.118.log-20141222.gz:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log-20141222.gz:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
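
For context, here is a minimal sketch of what the failing assertion checks -- an illustration, not the actual firefly source. Only pg_shard_t, backfill_targets, is_backfill_targets and peer mirror identifiers visible in the assert and the firefly-era code; ToyPG and push_backfill_object_to are invented scaffolding. The point is that the OSD asserts that the peer it is about to push a recovery object to is still in the PG's set of backfill targets, and that membership test is what fails here.

  // Illustrative sketch only -- not the Ceph source. ToyPG and
  // push_backfill_object_to are invented; pg_shard_t, backfill_targets and
  // is_backfill_targets mirror the identifiers seen in the assert.
  #include <cassert>
  #include <set>

  struct pg_shard_t {
    int osd;
    bool operator<(const pg_shard_t &o) const { return osd < o.osd; }
  };

  struct ToyPG {
    // peers this PG is currently backfilling
    std::set<pg_shard_t> backfill_targets;

    // membership test: is this peer one of our backfill targets?
    bool is_backfill_targets(pg_shard_t peer) const {
      return backfill_targets.count(peer) > 0;
    }

    void push_backfill_object_to(pg_shard_t peer) {
      // The invariant that fires in this report: after the unfound-object
      // revert, recovery tries to push to a peer that the PG no longer
      // considers a backfill target, so this check fails and the OSD aborts.
      assert(is_backfill_targets(peer));
      // ... actual push would happen here ...
    }
  };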

Files

ceph-osd.118.log-20141222.gz (792 KB) YingHsun Kao, 12/21/2014 10:05 PM
Actions #1

Updated by YingHsun Kao over 9 years ago

I tried restarting the OSD while mark_unfound_lost was running; the following message was displayed, but there were still unfound objects after recovery:

$ ceph pg 2.4f3 mark_unfound_lost revert
2014-12-22 11:50:07.981414 7ff5d64f9700 0 -- 10.137.36.30:0/1052769 >> 10.137.36.35:6806/61523 pipe(0x7ff5cc034360 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff5cc0345d0).fault
pg has no unfound objects

Actions #2

Updated by Sage Weil over 9 years ago

  • Category set to OSD
  • Priority changed from Normal to Urgent
Actions #3

Updated by Josh Durgin over 7 years ago

  • Status changed from New to Can't reproduce

Firefly is EOL - if this is happening on hammer please reopen.

Actions #4

Updated by Pawel Sadowski over 7 years ago

I just hit this assert on Hammer 0.94.6. It was reproducible on every mark_unfound_lost revert. Since revert was not working I tried delete, and some other OSD restarted. After removing the second unfound object, yet another OSD crashed and did not start.

Complete timeline of events:
  • cluster was grown by adding 50% more OSDs
  • some problems during rebalance
  • during troubleshooting all OSDs were restarted; two OSDs died with FAILED assert(oi.version == i->first) (#17916)
  • a few hours later another OSD crashed with #17916 symptoms -- killed by the OOM killer
  • cluster almost finished rebalancing except for two unfound objects <scope of this bug begins>
  • tried ceph pg 3.1568 mark_unfound_lost revert -- OSD 194 kept crashing; logs attached (from all three OSDs in this PG: 194, 301, 202)
  • tried ceph pg 3.1568 mark_unfound_lost delete -- OSD 413 crashed (FAILED assert(info.last_complete == info.last_update)) and restarted; it was part of the acting set
  • cluster recovered quickly
  • tried ceph pg 3.3e66 mark_unfound_lost delete -- OSD 206 crashed (FAILED assert(head_obc)) and did not start again (revert vs. delete semantics are sketched below)
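
For reference, here is a toy sketch of the difference between the two modes tried above -- this is not the Ceph implementation; ObjectInfo, LostAction and mark_unfound_lost_toy are invented names. Per the documented semantics, revert rolls an unfound object back to its previous version (or forgets it if no prior version exists), while delete forgets the object entirely.

  // Toy illustration of revert vs. delete, not the Ceph implementation.
  #include <cstdint>
  #include <map>
  #include <string>

  struct ObjectInfo {
    uint64_t version = 0;        // current (lost) version
    uint64_t prior_version = 0;  // last version replicas still have (0 = none)
    bool unfound = false;
  };

  enum class LostAction { Revert, Delete };

  // Apply one action to every unfound object in a toy "PG".
  void mark_unfound_lost_toy(std::map<std::string, ObjectInfo> &pg,
                             LostAction act) {
    for (auto it = pg.begin(); it != pg.end();) {
      if (!it->second.unfound) { ++it; continue; }
      if (act == LostAction::Revert && it->second.prior_version != 0) {
        // revert: roll the object back to the last version we still have
        it->second.version = it->second.prior_version;
        it->second.unfound = false;
        ++it;
      } else {
        // delete (or revert with no prior version): forget the object
        it = pg.erase(it);
      }
    }
  }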

mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)

Logs uploaded with ceph-post-file:
  • ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
  • ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316
Actions #5

Updated by Piotr Dalek over 7 years ago

  • Status changed from Can't reproduce to 12
  • Release set to hammer
Actions #6

Updated by Samuel Just over 7 years ago

Yeah, this has all been rewritten in Jewel. I may be able to look into this at some point if I have time, but there's some other more urgent stuff at the moment.

Actions #7

Updated by Samuel Just over 7 years ago

  • Status changed from 12 to Can't reproduce

Actually, neither of those crashes is related to this bug; please open a new one.

Actions #8

Updated by Pawel Sadowski over 7 years ago

Created as #18165
