Bug #10405

closed

osd crashes on osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))

Added by YingHsun Kao over 9 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

version: firefly 0.80.5

An unfound object was reported after osd.118 crashed due to an unrelated error. After restarting osd.118 and trying to execute
ceph pg 2.4f3 mark_unfound_lost revert

osd.118 crashes again with the error osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))

  # zgrep FAIL ceph-osd.118.log*
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log-20141222.gz:osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
    ceph-osd.118.log-20141222.gz:osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
    ceph-osd.118.log-20141222.gz:osd/PG.h: 382: FAILED assert(i->second.need == j->second.need)
    ceph-osd.118.log-20141222.gz:osd/PG.h: 382: FAILED assert(i->second.need == j->second.need)
    ceph-osd.118.log-20141222.gz:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
    ceph-osd.118.log-20141222.gz:osd/ReplicatedPG.cc: 9025: FAILED assert(is_backfill_targets(peer))
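
For context, here is a minimal sketch of what the failing assertion checks -- an illustration, not the actual firefly source. Only pg_shard_t, backfill_targets, is_backfill_targets and peer mirror identifiers visible in the assert and the firefly-era code; ToyPG and push_backfill_object_to are invented scaffolding. The point is that the OSD asserts that the peer it is about to push a recovery object to is still in the PG's set of backfill targets, and that membership test is what fails here.

  // Illustrative sketch only -- not the Ceph source. ToyPG and
  // push_backfill_object_to are invented; pg_shard_t, backfill_targets and
  // is_backfill_targets mirror the identifiers seen in the assert.
  #include <cassert>
  #include <set>

  struct pg_shard_t {
    int osd;
    bool operator<(const pg_shard_t &o) const { return osd < o.osd; }
  };

  struct ToyPG {
    // peers this PG is currently backfilling
    std::set<pg_shard_t> backfill_targets;

    // membership test: is this peer one of our backfill targets?
    bool is_backfill_targets(pg_shard_t peer) const {
      return backfill_targets.count(peer) > 0;
    }

    void push_backfill_object_to(pg_shard_t peer) {
      // The invariant that fires in this report: after the unfound-object
      // revert, recovery tries to push to a peer that the PG no longer
      // considers a backfill target, so this check fails and the OSD aborts.
      assert(is_backfill_targets(peer));
      // ... actual push would happen here ...
    }
  };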

Files

ceph-osd.118.log-20141222.gz (792 KB) YingHsun Kao, 12/21/2014 10:05 PM
Actions #1

Updated by YingHsun Kao over 9 years ago

I tried restarting the OSD while mark_unfound_lost was running; the following message was displayed, but there were still unfound objects after recovery:

$ ceph pg 2.4f3 mark_unfound_lost revert
2014-12-22 11:50:07.981414 7ff5d64f9700 0 -- 10.137.36.30:0/1052769 >> 10.137.36.35:6806/61523 pipe(0x7ff5cc034360 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff5cc0345d0).fault
pg has no unfound objects

Actions #2

Updated by Sage Weil over 9 years ago

  • Category set to OSD
  • Priority changed from Normal to Urgent
Actions #3

Updated by Josh Durgin over 7 years ago

  • Status changed from New to Can't reproduce

Firefly is EOL - if this is happening on hammer please reopen.

Actions #4

Updated by Pawel Sadowski over 7 years ago

I just hit this assert on Hammer 0.94.6. It was reproducible on every mark_unfound_lost revert. Since revert was not working I tried delete, and some other OSD restarted. After removing the second unfound object, yet another OSD crashed and did not start.

Complete timeline of events:
  • cluster was grown by adding 50% more OSDs
  • some problems during rebalance
  • during troubleshooting all OSDs were restarted; two OSDs died with FAILED assert(oi.version == i->first) (#17916)
  • a few hours later another OSD crashed with #17916 symptoms -- killed by the OOM killer
  • cluster almost finished rebalancing except for two unfound objects <scope of this bug begins>
  • tried ceph pg 3.1568 mark_unfound_lost revert -- OSD 194 kept crashing; logs attached (from all three OSDs in this PG: 194, 301, 202)
  • tried ceph pg 3.1568 mark_unfound_lost delete -- OSD 413 crashed (FAILED assert(info.last_complete == info.last_update)) and restarted; it was part of the acting set
  • cluster recovered quickly
  • tried ceph pg 3.3e66 mark_unfound_lost delete -- OSD 206 crashed (FAILED assert(head_obc)) and did not start again (revert vs. delete semantics are sketched below)
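
For reference, here is a toy sketch of the difference between the two modes tried above -- this is not the Ceph implementation; ObjectInfo, LostAction and mark_unfound_lost_toy are invented names. Per the documented semantics, revert rolls an unfound object back to its previous version (or forgets it if no prior version exists), while delete forgets the object entirely.

  // Toy illustration of revert vs. delete, not the Ceph implementation.
  #include <cstdint>
  #include <map>
  #include <string>

  struct ObjectInfo {
    uint64_t version = 0;        // current (lost) version
    uint64_t prior_version = 0;  // last version replicas still have (0 = none)
    bool unfound = false;
  };

  enum class LostAction { Revert, Delete };

  // Apply one action to every unfound object in a toy "PG".
  void mark_unfound_lost_toy(std::map<std::string, ObjectInfo> &pg,
                             LostAction act) {
    for (auto it = pg.begin(); it != pg.end();) {
      if (!it->second.unfound) { ++it; continue; }
      if (act == LostAction::Revert && it->second.prior_version != 0) {
        // revert: roll the object back to the last version we still have
        it->second.version = it->second.prior_version;
        it->second.unfound = false;
        ++it;
      } else {
        // delete (or revert with no prior version): forget the object
        it = pg.erase(it);
      }
    }
  }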

mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3:~ # ceph health detail | grep unfound
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; recovery 11/122963096 objects degraded (0.000%); recovery 9046/122963096 objects misplaced (0.007%); recovery 2/40986191 unfound (0.000%); noscrub,nodeep-scrub flag(s) set; mon.mon-06-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 96841 MB >= 15360 MB; mon.mon-10-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 50985 MB >= 15360 MB; mon.mon-05-ee0664c2-3510-4d97-bd00-4706e316f2a3 store is getting too big! 66341 MB >= 15360 MB
pg 3.3e66 is active+recovering+degraded+remapped, acting [206,146,371], 1 unfound
pg 3.1568 is active+recovering+degraded+remapped, acting [194,301,202], 1 unfound
recovery 2/40986191 unfound (0.000%)

Logs uploaded with ceph-post-file:
  • ceph-post-file: 43bdffdf-f531-4779-81da-dabe429bef16
  • ceph-post-file: e8a59ba9-0c02-437d-899d-ccd3b4edf316
Actions #5

Updated by Piotr Dalek over 7 years ago

  • Status changed from Can't reproduce to 12
  • Release set to hammer
Actions #6

Updated by Samuel Just over 7 years ago

Yeah, this has all been rewritten in Jewel. I may be able to look into this at some point if I have time, but there's some other more urgent stuff at the moment.

Actions #7

Updated by Samuel Just over 7 years ago

  • Status changed from 12 to Can't reproduce

Actually, neither of those crashes is related to this bug; please open a new one.

Actions #8

Updated by Pawel Sadowski over 7 years ago

Created as #18165
