Bug #18369

closed

osd_recovery_incomplete: failed assert not manager.is_recovered()

Added by Sage Weil over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Immediate
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: kraken,jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Related issues: 2 (0 open, 2 closed)

Copied to Ceph - Backport #18485: jewel: osd_recovery_incomplete: failed assert not manager.is_recovered() (Resolved, Alexey Sheplyakov)
Copied to Ceph - Backport #18497: kraken: osd_recovery_incomplete: failed assert not manager.is_recovered() (Resolved, Alexey Sheplyakov)
Actions #1

Updated by Sage Weil over 7 years ago

It looks like the PGs are all active+remapped, as expected, but this satisfies the ceph_manager get_num_active_recovered() check, which looks like this:

        for pg in pgs:
            if (pg['state'].count('active') and
                    not pg['state'].count('recover') and
                    not pg['state'].count('backfill') and
                    not pg['state'].count('stale')):
                num += 1

I think we can simply drop is_recovered; is_clean is sufficient for this test.

There are tons of callers of wait_for_recovery(), though, and it relies on this check. I think they are fine, though...
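
To make the point above concrete, here is a minimal standalone sketch of the quoted check. It is not the actual ceph_manager method: the function is lifted out of its class, and the `pgs` list and its `pgid` values are hypothetical stand-ins for the per-PG stats the test inspects. It only illustrates that a state string of active+remapped passes the substring tests, while one carrying a backfill bit does not.

```python
def get_num_active_recovered(pgs):
    """Count PGs whose state string passes the substring checks quoted above."""
    num = 0
    for pg in pgs:
        if (pg['state'].count('active') and
                not pg['state'].count('recover') and
                not pg['state'].count('backfill') and
                not pg['state'].count('stale')):
            num += 1
    return num


# Hypothetical stats; only the 'state' key is read by the check.
pgs = [
    {'pgid': 'a', 'state': 'active+remapped'},                # counted
    {'pgid': 'b', 'state': 'active+remapped+backfill_wait'},  # not counted
    {'pgid': 'c', 'state': 'active+clean'},                   # counted
]

num_recovered = get_num_active_recovered(pgs)
all_recovered = num_recovered == len(pgs)
```

With these sample states, `all_recovered` is false only because the second entry reports a backfill bit; if every PG reported plain active+remapped, the check would treat the cluster as recovered.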

Actions #2

Updated by Sage Weil over 7 years ago

No, they're not supposed to be active+remapped...

Actions #3

Updated by Sage Weil over 7 years ago

  • Status changed from New to Fix Under Review

Okay, the problem seems to be simply that the PG went into a backfill state but didn't tell the mon. For example, in run

/a/sage-2016-12-29_20:50:13-rados-wip-sage-testing---basic-smithi/675453

pg 0.f did

2016-12-30 00:49:35.607752 7fe5d2c70700 15 osd.2 pg_epoch: 14 pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped] publish_stats_to_osd 14: no change since 2016-12-30 00:49:35.607509
...
2016-12-30 00:49:35.612789 7fe5d3471700 10 osd.2 pg_epoch: 14 pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped+backfill_wait] queue_recovery -- queuing

but there were no stat updates reflecting the backfill_wait state bit.

https://github.com/ceph/ceph/pull/12727
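
As a rough illustration of the gap described in this comment, the sketch below pulls the PG state out of the two log lines quoted above and reports which state bits appear only in the later one. The regular expression and both helper names are assumptions made for this example (the state is taken to be the last token before the closing bracket of the pg[...] summary); this is not a Ceph tool or API.

```python
import re

# The two osd.2 log lines quoted in the comment above, copied verbatim.
LOG_PUBLISH = (
    "2016-12-30 00:49:35.607752 7fe5d2c70700 15 osd.2 pg_epoch: 14 "
    "pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 "
    "les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 "
    "crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped] "
    "publish_stats_to_osd 14: no change since 2016-12-30 00:49:35.607509"
)
LOG_QUEUE = (
    "2016-12-30 00:49:35.612789 7fe5d3471700 10 osd.2 pg_epoch: 14 "
    "pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 "
    "les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 "
    "crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped+backfill_wait] "
    "queue_recovery -- queuing"
)

# Assumption: the state is a '+'-joined run of lowercase words right before "] ".
STATE_RE = re.compile(r"\s([a-z_+]+)\]\s")


def pg_state(line):
    """Return the PG state string found in one log line (hypothetical helper)."""
    match = STATE_RE.search(line)
    if match is None:
        raise ValueError("no PG state found in log line")
    return match.group(1)


def new_state_bits(before, after):
    """Return the state bits present in `after` but missing from `before`."""
    return set(pg_state(after).split('+')) - set(pg_state(before).split('+'))


published = pg_state(LOG_PUBLISH)                  # state at the last stats publish
unreported = new_state_bits(LOG_PUBLISH, LOG_QUEUE)  # bits set after that publish
```

Here `unreported` holds the backfill_wait bit: the state at the last publish_stats_to_osd call was still active+remapped, so anything reading the published stats would not see the backfill.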

Actions #4

Updated by Sage Weil over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to kraken,jewel
Actions #5

Updated by Alexey Sheplyakov over 7 years ago

  • Copied to Backport #18485: jewel: osd_recovery_incomplete: failed assert not manager.is_recovered() added
Actions #6

Updated by Loïc Dachary over 7 years ago

  • Copied to Backport #18497: kraken: osd_recovery_incomplete: failed assert not manager.is_recovered() added
Actions #7

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved