Bug #18369

osd_recovery_incomplete: failed assert not manager.is_recovered()

Added by Sage Weil 8 months ago. Updated 4 months ago.

Status: Resolved
Priority: Immediate
Assignee: -
Category: -
Target version: -
Start date: 12/30/2016
Due date:
% Done: 0%
Source:
Tags:
Backport: kraken,jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No


Related issues

Copied to Ceph - Backport #18485: jewel: osd_recovery_incomplete: failed assert not manager.is_recovered() Resolved
Copied to Ceph - Backport #18497: kraken: osd_recovery_incomplete: failed assert not manager.is_recovered() Resolved

History

#1 Updated by Sage Weil 8 months ago

It looks like the PGs are all active+remapped, as expected, but this satisfies the ceph_manager get_num_active_recovered() check, which looks like:

        for pg in pgs:
            if (pg['state'].count('active') and
                    not pg['state'].count('recover') and
                    not pg['state'].count('backfill') and
                    not pg['state'].count('stale')):
                num += 1
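
To see why an active+remapped PG passes, here is a minimal standalone sketch of that loop. The function name count_active_recovered and the sample PG list are made up for illustration; only the condition is copied from the snippet above.

```python
def count_active_recovered(pgs):
    """Standalone copy of the counting loop quoted above.

    The function name and the input format (a list of dicts with a
    'state' string) are assumptions made for this illustration.
    """
    num = 0
    for pg in pgs:
        if (pg['state'].count('active') and
                not pg['state'].count('recover') and
                not pg['state'].count('backfill') and
                not pg['state'].count('stale')):
            num += 1
    return num


# Hypothetical sample data: three PGs in different states.
sample_pgs = [
    {'state': 'active+remapped'},                # counted: no recover/backfill/stale substring
    {'state': 'active+remapped+backfill_wait'},  # not counted: contains 'backfill'
    {'state': 'active+clean'},                   # counted
]

print(count_active_recovered(sample_pgs))
```

With this condition, a PG reported as active+remapped is indistinguishable from a recovered one; only a reported backfill, recover, or stale bit excludes it.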

I think we can simply drop is_recovered; is_clean is sufficient for this test.

There are tons of callers to wait_for_recovery(), though, which relies on this. I think they are fine, though...

#2 Updated by Sage Weil 8 months ago

No, they're not supposed to be active+remapped...

#3 Updated by Sage Weil 8 months ago

  • Status changed from New to Need Review

Okay, the problem seems to just be that the PG went into a backfill state but didn't tell the mon. For example, in run

/a/sage-2016-12-29_20:50:13-rados-wip-sage-testing---basic-smithi/675453

pg 0.f did

2016-12-30 00:49:35.607752 7fe5d2c70700 15 osd.2 pg_epoch: 14 pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped] publish_stats_to_osd 14: no change since 2016-12-30 00:49:35.607509
...
2016-12-30 00:49:35.612789 7fe5d3471700 10 osd.2 pg_epoch: 14 pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped+backfill_wait] queue_recovery -- queuing

but no stat updates that reflect the backfill_wait state bit.
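
As a toy model of the effect (not Ceph code; the class ToyPG, its fields, and looks_recovered are all hypothetical names), the sketch below keeps the PG's local state separate from the last state it published. If the state changes without a publish, anything checking the published view still sees active+remapped and treats the PG as recovered.

```python
class ToyPG:
    """Toy stand-in for a PG; an assumption for illustration, not Ceph code."""

    def __init__(self, state):
        self.state = state      # state known locally on the OSD
        self.published = state  # state last reported upstream

    def set_state(self, state, publish):
        """Change the local state; update the published copy only if asked."""
        self.state = state
        if publish:
            self.published = state


def looks_recovered(state):
    """Same substring test as the ceph_manager check quoted earlier."""
    return ('active' in state and
            'recover' not in state and
            'backfill' not in state and
            'stale' not in state)


pg = ToyPG('active+remapped')
# The state gains backfill_wait, but no stats update is published.
pg.set_state('active+remapped+backfill_wait', publish=False)

print(looks_recovered(pg.published))  # check against the stale published view
print(looks_recovered(pg.state))      # check against the actual local state
```

In this model the published view passes the check while the local state does not, which matches the symptom seen in the test.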

https://github.com/ceph/ceph/pull/12727

#4 Updated by Sage Weil 7 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to kraken,jewel

#5 Updated by Alexey Sheplyakov 7 months ago

  • Copied to Backport #18485: jewel: osd_recovery_incomplete: failed assert not manager.is_recovered() added

#6 Updated by Loic Dachary 7 months ago

  • Copied to Backport #18497: kraken: osd_recovery_incomplete: failed assert not manager.is_recovered() added

#7 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved
