Bug #18369: osd_recovery_incomplete: failed assert not manager.is_recovered() - Ceph - Ceph

Bug #18369

closed

osd_recovery_incomplete: failed assert not manager.is_recovered()

Added by Sage Weil over 7 years ago. Updated about 7 years ago.

Status: Resolved

Priority: Immediate

Backport: kraken,jewel

Severity: 3 - minor

Description

http://pulpito.ceph.com/sage-2016-12-29_20:50:13-rados-wip-sage-testing---basic-smithi/675453

Related issues 2 (0 open — 2 closed)

Updated by Sage Weil over 7 years ago

It looks like the PGs are all active+remapped, as expected, but this satisfies the ceph_manager get_num_active_recovered() check, which looks like:

        for pg in pgs:
            if (pg['state'].count('active') and
                    not pg['state'].count('recover') and
                    not pg['state'].count('backfill') and
                    not pg['state'].count('stale')):
                num += 1

I think we can simply drop is_recovered; is_clean is sufficient for this test.

There are tons of callers of wait_for_recovery(), though, and it relies on this check. I think they are fine...
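
To make the behaviour of the quoted check concrete, here is a minimal standalone sketch. The function name `count_active_recovered` and the sample PG dicts are assumptions for illustration only; the real helper lives in ceph_manager and reads PG stats from the cluster. Only the condition itself is taken from the snippet above.

```python
def count_active_recovered(pgs):
    """Count PGs that the quoted condition treats as active and recovered.

    Hypothetical standalone helper; `pgs` is a list of dicts that only
    need a 'state' string such as 'active+remapped'.
    """
    num = 0
    for pg in pgs:
        state = pg['state']
        # Same substring tests as the quoted check.
        if (state.count('active') and
                not state.count('recover') and
                not state.count('backfill') and
                not state.count('stale')):
            num += 1
    return num


# Made-up sample states, not taken from the failing run.
sample_pgs = [
    {'state': 'active+clean'},
    {'state': 'active+remapped'},
    {'state': 'active+remapped+backfill_wait'},
    {'state': 'active+recovering'},
]

print(count_active_recovered(sample_pgs))
```

With these sample states, active+remapped is counted while active+remapped+backfill_wait is not, so a PG whose reported state lacks the backfill bit passes the check.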

Updated by Sage Weil over 7 years ago

No, they're not supposed to be active+remapped...

Updated by Sage Weil over 7 years ago

Status changed from New to Fix Under Review

Okay, the problem seems to be just that the PG went into a backfill state but didn't tell the mon. For example, in run

/a/sage-2016-12-29_20:50:13-rados-wip-sage-testing---basic-smithi/675453

pg 0.f did

2016-12-30 00:49:35.607752 7fe5d2c70700 15 osd.2 pg_epoch: 14 pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped] publish_stats_to_osd 14: no change since 2016-12-30 00:49:35.607509
...
2016-12-30 00:49:35.612789 7fe5d3471700 10 osd.2 pg_epoch: 14 pg[0.f( v 11'1324 (11'1300,11'1324] local-les=14 n=1324 ec=1 les/c/f 14/9/0 13/13/5) [0,1]/[2,3] r=0 lpr=13 pi=8-12/2 bft=0,1 crt=11'1324 lcod 11'1323 mlcod 0'0 active+remapped+backfill_wait] queue_recovery -- queuing

but no stat updates that reflect the backfill_wait state bit.
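
The effect of a missed stat update can be sketched with a toy model. Everything here (`ToyPG`, `publish_stats`, `looks_recovered`) is hypothetical and is not the OSD code or the change in the pull request; it only illustrates how a test that reads the last published state can see active+remapped while the PG is actually in backfill_wait.

```python
class ToyPG:
    """Toy model: an observer only sees the state from the last publish."""

    def __init__(self, state):
        self.state = state              # what the PG actually is
        self.published_state = None     # what was last reported upstream

    def publish_stats(self):
        # Copy the current state into the reported view.
        self.published_state = self.state

    def set_state(self, state, publish):
        # Change state; `publish=False` models a transition that is
        # never reported.
        self.state = state
        if publish:
            self.publish_stats()


def looks_recovered(state):
    """Same substring logic as the is_recovered-style check, on one state."""
    return ('active' in state and
            'recover' not in state and
            'backfill' not in state and
            'stale' not in state)


pg = ToyPG('active+remapped')
pg.publish_stats()
pg.set_state('active+remapped+backfill_wait', publish=False)

print(looks_recovered(pg.published_state))  # stale view passes the check
print(looks_recovered(pg.state))            # actual state does not
```

In this model the reported view still says active+remapped, so a check based on it concludes the PG is recovered even though backfill has only been queued.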

https://github.com/ceph/ceph/pull/12727
