Feature #8141: Nice if we had a state for when a pg can't recover because all missing objects are unfound and we can't make progress - RADOS - Ceph

open

Added by David Zafman about 10 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:
Description

I put a pg into the following state by taking down 2 OSDs at just the right moment: after peering, but before recovery completed. The pg has 30 missing objects, and all 30 are also unfound. It stays in active+recovering even though it isn't actively doing anything, because every recovery attempt gives up when no source for the unfound objects can be located. Maybe it could be marked degraded instead.


    cluster 01520da2-e482-45ec-b6c5-3778658674d1
     health HEALTH_WARN 24 pgs degraded; 1 pgs recovering; 25 pgs stuck unclean; recovery 152/770 objects degraded (19.740%); 30/359 unfound (8.357%)
     monmap e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}, election epoch 36, quorum 0,1,2 a,b,c
     mdsmap e44: 3/3/3 up {0=c=up:active,1=b=up:active,2=a=up:active}
     osdmap e78: 4 osds: 2 up, 2 in
      pgmap v690: 32 pgs, 4 pools, 1220 MB data, 359 objects
            29860 MB used, 8921 MB / 40829 MB avail
            152/770 objects degraded (19.740%); 30/359 unfound (8.357%)
                  24 active+degraded
                   7 active+clean
                   1 active+recovering

3.5     40      30      100     30      163577868       40      40      active+recovering       2014-04-17 14:49:56.903242      40'40   78:322  [1,0]   1       [1,0]   1       0'0     2014-04-17 14:13:16.740899      0'0     2014-04-17 14:13:16.740899

2014-04-17 14:50:06.377415 7f8ac5603700 10 osd.1 pg_epoch: 78 pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 mlcod 0'0 active+recovering m=30 u=30] start_recovery_ops missing_loc: {}
2014-04-17 14:50:06.377451 7f8ac5603700 10 osd.1 pg_epoch: 78 pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 mlcod 0'0 active+recovering m=30 u=30]  still have 30 unfound
2014-04-17 14:50:06.377476 7f8ac5603700 10 osd.1 78 do_recovery started 0/5 on pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 mlcod 0'0 active+recovering m=30 u=30]
2014-04-17 14:50:06.377521 7f8ac5603700 10 osd.1 pg_epoch: 78 pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 mlcod 0'0 active+recovering m=30 u=30] discover_all_missing 30 missing, 30 unfound
2014-04-17 14:50:06.377559 7f8ac5603700 20 osd.1 pg_epoch: 78 pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 mlcod 0'0 active+recovering m=30 u=30] discover_all_missing: osd.0: we already have pg_missing_t
2014-04-17 14:50:06.377584 7f8ac5603700 20 osd.1 pg_epoch: 78 pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 mlcod 0'0 active+recovering m=30 u=30] discover_all_missing skipping down osd.2
2014-04-17 14:50:06.377608 7f8ac5603700 20 osd.1 pg_epoch: 78 pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 mlcod 0'0 active+recovering m=30 u=30] discover_all_missing skipping down osd.3
2014-04-17 14:50:06.377632 7f8ac5603700 10 osd.1 78 do_recovery  no luck, giving up on this pg for now
2014-04-17 14:50:06.377636 7f8ac5603700 10 log is not dirty
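
The condition described in this ticket can be spotted from the log lines above: the pg is in a recovering state, `m=` (missing) is non-zero, and `u=` (unfound) equals it. Below is a minimal Python sketch of such a check. It is not part of Ceph; the regular expression and the helper name `stalled_on_unfound` are hypothetical, and it assumes the pg summary ends with `<state> m=<missing> u=<unfound>]` exactly as in the lines pasted here.

```python
import re

# Assumption: the bracketed pg summary looks like the debug lines in this
# ticket, i.e. "pg[<pgid>( ... <state> m=<missing> u=<unfound>]".
PG_SUMMARY = re.compile(
    r"pg\[(?P<pgid>[0-9a-f]+\.[0-9a-f]+)\(.*?"
    r"(?P<state>[a-z_+]+) m=(?P<missing>\d+) u=(?P<unfound>\d+)\]"
)


def stalled_on_unfound(log_line):
    """Return (pgid, unfound) if the line shows a recovering pg whose
    missing objects are all unfound; otherwise return None."""
    match = PG_SUMMARY.search(log_line)
    if match is None:
        return None
    missing = int(match["missing"])
    unfound = int(match["unfound"])
    recovering = "recovering" in match["state"].split("+")
    if recovering and missing > 0 and missing == unfound:
        return match["pgid"], unfound
    return None


# One of the log lines from this ticket.
sample = (
    "2014-04-17 14:50:06.377521 7f8ac5603700 10 osd.1 pg_epoch: 78 "
    "pg[3.5( v 40'40 lc 40'10 (0'0,40'40] local-les=78 n=40 ec=37 "
    "les/c 78/72 76/76/76) [1,0] r=0 lpr=78 pi=63-75/4 crt=40'40 "
    "mlcod 0'0 active+recovering m=30 u=30] "
    "discover_all_missing 30 missing, 30 unfound"
)

print(stalled_on_unfound(sample))  # ('3.5', 30)
```

A check like this only reads log text after the fact. The request in this ticket is for the pg itself to report a distinct state, so that the situation is visible without scanning debug logs.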