Project

General

Profile

Bug #37439

Degraded PG does not discover remapped data on originating OSD

Added by Jonas Jelten 8 months ago. Updated 3 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
-
Category:
Backfill/Recovery
Target version:
-
Start date:
11/28/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:

Description

There seems to be an issue that an OSD is not queried for missing objects that were remapped, but the OSD for this is up. This happened in two different scenarios for us. In both, data is stored in EC pools (8+3).

Scenario 0

To remove a broken disk (e.g. osd.22), it is weighted to 0 with ceph osd out 22. Objects are remapped normally. During object movement, osd.22 is restarted (or crashes and then starts again). Now the bug shows up: Objects will become degraded and stay degraded, because osd.22 is not queried. ceph pq query shows:

    "might_have_unfound": [
      {
        "osd": "22(3)",
        "status": "not queried" 
      }
    ],

A workaround is to in the broken-disk osd temporarily. The osd is then queried and missing objects are discovered. Then, out the osd again. No objects are degraded any more and disk will be emptied.

Scenario 1

Add new disks to the cluster. Data is remapped to be transferred from the old disks (e.g. osd.19) to new disks (e.g. > osd.42).
When there is a restart an OSD of the old disks (or it restarts because of a crash), objects become degraded. The missing data is on the osd.19 but again it is not queried. ceph pg query shows:

    "might_have_unfound": [
      {
        "osd": "19(6)",
        "status": "not queried" 
      }
    ],

Only remapped data seems to be missing, if osd.19 is taken down, much more data is degraded. Mind that osd.19 is missing in the acting set in the current state of this PG:

    "up": [38, 36, 28, 17, 13, 39, 48, 10, 29, 5, 47],
    "acting": [36, 15, 28, 17, 13, 32, 2147483647, 10, 29, 5, 20],
    "backfill_targets": [
        "36(1)",
        "38(0)",
        "39(5)",
        "47(10)",
        "48(6)" 
    ],
    "acting_recovery_backfill": [
        "5(9)",
        "10(7)",
        "13(4)",
        "15(1)",
        "17(3)",
        "20(10)",
        "28(2)",
        "29(8)",
        "32(5)",
        "36(0)",
        "36(1)",
        "38(0)",
        "39(5)",
        "47(10)",
        "48(6)" 
    ],

For this scenario, I have not found a workaround yet. The cluster remains degraded until it has recovered by restoring the data.

So, overall I suspect there is a bug which prevents remapped pg data to be discovered. The PG already knows which OSD is the correct candidate, but does not query it.

ceph-osd.18.log.xz - log level 20 of OSD 18 (27.4 KB) Jonas Jelten, 12/13/2018 12:59 PM

ceph-osd.38.log.xz - log of primary for pg 6.65 (314 KB) Jonas Jelten, 04/01/2019 10:55 PM


Related issues

Copied to RADOS - Backport #39431: luminous: Degraded PG does not discover remapped data on originating OSD In Progress
Copied to RADOS - Backport #39432: nautilus: Degraded PG does not discover remapped data on originating OSD Resolved
Copied to RADOS - Backport #39433: mimic: Degraded PG does not discover remapped data on originating OSD Resolved

History

#1 Updated by Jonas Jelten 8 months ago

As I can't edit the post...

To clarify: With missing I mean the parts of the erasure coded object so the object becomes degraded.

#2 Updated by Greg Farnum 8 months ago

  • Priority changed from Normal to High

The first scenario definitely looks like an issue; perhaps we are improperly filtering for out rather than down during peering?

If I've read the second one right it looks like there's actual missing data on the down OSD, and if the OSD is down we obviously can't query it, so I think that's expected?

#3 Updated by Jonas Jelten 8 months ago

In the second scenario, the cluster was completely healthy before new disks were added. My guess is that non-remapped PGs are found on that OSD, but remapped ones are not queried. The OSD is up, of course.

#4 Updated by Greg Farnum 8 months ago

See also the ceph-devel mailing list thread "Degraded PG does not discover remapped data on originating OSD".

#5 Updated by Neha Ojha 7 months ago

  • Priority changed from High to Urgent

#6 Updated by Jonas Jelten 7 months ago

Easy steps to reproduce seem to be:

  • Have a healthy cluster
  • ceph osd set pause # make sure no writes mess up the test
  • ceph osd set nobackfill
  • ceph osd set norecover # make sure the error is not recovered but instead stays
  • ceph tell 'osd.*' injectargs '--debug_osd=20/20' # turn up logging
  • ceph osd out $osdid # take out a random osd
  • observe the state, now objects are degraded already, check pg query.
    In my test, I observe that $osdid was "already probed" but it does have the data, the cluster was completely healthy before.
  • ceph osd down $osdid # repeer this osd
  • observe the state again, even more objects are degraded now, check pg query.
    In my test, $osdid is now "not queried"
  • ceph osd in 0 # everything turns back to normal and healthy
  • ceph tell 'osd.*' injectargs '--debug_osd=1/5' # silence logging again

In summary: while preventing recovery, an out osd produces degraded objects. An out and repeered OSD produces even more degraded objects. Taking it in again will discover all missing object copies.

#7 Updated by Jonas Jelten 7 months ago

please please let us edit issues and comments...

I made a mistake in the above post: please ignore the ceph osd set noup line, it doesn't matter.

thanks, now I can edit the posts :)

#8 Updated by Jonas Jelten 7 months ago

Tested on a 5-node cluster with 20 OSDs and 14 3-replica pools.

Here's the log file (level 20) of OSD 18, which is the new primary of PG 1.3a and 3.3.

The following output was taken after repeering OSD 0.
All OSDs are up again, OSD 0 is out (and also up).

OSD_STAT USED    AVAIL   TOTAL   HB_PEERS                                        PG_SUM PRIMARY_PG_SUM
19       1.3 GiB 8.7 GiB  10 GiB                [0,1,2,4,5,6,7,9,10,11,12,15,18]     54             12
18       1.3 GiB 8.7 GiB  10 GiB     [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,17,19]     52             21
17       1.2 GiB 8.8 GiB  10 GiB     [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,18]     55             19
16       1.2 GiB 8.8 GiB  10 GiB      [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17]     56             16
15       1.2 GiB 8.8 GiB  10 GiB                        [1,3,4,5,6,8,9,14,16,19]     53              8
14       1.2 GiB 8.8 GiB  10 GiB   [0,1,2,3,4,5,6,7,8,9,10,11,13,15,16,17,18,19]     53             23
13       1.3 GiB 8.7 GiB  10 GiB     [0,1,2,3,4,5,6,8,9,10,11,12,14,16,17,18,19]     59             21
12       1.3 GiB 8.7 GiB  10 GiB           [0,1,2,3,4,5,7,8,9,10,11,13,16,17,19]     58             17
11       1.3 GiB 8.7 GiB  10 GiB             [2,3,5,6,7,10,12,13,14,15,16,18,19]     51             15
10       1.3 GiB 8.7 GiB  10 GiB  [0,1,2,3,4,5,6,7,9,11,12,13,14,15,16,17,18,19]     52             20
3        1.2 GiB 8.7 GiB  10 GiB [0,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]     59             20
2        1.3 GiB 8.7 GiB  10 GiB          [0,1,3,5,6,7,8,9,11,12,13,14,17,18,19]     57             11
1        1.3 GiB 8.7 GiB  10 GiB   [0,2,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19]     54             23
0        1.3 GiB 8.7 GiB  10 GiB                          [1,2,3,4,5,6,7,8,9,19]      0              0
4        1.3 GiB 8.7 GiB  10 GiB      [0,1,2,3,5,8,9,10,11,13,14,15,16,17,18,19]     56             17
5        1.2 GiB 8.8 GiB  10 GiB [0,1,2,3,4,6,8,9,10,11,12,13,14,15,16,17,18,19]     56             21
6        1.3 GiB 8.7 GiB  10 GiB    [0,1,2,3,5,7,8,9,10,12,13,14,15,16,17,18,19]     51             18
7        1.2 GiB 8.8 GiB  10 GiB     [0,1,2,3,6,8,10,11,12,13,14,15,16,17,18,19]     49             19
8        1.2 GiB 8.7 GiB  10 GiB       [0,1,2,3,4,6,7,9,12,13,14,15,16,17,18,19]     53             22
9        1.2 GiB 8.8 GiB  10 GiB    [1,2,3,4,5,6,7,8,10,12,13,14,15,16,17,18,19]     51             21

  cluster:
    id:     562ea1a0-3f7f-42ec-876c-2fae6a90ea0e
    health: HEALTH_WARN
            pauserd,pausewr,nobackfill,norecover flag(s) set
            29/3912 objects misplaced (0.741%)
            Degraded data redundancy: 239/3912 objects degraded (6.109%), 35 pgs degraded, 3 pgs undersized

  services:
    mon: 3 daemons, quorum vl-srv1,vl-srv2,vl-srv3
    mgr: vl-srv1(active), standbys: vl-srv2, vl-srv3
    mds: lolfs-1/1/1 up  {0=vl-srv1=up:active}, 4 up:standby
    osd: 20 osds: 20 up, 19 in; 4 remapped pgs
         flags pauserd,pausewr,nobackfill,norecover
    rgw: 2 daemons active

  data:
    pools:   14 pools, 344 pgs
    objects: 1.30 k objects, 1.6 GiB
    usage:   25 GiB used, 175 GiB / 200 GiB avail
    pgs:     239/3912 objects degraded (6.109%)
             29/3912 objects misplaced (0.741%)
             307 active+clean
             28  active+recovery_wait+degraded
             4   active+recovering+degraded
             3   active+undersized+degraded+remapped+backfill_wait
             1   active+recovering
             1   active+remapped+backfill_wait

OSDMAP_FLAGS pauserd,pausewr,nobackfill,norecover flag(s) set
OBJECT_MISPLACED 29/3912 objects misplaced (0.741%)
PG_DEGRADED Degraded data redundancy: 239/3912 objects degraded (6.109%), 35 pgs degraded, 3 pgs undersized
    pg 1.2 is active+recovery_wait+degraded, acting [12,17,4]
    pg 1.11 is active+recovery_wait+degraded, acting [4,3,19]
    pg 1.17 is active+recovering+degraded, acting [13,10,19]
    pg 1.1f is active+recovery_wait+degraded, acting [19,2,10]
    pg 1.2f is active+recovery_wait+degraded, acting [14,4,10]
    pg 1.30 is active+recovery_wait+degraded, acting [12,11,3]
    pg 1.33 is active+recovery_wait+degraded, acting [10,17,15]
    pg 1.39 is active+recovery_wait+degraded, acting [5,11,12]
    pg 1.3a is active+recovery_wait+degraded, acting [18,11,13]
    pg 1.3b is active+recovery_wait+degraded, acting [3,13,18]
    pg 1.3c is active+recovery_wait+degraded, acting [19,9,4]
    pg 1.3d is active+recovery_wait+degraded, acting [2,12,7]
    pg 1.50 is active+recovery_wait+degraded, acting [8,13,16]
    pg 1.51 is active+recovery_wait+degraded, acting [4,1,13]
    pg 1.5a is active+recovery_wait+degraded, acting [12,4,19]
    pg 1.5d is active+recovery_wait+degraded, acting [12,2,10]
    pg 1.60 is active+recovery_wait+degraded, acting [16,14,10]
    pg 1.61 is active+recovery_wait+degraded, acting [17,11,7]
    pg 1.6b is active+recovery_wait+degraded, acting [6,15,17]
    pg 1.75 is active+recovery_wait+degraded, acting [10,12,4]
    pg 1.7e is active+recovery_wait+degraded, acting [14,8,17]
    pg 2.1 is active+recovery_wait+degraded, acting [14,9,2]
    pg 2.5 is active+recovery_wait+degraded, acting [8,4,16]
    pg 3.3 is active+recovering+degraded, acting [18,4,11]
    pg 3.6 is active+recovery_wait+degraded, acting [13,19,4]
    pg 5.3 is stuck undersized for 7118.187866, current state active+undersized+degraded+remapped+backfill_wait, last acting [7,17]
    pg 5.5 is stuck undersized for 7118.196629, current state active+undersized+degraded+remapped+backfill_wait, last acting [19,15]
    pg 8.1d is active+recovery_wait+degraded, acting [1,14,11]
    pg 8.2d is active+recovery_wait+degraded, acting [4,3,11]
    pg 9.1b is active+recovery_wait+degraded, acting [13,5,3]
    pg 10.4 is active+recovering+degraded, acting [8,15,18]
    pg 10.5 is active+recovery_wait+degraded, acting [13,5,9]
    pg 11.0 is active+recovery_wait+degraded, acting [1,4,8]
    pg 12.1 is stuck undersized for 7118.182018, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,14]
    pg 14.6 is active+recovering+degraded, acting [12,17,3]

ceph pg 3.3 query | jq -C .recovery_state | less
  {
    "name": "Started/Primary/Active",
    "enter_time": "2018-12-13 12:09:41.053312",
    "might_have_unfound": [
      {
        "osd": "0",
        "status": "not queried" 
      },
      {
        "osd": "4",
        "status": "already probed" 
      },
      {
        "osd": "11",
        "status": "already probed" 
      }
    ],

#9 Updated by Jonas Jelten 4 months ago

More findings, now on Nautilus 14.2.0:

OSD.62 once was part of pg 6.65, but content on it got remapped. A restart of OSD.62 once again results in degraded data.

OSD.38 is the primary of 6.65, below is its log (level 10) when OSD.62 comes back online:

2019-04-01 02:38:54.467 7fb19fd7e700  7 osd.38 36469 handle_fast_pg_notify pg_notify((query:36469 sent:36469 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275) 8->0)=([35479,36444] intervals=([36255,36258] acting 7(3),12(1),17(6),25(4),38(0),39(5),41(7),51(10),62(8),63(9)),([36267,36273] acting 7(3),12(1),17(6),25(4),36(2),39(5),41(7),51(10),62(8),63(9)),([36278,36281] acting 7(3),12(1),17(6),25(4),36(2),38(0),41(7),51(10),62(8),63(9)),([36285,36291] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),51(10),62(8),63(9)),([36340,36341] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),62(8),63(9)),([36441,36443] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),51(10))) epoch 36469) v6 from osd.62
2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] do_peering_event: epoch_sent: 36469 epoch_requested: 36469 MNotifyRec 6.65s0 from 62(8) notify: (query:36469 sent:36469 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275) 8->0) features: 0x3ffddff8ffacffff ([35479,36444] intervals=([36255,36258] acting 7(3),12(1),17(6),25(4),38(0),39(5),41(7),51(10),62(8),63(9)),([36267,36273] acting 7(3),12(1),17(6),25(4),36(2),39(5),41(7),51(10),62(8),63(9)),([36278,36281] acting 7(3),12(1),17(6),25(4),36(2),38(0),41(7),51(10),62(8),63(9)),([36285,36291] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),51(10),62(8),63(9)),([36340,36341] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),62(8),63(9)),([36441,36443] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),51(10))) +create_info
2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] state<Started/Primary/Active>: Active: got notify from 62(8), calling proc_replica_info and discover_all_missing
2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}]  got osd.62(8) 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275)
2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] reg_next_scrub pg 6.65s0 register next scrub, scrub time 2019-04-04 02:38:13.629967, must = 0
2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}]  osd.62(8) has stray content: 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275)
2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] update_heartbeat_peers 7,12,17,21,25,36,38,39,41,51,63 -> 7,12,17,21,25,36,38,39,41,51,62,63
2019-04-01 02:38:54.467 7fb184acb700 10 log is not dirty
2019-04-01 02:38:54.495 7fb19fd7e700  7 osd.38 36469 handle_fast_pg_notify pg_notify((query:36469 sent:36469 6.2cs5( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36462/36463 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) 5->0)=([36462,36468] intervals=([36466,36468] acting 15(2),19(7),38(0),39(1),41(10),44(9),46(8),54(3),55(6),69(4))) epoch 36469) v6 from osd.62
2019-04-01 02:38:54.495 7fb1832c8700 10 osd.38 pg_epoch: 36469 pg[6.2cs0( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36466/36467 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) [38,39,15,54,69,62,55,19,46,44,41]p38(0) r=0 lpr=36469 pi=[36462,36469)/1 crt=36208'738637 lcod 0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 36469 epoch_requested: 36469 MNotifyRec 6.2cs0 from 62(5) notify: (query:36469 sent:36469 6.2cs5( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36462/36463 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) 5->0) features: 0x3ffddff8ffacffff ([36462,36468] intervals=([36466,36468] acting 15(2),19(7),38(0),39(1),41(10),44(9),46(8),54(3),55(6),69(4))) +create_info
2019-04-01 02:38:54.495 7fb1832c8700  7 osd.38 pg_epoch: 36469 pg[6.2cs0( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36466/36467 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) [38,39,15,54,69,62,55,19,46,44,41]p38(0) r=0 lpr=36469 pi=[36462,36469)/1 crt=36208'738637 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary>: handle_pg_notify from osd.62(5)
2019-04-01 02:38:54.495 7fb1832c8700 10 osd.38 pg_epoch: 36469 pg[6.2cs0( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36466/36467 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) [38,39,15,54,69,62,55,19,46,44,41]p38(0) r=0 lpr=36469 pi=[36462,36469)/1 crt=36208'738637 lcod 0'0 mlcod 0'0 peering mbc={}]  got dup osd.62(5) info 6.2cs5( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36462/36463 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274), identical to ours
2019-04-01 02:38:54.495 7fb1832c8700 10 log is not dirty
{
  "state": "active+undersized+degraded+remapped+backfilling",
  "snap_trimq": "[]",
  "snap_trimq_len": 0,
  "epoch": 36476,
  "up": [
    38, 12, 36, 7, 25, 39, 17, 41, 21, 63, 51
  ],
  "acting": [
    38, 12, 36, 7, 25, 39, 17, 41, 2147483647, 63, 51
  ],
  "backfill_targets": [
    "21(8)" 
  ],
  "acting_recovery_backfill": [
    "7(3)",
    "12(1)",
    "17(6)",
    "21(8)",
    "25(4)",
    "36(2)",
    "38(0)",
    "39(5)",
    "41(7)",
    "51(10)",
    "63(9)" 
  ],
  "info": {
...
  }
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2019-04-01 01:47:03.518615",
      "might_have_unfound": [
        {
          "osd": "62(8)",
          "status": "not queried" 
        }
      ],
      "recovery_progress": {
        "backfill_targets": [
          "21(8)" 
        ],
        "waiting_on_backfill": [],
...

We see there that 62 announces itself to the primary (38) and 38 detects 62 got stray data.
The log says calling proc_replica_info and discover_all_missing, meaning we are at PG::RecoveryState::Active::react.
The next log line hints that pg->proc_replica_info is called ( got osd.62(8) 6.65s8... ).
But then reg_next_scrub pg 6.65s0 register next scrub follows, which means pg->discover_all_missing was not called as it would have printed discover_all_missing ... missing ... unfound.

This means that in this function pg->have_unfound() did not return true, even though the PG needs recovery (and 62 has the missing data!). discover_all_missing is somehow not called and thus 62 is not queried for its data, leading to this bug.

#10 Updated by Neha Ojha 4 months ago

  • Status changed from New to Need Review
  • Pull request ID set to 27288

#11 Updated by Neha Ojha 4 months ago

Hi Jonas, thanks for creating a fix for this bug. Could you please upload the latest logs from nautilus, that you have have analyzed above.

#12 Updated by Jonas Jelten 4 months ago

  • File ceph-osd.38.log.xz added
  • Affected Versions v14.2.0 added
  • Affected Versions deleted (v13.2.2)

My proposal to fix this bug is to call discover_all_missing not only if there are missing objects, but also when the PG is degraded.

#13 Updated by Sage Weil 3 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to nautilus, mimic, luminous

#14 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #39431: luminous: Degraded PG does not discover remapped data on originating OSD added

#15 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #39432: nautilus: Degraded PG does not discover remapped data on originating OSD added

#16 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #39433: mimic: Degraded PG does not discover remapped data on originating OSD added

Also available in: Atom PDF