Bug #37439
Degraded PG does not discover remapped data on originating OSD
Description
There seems to be an issue where an OSD is not queried for missing objects that were remapped away from it, even though that OSD is up. This happened in two different scenarios for us. In both, data is stored in EC pools (8+3).
Scenario 0
To remove a broken disk (e.g. osd.22), it is weighted to 0 with ceph osd out 22. Objects are remapped normally. During object movement, osd.22 is restarted (or crashes and then starts again). Now the bug shows up: objects will become degraded and stay degraded, because osd.22 is not queried. ceph pg query shows:
"might_have_unfound": [ { "osd": "22(3)", "status": "not queried" } ],
A workaround is to mark the broken-disk OSD in temporarily (ceph osd in 22). The OSD is then queried and the missing objects are discovered. Then mark the OSD out again: no objects are degraded any more and the disk will be emptied.
Scenario 1
Add new disks to the cluster. Data is remapped to be transferred from the old disks (e.g. osd.19) to the new disks (e.g. osd.42).
When an OSD of the old disks is restarted (or it restarts because of a crash), objects become degraded. The missing data is on osd.19, but again it is not queried. ceph pg query shows:
"might_have_unfound": [ { "osd": "19(6)", "status": "not queried" } ],
Only the remapped data seems to be missing; if osd.19 is taken down, much more data becomes degraded. Note that osd.19 is missing from the acting set in the current state of this PG:
"up": [38, 36, 28, 17, 13, 39, 48, 10, 29, 5, 47], "acting": [36, 15, 28, 17, 13, 32, 2147483647, 10, 29, 5, 20], "backfill_targets": [ "36(1)", "38(0)", "39(5)", "47(10)", "48(6)" ], "acting_recovery_backfill": [ "5(9)", "10(7)", "13(4)", "15(1)", "17(3)", "20(10)", "28(2)", "29(8)", "32(5)", "36(0)", "36(1)", "38(0)", "39(5)", "47(10)", "48(6)" ],
For this scenario, I have not found a workaround yet. The cluster remains degraded until it has recovered by restoring the data.
So, overall I suspect there is a bug which prevents remapped PG data from being discovered. The PG already knows which OSD is the correct candidate, but does not query it.
Related issues
History
#1 Updated by Jonas Jelten about 5 years ago
As I can't edit the post...
To clarify: by "missing" I mean parts of the erasure-coded object, so the object becomes degraded.
#2 Updated by Greg Farnum about 5 years ago
- Priority changed from Normal to High
The first scenario definitely looks like an issue; perhaps we are improperly filtering for out rather than down during peering?
If I've read the second one right it looks like there's actual missing data on the down OSD, and if the OSD is down we obviously can't query it, so I think that's expected?
#3 Updated by Jonas Jelten about 5 years ago
In the second scenario, the cluster was completely healthy before new disks were added. My guess is that non-remapped PGs are found on that OSD, but remapped ones are not queried. The OSD is up, of course.
#4 Updated by Greg Farnum almost 5 years ago
See also the ceph-devel mailing list thread "Degraded PG does not discover remapped data on originating OSD".
#5 Updated by Neha Ojha almost 5 years ago
- Priority changed from High to Urgent
#6 Updated by Jonas Jelten almost 5 years ago
Easy steps to reproduce seem to be:
- Have a healthy cluster
ceph osd set pause # make sure no writes mess up the test
ceph osd set nobackfill
ceph osd set norecover # make sure the error is not recovered but instead stays
ceph tell 'osd.*' injectargs '--debug_osd=20/20' # turn up logging
ceph osd out $osdid # take out a random osd
- observe the state: objects are degraded already, check pg query. In my test, I observe that $osdid was "already probed", but it does have the data; the cluster was completely healthy before.
ceph osd down $osdid # repeer this osd
- observe the state again: even more objects are degraded now, check pg query. In my test, $osdid is now "not queried".
ceph osd in 0 # everything turns back to normal and healthy
ceph tell 'osd.*' injectargs '--debug_osd=1/5' # silence logging again
In summary: while preventing recovery, an out osd produces degraded objects. An out and repeered OSD produces even more degraded objects. Taking it in again will discover all missing object copies.
#7 Updated by Jonas Jelten almost 5 years ago
please please let us edit issues and comments...
I made a mistake in the above post: please ignore the ceph osd set noup line, it doesn't matter.
thanks, now I can edit the posts :)
#8 Updated by Jonas Jelten almost 5 years ago
- File ceph-osd.18.log.xz added
Tested on a 5-node cluster with 20 OSDs and 14 3-replica pools.
Here's the log file (level 20) of OSD 18, which is the new primary of PG 1.3a and 3.3.
The following output was taken after repeering OSD 0.
All OSDs are up again, OSD 0 is out (and also up).
OSD_STAT USED AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM 19 1.3 GiB 8.7 GiB 10 GiB [0,1,2,4,5,6,7,9,10,11,12,15,18] 54 12 18 1.3 GiB 8.7 GiB 10 GiB [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,17,19] 52 21 17 1.2 GiB 8.8 GiB 10 GiB [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,18] 55 19 16 1.2 GiB 8.8 GiB 10 GiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17] 56 16 15 1.2 GiB 8.8 GiB 10 GiB [1,3,4,5,6,8,9,14,16,19] 53 8 14 1.2 GiB 8.8 GiB 10 GiB [0,1,2,3,4,5,6,7,8,9,10,11,13,15,16,17,18,19] 53 23 13 1.3 GiB 8.7 GiB 10 GiB [0,1,2,3,4,5,6,8,9,10,11,12,14,16,17,18,19] 59 21 12 1.3 GiB 8.7 GiB 10 GiB [0,1,2,3,4,5,7,8,9,10,11,13,16,17,19] 58 17 11 1.3 GiB 8.7 GiB 10 GiB [2,3,5,6,7,10,12,13,14,15,16,18,19] 51 15 10 1.3 GiB 8.7 GiB 10 GiB [0,1,2,3,4,5,6,7,9,11,12,13,14,15,16,17,18,19] 52 20 3 1.2 GiB 8.7 GiB 10 GiB [0,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 59 20 2 1.3 GiB 8.7 GiB 10 GiB [0,1,3,5,6,7,8,9,11,12,13,14,17,18,19] 57 11 1 1.3 GiB 8.7 GiB 10 GiB [0,2,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19] 54 23 0 1.3 GiB 8.7 GiB 10 GiB [1,2,3,4,5,6,7,8,9,19] 0 0 4 1.3 GiB 8.7 GiB 10 GiB [0,1,2,3,5,8,9,10,11,13,14,15,16,17,18,19] 56 17 5 1.2 GiB 8.8 GiB 10 GiB [0,1,2,3,4,6,8,9,10,11,12,13,14,15,16,17,18,19] 56 21 6 1.3 GiB 8.7 GiB 10 GiB [0,1,2,3,5,7,8,9,10,12,13,14,15,16,17,18,19] 51 18 7 1.2 GiB 8.8 GiB 10 GiB [0,1,2,3,6,8,10,11,12,13,14,15,16,17,18,19] 49 19 8 1.2 GiB 8.7 GiB 10 GiB [0,1,2,3,4,6,7,9,12,13,14,15,16,17,18,19] 53 22 9 1.2 GiB 8.8 GiB 10 GiB [1,2,3,4,5,6,7,8,10,12,13,14,15,16,17,18,19] 51 21 cluster: id: 562ea1a0-3f7f-42ec-876c-2fae6a90ea0e health: HEALTH_WARN pauserd,pausewr,nobackfill,norecover flag(s) set 29/3912 objects misplaced (0.741%) Degraded data redundancy: 239/3912 objects degraded (6.109%), 35 pgs degraded, 3 pgs undersized services: mon: 3 daemons, quorum vl-srv1,vl-srv2,vl-srv3 mgr: vl-srv1(active), standbys: vl-srv2, vl-srv3 mds: lolfs-1/1/1 up {0=vl-srv1=up:active}, 4 up:standby osd: 20 osds: 20 up, 19 in; 4 remapped pgs flags pauserd,pausewr,nobackfill,norecover rgw: 2 daemons active data: pools: 14 pools, 344 pgs objects: 1.30 k objects, 1.6 GiB usage: 25 GiB used, 175 GiB / 200 GiB avail pgs: 239/3912 objects degraded (6.109%) 29/3912 objects misplaced (0.741%) 307 active+clean 28 active+recovery_wait+degraded 4 active+recovering+degraded 3 active+undersized+degraded+remapped+backfill_wait 1 active+recovering 1 active+remapped+backfill_wait OSDMAP_FLAGS pauserd,pausewr,nobackfill,norecover flag(s) set OBJECT_MISPLACED 29/3912 objects misplaced (0.741%) PG_DEGRADED Degraded data redundancy: 239/3912 objects degraded (6.109%), 35 pgs degraded, 3 pgs undersized pg 1.2 is active+recovery_wait+degraded, acting [12,17,4] pg 1.11 is active+recovery_wait+degraded, acting [4,3,19] pg 1.17 is active+recovering+degraded, acting [13,10,19] pg 1.1f is active+recovery_wait+degraded, acting [19,2,10] pg 1.2f is active+recovery_wait+degraded, acting [14,4,10] pg 1.30 is active+recovery_wait+degraded, acting [12,11,3] pg 1.33 is active+recovery_wait+degraded, acting [10,17,15] pg 1.39 is active+recovery_wait+degraded, acting [5,11,12] pg 1.3a is active+recovery_wait+degraded, acting [18,11,13] pg 1.3b is active+recovery_wait+degraded, acting [3,13,18] pg 1.3c is active+recovery_wait+degraded, acting [19,9,4] pg 1.3d is active+recovery_wait+degraded, acting [2,12,7] pg 1.50 is active+recovery_wait+degraded, acting [8,13,16] pg 1.51 is active+recovery_wait+degraded, acting [4,1,13] pg 1.5a is active+recovery_wait+degraded, acting [12,4,19] pg 1.5d is active+recovery_wait+degraded, acting [12,2,10] 
pg 1.60 is active+recovery_wait+degraded, acting [16,14,10] pg 1.61 is active+recovery_wait+degraded, acting [17,11,7] pg 1.6b is active+recovery_wait+degraded, acting [6,15,17] pg 1.75 is active+recovery_wait+degraded, acting [10,12,4] pg 1.7e is active+recovery_wait+degraded, acting [14,8,17] pg 2.1 is active+recovery_wait+degraded, acting [14,9,2] pg 2.5 is active+recovery_wait+degraded, acting [8,4,16] pg 3.3 is active+recovering+degraded, acting [18,4,11] pg 3.6 is active+recovery_wait+degraded, acting [13,19,4] pg 5.3 is stuck undersized for 7118.187866, current state active+undersized+degraded+remapped+backfill_wait, last acting [7,17] pg 5.5 is stuck undersized for 7118.196629, current state active+undersized+degraded+remapped+backfill_wait, last acting [19,15] pg 8.1d is active+recovery_wait+degraded, acting [1,14,11] pg 8.2d is active+recovery_wait+degraded, acting [4,3,11] pg 9.1b is active+recovery_wait+degraded, acting [13,5,3] pg 10.4 is active+recovering+degraded, acting [8,15,18] pg 10.5 is active+recovery_wait+degraded, acting [13,5,9] pg 11.0 is active+recovery_wait+degraded, acting [1,4,8] pg 12.1 is stuck undersized for 7118.182018, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,14] pg 14.6 is active+recovering+degraded, acting [12,17,3] ceph pg 3.3 query | jq -C .recovery_state | less { "name": "Started/Primary/Active", "enter_time": "2018-12-13 12:09:41.053312", "might_have_unfound": [ { "osd": "0", "status": "not queried" }, { "osd": "4", "status": "already probed" }, { "osd": "11", "status": "already probed" } ],
#9 Updated by Jonas Jelten over 4 years ago
More findings, now on Nautilus 14.2.0:
OSD.62 once was part of pg 6.65, but content on it got remapped. A restart of OSD.62 once again results in degraded data.
OSD.38 is the primary of 6.65, below is its log (level 10) when OSD.62 comes back online:
2019-04-01 02:38:54.467 7fb19fd7e700 7 osd.38 36469 handle_fast_pg_notify pg_notify((query:36469 sent:36469 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275) 8->0)=([35479,36444] intervals=([36255,36258] acting 7(3),12(1),17(6),25(4),38(0),39(5),41(7),51(10),62(8),63(9)),([36267,36273] acting 7(3),12(1),17(6),25(4),36(2),39(5),41(7),51(10),62(8),63(9)),([36278,36281] acting 7(3),12(1),17(6),25(4),36(2),38(0),41(7),51(10),62(8),63(9)),([36285,36291] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),51(10),62(8),63(9)),([36340,36341] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),62(8),63(9)),([36441,36443] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),51(10))) epoch 36469) v6 from osd.62 2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] do_peering_event: epoch_sent: 36469 epoch_requested: 36469 MNotifyRec 6.65s0 from 62(8) notify: (query:36469 sent:36469 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275) 8->0) features: 0x3ffddff8ffacffff ([35479,36444] intervals=([36255,36258] acting 7(3),12(1),17(6),25(4),38(0),39(5),41(7),51(10),62(8),63(9)),([36267,36273] acting 7(3),12(1),17(6),25(4),36(2),39(5),41(7),51(10),62(8),63(9)),([36278,36281] acting 7(3),12(1),17(6),25(4),36(2),38(0),41(7),51(10),62(8),63(9)),([36285,36291] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),51(10),62(8),63(9)),([36340,36341] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),62(8),63(9)),([36441,36443] acting 7(3),12(1),17(6),25(4),36(2),38(0),39(5),41(7),51(10))) +create_info 2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] state<Started/Primary/Active>: Active: got notify from 62(8), calling proc_replica_info and discover_all_missing 2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] got osd.62(8) 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275) 2019-04-01 02:38:54.467 7fb184acb700 10 
osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] reg_next_scrub pg 6.65s0 register next scrub, scrub time 2019-04-04 02:38:13.629967, must = 0 2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] osd.62(8) has stray content: 6.65s8( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36342/36343 n=198964 ec=26179/18934 lis/c 36342/35479 les/c/f 36343/35480/27050 36444/36445/36275) 2019-04-01 02:38:54.467 7fb184acb700 10 osd.38 pg_epoch: 36469 pg[6.65s0( v 36208'736652 (34791'733595,36208'736652] local-lis/les=36445/36446 n=198964 ec=26179/18934 lis/c 36445/35479 les/c/f 36446/35480/27050 36444/36445/36275) [38,12,36,7,25,39,17,41,21,63,51]/[38,12,36,7,25,39,17,41,2147483647,63,51]p38(0) backfill=[21(8)] r=0 lpr=36445 pi=[35479,36445)/6 rops=1 crt=36208'736652 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={},5={},6={},7={},8={},9={},10={}}] update_heartbeat_peers 7,12,17,21,25,36,38,39,41,51,63 -> 7,12,17,21,25,36,38,39,41,51,62,63 2019-04-01 02:38:54.467 7fb184acb700 10 log is not dirty 2019-04-01 02:38:54.495 7fb19fd7e700 7 osd.38 36469 handle_fast_pg_notify pg_notify((query:36469 sent:36469 6.2cs5( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36462/36463 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) 5->0)=([36462,36468] intervals=([36466,36468] acting 15(2),19(7),38(0),39(1),41(10),44(9),46(8),54(3),55(6),69(4))) epoch 36469) v6 from osd.62 2019-04-01 02:38:54.495 7fb1832c8700 10 osd.38 pg_epoch: 36469 pg[6.2cs0( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36466/36467 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) [38,39,15,54,69,62,55,19,46,44,41]p38(0) r=0 lpr=36469 pi=[36462,36469)/1 crt=36208'738637 lcod 0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 36469 epoch_requested: 36469 MNotifyRec 6.2cs0 from 62(5) notify: (query:36469 sent:36469 6.2cs5( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36462/36463 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) 5->0) features: 0x3ffddff8ffacffff ([36462,36468] intervals=([36466,36468] acting 15(2),19(7),38(0),39(1),41(10),44(9),46(8),54(3),55(6),69(4))) +create_info 2019-04-01 02:38:54.495 7fb1832c8700 7 osd.38 pg_epoch: 36469 pg[6.2cs0( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36466/36467 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) [38,39,15,54,69,62,55,19,46,44,41]p38(0) r=0 lpr=36469 pi=[36462,36469)/1 crt=36208'738637 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary>: handle_pg_notify 
from osd.62(5) 2019-04-01 02:38:54.495 7fb1832c8700 10 osd.38 pg_epoch: 36469 pg[6.2cs0( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36466/36467 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274) [38,39,15,54,69,62,55,19,46,44,41]p38(0) r=0 lpr=36469 pi=[36462,36469)/1 crt=36208'738637 lcod 0'0 mlcod 0'0 peering mbc={}] got dup osd.62(5) info 6.2cs5( v 36208'738637 (34791'735590,36208'738637] local-lis/les=36462/36463 n=199469 ec=18977/18934 lis/c 36466/36462 les/c/f 36467/36463/27050 36469/36469/36274), identical to ours 2019-04-01 02:38:54.495 7fb1832c8700 10 log is not dirty
{ "state": "active+undersized+degraded+remapped+backfilling", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 36476, "up": [ 38, 12, 36, 7, 25, 39, 17, 41, 21, 63, 51 ], "acting": [ 38, 12, 36, 7, 25, 39, 17, 41, 2147483647, 63, 51 ], "backfill_targets": [ "21(8)" ], "acting_recovery_backfill": [ "7(3)", "12(1)", "17(6)", "21(8)", "25(4)", "36(2)", "38(0)", "39(5)", "41(7)", "51(10)", "63(9)" ], "info": { ... } "recovery_state": [ { "name": "Started/Primary/Active", "enter_time": "2019-04-01 01:47:03.518615", "might_have_unfound": [ { "osd": "62(8)", "status": "not queried" } ], "recovery_progress": { "backfill_targets": [ "21(8)" ], "waiting_on_backfill": [], ...
We see there that osd.62 announces itself to the primary (osd.38), and 38 detects that 62 has stray content.
The log says "calling proc_replica_info and discover_all_missing", meaning we are at PG::RecoveryState::Active::react.
The next log line hints that pg->proc_replica_info is called ("got osd.62(8) 6.65s8...").
But then "reg_next_scrub pg 6.65s0 register next scrub" follows, which means pg->discover_all_missing was not called, as it would have printed "discover_all_missing ... missing ... unfound".
This means that in this function pg->have_unfound() did not return true, even though the PG needs recovery (and 62 has the missing data!). discover_all_missing is somehow not called and thus 62 is not queried for its data, leading to this bug.
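To make that control flow easier to follow, here is a minimal, self-contained toy model of the behaviour described above. It is not the Ceph source: only the function names quoted from the log (proc_replica_info, discover_all_missing, have_unfound) are real, everything else is a stand-in. The point is simply to show why a PG that is degraded but has no strictly unfound objects never queries the returning OSD.

    // Illustrative model only -- not the actual Ceph code.
    #include <iostream>
    #include <set>
    #include <string>

    struct ModelPG {
      bool has_unfound_objects = false;  // nothing is strictly unfound...
      bool is_degraded = true;           // ...but the PG is degraded
      std::set<std::string> might_have_unfound{"62(8)"};

      bool have_unfound() const { return has_unfound_objects; }

      void proc_replica_info(const std::string& from) {
        std::cout << "got " << from << ", marked as stray\n";
      }

      void discover_all_missing() {
        for (const auto& osd : might_have_unfound)
          std::cout << "querying " << osd << " for missing objects\n";
      }
    };

    // Flow suggested by the log: the notify handler always processes the
    // replica info, but discovery is gated on have_unfound(), so a merely
    // degraded PG never asks osd.62 for the shards it still holds.
    void on_notify(ModelPG& pg, const std::string& from) {
      pg.proc_replica_info(from);
      if (pg.have_unfound())        // false here
        pg.discover_all_missing();  // never reached -> "not queried"
    }

    int main() {
      ModelPG pg;
      on_notify(pg, "62(8)");
    }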
#10 Updated by Neha Ojha over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 27288
#11 Updated by Neha Ojha over 4 years ago
Hi Jonas, thanks for creating a fix for this bug. Could you please upload the latest logs from nautilus that you have analyzed above?
#12 Updated by Jonas Jelten over 4 years ago
- File ceph-osd.38.log.xz added
- Affected Versions v14.2.0 added
- Affected Versions deleted (v13.2.2)
My proposal to fix this bug is to call discover_all_missing not only if there are missing objects, but also when the PG is degraded.
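Expressed against the toy ModelPG sketch above (and not necessarily the exact shape of the actual patch in PR 27288), the proposal amounts to widening the guard so that a degraded PG also triggers discovery:

    // Sketch of the proposed condition change, using the ModelPG stand-in
    // from the earlier sketch; the real change lives in the Ceph peering code.
    void on_notify_fixed(ModelPG& pg, const std::string& from) {
      pg.proc_replica_info(from);
      if (pg.have_unfound() || pg.is_degraded)  // also fire when degraded
        pg.discover_all_missing();              // osd.62(8) is now queried
    }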
#13 Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to nautilus, mimic, luminous
#14 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #39431: luminous: Degraded PG does not discover remapped data on originating OSD added
#15 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #39432: nautilus: Degraded PG does not discover remapped data on originating OSD added
#16 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #39433: mimic: Degraded PG does not discover remapped data on originating OSD added
#17 Updated by Greg Farnum over 4 years ago
- Status changed from Pending Backport to Resolved
#18 Updated by Jonas Jelten over 3 years ago
- Related to Bug #46847: Loss of placement information on OSD reboot added