Bug #51847
openA PG in "incomplete" state may end up in a backfill loop.
0%
Description
This was one of the issued observed in:
/a/nojha-2021-07-22_18:00:50-rados-wip-sseshasa-testing-2021-07-14-1320-distro-basic-smithi/6287310
One of the rados API tests did not complete and the test failed due to a timeout in waiting for recovery to complete.
2021-07-22T19:41:18.285 INFO:tasks.ceph.ceph_manager.ceph:PG 7.2 is not active+clean 2021-07-22T19:41:18.285 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '7.2', 'version': "0'0", 'reported_seq': 2181, 'reported_epoch': 2405, 'state': 'active+clean+scrubbing', 'last_fresh': '2021-07-22T19:41:15.062512+0000', 'last_change': '2021-07-22T19:41:15.062512+0000', 'last_active': '2021-07-22T19:41:15.062512+0000', 'last_peered': '2021-07-22T19:41:15.062512+0000', 'last_clean': '2021-07-22T19:41:15.062512+0000', 'last_became_active': '2021-07-22T18:39:48.880762+0000', 'last_became_peered': '2021-07-22T18:39:48.880762+0000', 'last_unstale': '2021-07-22T19:41:15.062512+0000', 'last_undegraded': '2021-07-22T19:41:15.062512+0000', 'last_fullsized': '2021-07-22T19:41:15.062512+0000', 'mapping_epoch': 54, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 54, 'last_epoch_clean': 57, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2021-07-22T18:39:46.618079+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2021-07-22T18:39:46.618079+0000', 'last_clean_scrub_stamp': '2021-07-22T18:39:46.618079+0000', 'log_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [4, 6, 7], 'acting': [4, 6, 7], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 4, 'acting_primary': 4, 'purged_snaps': []} 2021-07-22T19:41:18.285 INFO:tasks.ceph.ceph_manager.ceph:PG 1.3 is not active+clean 2021-07-22T19:41:18.286 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '1.3', 'version': "20'117", 'reported_seq': 2467, 'reported_epoch': 2405, 'state': 'active+remapped+backfilling', 'last_fresh': '2021-07-22T19:41:12.951218+0000', 'last_change': '2021-07-22T18:56:16.734244+0000', 'last_active': '2021-07-22T19:41:12.951218+0000', 'last_peered': '2021-07-22T19:41:12.951218+0000', 'last_clean': '2021-07-22T18:40:15.901391+0000', 'last_became_active': '2021-07-22T18:40:21.271111+0000', 'last_became_peered': '2021-07-22T18:40:21.271111+0000', 'last_unstale': '2021-07-22T19:41:12.951218+0000', 'last_undegraded': '2021-07-22T19:41:12.951218+0000', 'last_fullsized': '2021-07-22T19:41:12.951218+0000', 'mapping_epoch': 85, 'log_start': "19'114", 'ondisk_log_start': "19'114", 'created': 30, 'last_epoch_clean': 66, 'parent': '0.0', 'parent_split_bits': 3, 'last_scrub': "20'117", 'last_scrub_stamp': '2021-07-22T18:39:44.980942+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2021-07-22T18:38:46.269805+0000', 'last_clean_scrub_stamp': '2021-07-22T18:39:44.980942+0000', 'log_size': 3, 'ondisk_log_size': 3, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [4, 1, 2], 'acting': [4, 2, 7], 'avail_no_missing': ['4', '1', '2', '7'], 'object_location_counts': [], 'blocked_by': [], 'up_primary': 4, 'acting_primary': 4, 'purged_snaps': []} 2021-07-22T19:41:18.286 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_ceph-c_f81345ca72228f2b22752a42ef3d4cfb048530bb/qa/tasks/ceph_manager.py", line 188, in wrapper return func(self) File "/home/teuthworker/src/git.ceph.com_ceph-c_f81345ca72228f2b22752a42ef3d4cfb048530bb/qa/tasks/ceph_manager.py", line 1404, in _do_thrash timeout=self.config.get('timeout') File "/home/teuthworker/src/git.ceph.com_ceph-c_f81345ca72228f2b22752a42ef3d4cfb048530bb/qa/tasks/ceph_manager.py", line 2804, in wait_for_recovery 'wait_for_recovery: failed before timeout expired' AssertionError: wait_for_recovery: failed before timeout expired
The logs below shows the sequence from when the concerned PG (pg1.3) goes into the "backfill_wait" state and then into "backfilling" on osd.4. osd.1 was the backfill target:
2021-07-22T18:56:16.432+0000 7f1f80833700 10 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] state<Started/Primary/Active/WaitRemoteBackfillReserved>: react num_bytes 0 o2021-07-22T18:56:16.432+0000 7f1f80833700 20 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] get_backfill_priority backfill priority is 110 o2021-07-22T18:56:16.492+0000 7f1f80833700 10 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] do_peering_event: epoch_sent: 866 epoch_requested: 866 DeferBackfill: delay 0 o2021-07-22T18:56:16.733+0000 7f1f80833700 10 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] do_peering_event: epoch_sent: 866 epoch_requested: 866 RemoteBackfillReserved o2021-07-22T18:56:16.733+0000 7f1f80833700 10 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] state<Started/Primary/Active/WaitRemoteBackfillReserved>: react num_bytes 0 o2021-07-22T18:56:16.733+0000 7f1f80833700 5 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] exit Started/Primary/Active/WaitRemoteBackfillReserved 0.300198 2 0.000083 o2021-07-22T18:56:16.733+0000 7f1f80833700 5 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] enter Started/Primary/Active/Backfilling o2021-07-22T18:56:16.733+0000 7f1f80833700 10 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfill_wait mbc={}] queue_recovery -- already queued o2021-07-22T18:56:16.733+0000 7f1f80833700 20 osd.4 pg_epoch: 866 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=85/86 n=0 ec=30/14 lis/c=85/62 les/c/f=86/66/0 sis=85) [4,1,2]/[4,2,7] backfill=[1] r=0 lpr=85 pi=[62,85)/1 crt=20'117 lcod 0'0 mlcod 0'0 active+remapped+backfilling mbc={}] update_calc_stats actingset 2,4,7 upset 1,2,4 acting_recovery_backfill 1,2,4,7
On osd.1 which is the backfill target, the pg 1.3 is in incomplete state indefinitely after processing a delete
a few minutes before the backfill was initiated:
2021-07-22T18:39:21.844+0000 7f94a77af700 1 osd.1 pg_epoch: 36 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=14.941528320s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active pruub 54.482391357s@ mbc={}] start_peering_interval up [5,2,7] -> [5,2,7], acting [5,2,1] -> [5,2,7], acting_primary 5 -> 5, up_primary 5 -> 5, role 2 -> -1, features acting 4540138303579357183 upacting 4540138303579357183 2021-07-22T18:39:21.844+0000 7f94a77af700 10 osd.1 pg_epoch: 36 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=14.941528320s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown pruub 54.482391357s@ mbc={}] clear_primary_state 2021-07-22T18:39:21.844+0000 7f94a77af700 20 osd.1 pg_epoch: 36 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=14.941528320s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown pruub 54.482391357s@ mbc={}] clear_primary_state ... 2021-07-22T18:39:23.103+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] exit Started/Stray 1.258081 6 0.000102 2021-07-22T18:39:23.103+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] enter Started/ToDelete 2021-07-22T18:39:23.103+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] enter Started/ToDelete/WaitDeleteReseved 2021-07-22T18:39:23.103+0000 7f94a77af700 10 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] do_peering_event: epoch_sent: 37 epoch_requested: 37 DeleteReserved 2021-07-22T18:39:23.103+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] exit Started/ToDelete/WaitDeleteReseved 0.000070 1 0.000049 2021-07-22T18:39:23.103+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] enter Started/ToDelete/Deleting ... 2021-07-22T18:39:23.104+0000 7f94a77af700 10 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 DELETING pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] do_peering_event: epoch_sent: 37 epoch_requested: 37 DeleteSome 2021-07-22T18:39:23.104+0000 7f94a77af700 10 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 DELETING pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] do_delete_work 2021-07-22T18:39:23.104+0000 7f94a77af700 20 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 DELETING pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] do_delete_work [#1:c0000000::::head#] 2021-07-22T18:39:23.104+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 DELETING pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] exit Started/ToDelete/Deleting 0.001199 1 0.000238 2021-07-22T18:39:23.104+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] exit Started/ToDelete 0.001302 0 0.000000 2021-07-22T18:39:23.104+0000 7f94a77af700 5 osd.1 pg_epoch: 37 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=34/35 n=0 ec=30/14 lis/c=34/34 les/c/f=35/35/0 sis=36 pruub=13.882313728s) [5,2,7] r=-1 lpr=36 pi=[34,36)/1 crt=20'117 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 54.482391357s@ mbc={}] exit Started 1.259437 0 0.000000 2021-07-22T18:39:52.398+0000 7f94a77af700 5 osd.1 pg_epoch: 58 pg[1.3(unlocked)] enter Initial 2021-07-22T18:39:52.398+0000 7f94a77af700 20 osd.1 pg_epoch: 58 pg[1.3(unlocked)] enter NotTrimming 2021-07-22T18:39:52.398+0000 7f94a77af700 15 osd.1 pg_epoch: 58 pg[1.3(unlocked)] scrubberFSM pg(1.3) ScrubMachine created 1.3 2021-07-22T18:39:52.398+0000 7f94a77af700 10 osd.1 pg_epoch: 58 pg[1.3(unlocked)] scrubberFSM pg(1.3) -- state -->> NotActive 2021-07-22T18:39:52.398+0000 7f94a77af700 10 osd.1 pg_epoch: 58 pg[1.3( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) [] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown mbc={}] init role -1 up [4,1,2] acting [4,2,7] history ec=30/14 lis/c=45/45 les/c/f=46/46/0 sis=58 past_intervals ([45,57] all_participants=1,2,4,7 intervals=([45,55] acting 2,4,7),([56,57] acting 1,2,4)) 2021-07-22T18:39:52.398+0000 7f94a77af700 20 osd.1 pg_epoch: 58 pg[1.3( empty local-lis/les=0/0 n=0 ec=30/14 lis/c=45/45 les/c/f=46/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=0 pi=[45,58)/2 crt=0'0 mlcod 0'0 unknown mbc={}] on_new_interval 2021-07-22T18:39:52.398+0000 7f94a77af700 20 osd.1 pg_epoch: 58 pg[1.3( empty local-lis/les=0/0 n=0 ec=30/14 lis/c=45/45 les/c/f=46/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=0 pi=[45,58)/2 crt=0'0 mlcod 0'0 unknown mbc={}] on_new_interval upacting_features 0x3f01cfbb7ffdffff from [4,2,7]+[4,1,2] 2021-07-22T18:39:52.398+0000 7f94a77af700 20 osd.1 pg_epoch: 58 pg[1.3( empty local-lis/les=0/0 n=0 ec=30/14 lis/c=45/45 les/c/f=46/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=0 pi=[45,58)/2 crt=0'0 mlcod 0'0 unknown mbc={}] on_new_interval checking missing set deletes flag. missing = missing(0 may_include_deletes = 1) ... 2021-07-22T18:39:54.499+0000 7f94a37a7700 10 osd.1 pg_epoch: 61 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=45/45 les/c/f=46/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] do_peering_event: epoch_sent: 61 epoch_requested: 61 MInfoRec from 4 info: 1.3( v 20'117 (19'114,20'117] local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) pg_lease(ru 85.778335571s ub 88.188583374s int 16.000000000s) 2021-07-22T18:39:54.499+0000 7f94a37a7700 20 osd.1 pg_epoch: 61 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] update_history advanced history from ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58 2021-07-22T18:39:54.499+0000 7f94a37a7700 20 osd.1 pg_epoch: 61 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] on_info_history_change ... 2021-07-22T18:39:55.228+0000 7f94a77af700 10 osd.1 pg_epoch: 62 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] state<Started>: should_restart_peering, transitioning to Reset 2021-07-22T18:39:55.228+0000 7f94a77af700 5 osd.1 pg_epoch: 62 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] exit Started/ReplicaActive/RepNotRecovering 2.830019 9 0.000128 2021-07-22T18:39:55.228+0000 7f94a77af700 5 osd.1 pg_epoch: 62 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] exit Started/ReplicaActive 2.830031 0 0.000000 2021-07-22T18:39:55.228+0000 7f94a77af700 5 osd.1 pg_epoch: 62 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] exit Started 2.830226 0 0.000000 2021-07-22T18:39:55.228+0000 7f94a77af700 5 osd.1 pg_epoch: 62 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] enter Reset 2021-07-22T18:39:55.228+0000 7f94a77af700 20 osd.1 pg_epoch: 62 pg[1.3( v 20'117 (19'114,20'117] lb MIN local-lis/les=58/59 n=0 ec=30/14 lis/c=58/45 les/c/f=59/46/0 sis=58) [4,1,2]/[4,2,7] r=-1 lpr=58 pi=[45,58)/2 luod=0'0 crt=20'117 lcod 0'0 mlcod 0'0 active mbc={}] set_last_peering_reset 62
The PG was in incomplete state as indicated by "lb MIN" above until the test terminated. The backill initiated from osd.4 didn't have any missing or misplaced objects. The backfill process repeats in a loop and the PG never comes out of backfilling state.
The PG is from pool 1 which indicates a mgr pool. This issue is benign and very likely not seen on real world clusters..
The fix could probably add a check for num of objects before attempting a backfill.
No data to display