Bug #40772
Updated by David Zafman almost 5 years ago
osd-recovery-prio.sh TEST_recovery_pool_priority fails intermittently due to a delay in recovery starting on a PG. The test requires simultaneous recovery of 2 PGs:

<pre>
ceph osd pool set $pool1 size 2
ceph osd pool set $pool2 size 2
</pre>

Running the test alone doesn't necessarily reproduce the problem, which I saw when running all the standalone tests (although they are run sequentially):

<pre>
# ../qa/run-standalone.sh "osd-recovery-prio.sh TEST_recovery_pool_priority"
</pre>

There are no pg[2.0 messages for almost 1 minute. Map 35 seems to be the one that has the size change:

<pre>
2019-07-12T00:29:49.072-0700 7fc35220f700 10 osd.0 pg_epoch: 34 pg[2.0( v 31'600 (0'0,31'600] local-lis/les=29/30 n=200 ec=15/15 lis/c 29/29 les/c/f 30/30/0 29/29/15) [0] r=0 lpr=29 crt=31'600 lcod 31'599 mlcod 31'599 active+clean] do_peering_event: epoch_sent: 34 epoch_requested: 34 NullEvt
2019-07-12T00:30:44.588-0700 7fc35220f700 10 osd.0 pg_epoch: 34 pg[2.0( v 31'600 (0'0,31'600] local-lis/les=29/30 n=200 ec=15/15 lis/c 29/29 les/c/f 30/30/0 29/29/15) [0] r=0 lpr=29 crt=31'600 lcod 31'599 mlcod 31'599 active+clean] handle_advance_map: 35
</pre>

The cluster health log then reports:

<pre>
"message": "pg 2.0 is stuck undersized for 61.208213, current state active+recovering+undersized+degraded+remapped, last acting [0]"
"message": "pg 2.0 is stuck undersized for 63.209241, current state active+recovering+undersized+degraded+remapped, last acting [0]"
</pre>
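Since the failure comes from asserting priorities before both PGs have actually entered recovery, one way to make a test like this robust is to poll each PG's state before asserting. The sketch below is only an illustration of that polling pattern, not the actual test code: <code>pg_state</code> is a stub standing in for something like <code>ceph pg $pgid query</code>, and the PG ids are taken from the log above.

<pre>
#!/bin/sh
# Hypothetical sketch: wait until a PG reports "recovering" before asserting
# anything about recovery priorities. pg_state is a STUB for illustration;
# the real test would query the cluster (e.g. via the ceph CLI).
pg_state() {
    # Stubbed: always reports a recovering state.
    echo "active+recovering+degraded"
}

wait_for_recovering() {
    pg=$1
    tries=0
    while [ "$tries" -lt 60 ]; do
        case "$(pg_state "$pg")" in
            *recovering*) return 0 ;;
        esac
        tries=$((tries + 1))
        sleep 1
    done
    return 1
}

# Only proceed once BOTH pools' PGs are recovering simultaneously.
wait_for_recovering 2.0 && wait_for_recovering 3.0 && echo "both recovering"
</pre>

With a timeout longer than the roughly one-minute stall seen in the log above, a delay in recovery starting would no longer fail the test outright.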