Bug #14952: New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean

 (Seen in master on a vstart cluster, but presumably this happens on real clusters too.)

 Right after creating some pools: 

 <pre> 
      health HEALTH_ERR 
             2 pgs are stuck inactive for more than 300 seconds 
             5 pgs degraded 
             2 pgs stuck inactive 
             5 pgs stuck unclean 
             5 pgs undersized 
 </pre> 

 It seems this is probably because stats like last_active are zero for a PG that has never been active. The logic in PGMonitor checks these stats against (now - mon_pg_stuck_threshold), and 0 is always before that cutoff.
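 A minimal sketch of the suspected comparison, written in Python rather than the monitor's C++. The function name `is_stuck` and the plain-float timestamps are illustrative assumptions, not PGMonitor's actual code; only `last_active`, `mon_pg_stuck_threshold`, and the 300-second value come from this report.

```python
import time

MON_PG_STUCK_THRESHOLD = 300.0  # seconds; value taken from the health message above


def is_stuck(last_active: float, now: float,
             threshold: float = MON_PG_STUCK_THRESHOLD) -> bool:
    """Hypothetical model of the check: stuck if last_active predates the cutoff."""
    cutoff = now - threshold
    return last_active < cutoff


now = time.time()
# Assumption: a PG that has never been active reports last_active == 0,
# so it falls before any realistic cutoff, even on a brand-new cluster.
never_active = is_stuck(last_active=0.0, now=now)
# A PG that was active two seconds ago is not flagged.
recently_active = is_stuck(last_active=now - 2.0, now=now)
```

 Under these assumptions `never_active` is true no matter how recently the PG was created, which would match the bogus "for more than 300 seconds" message.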

 What should our logic be here?

  * We could initialize all the last_* stats to the time of creation.
  * We could never count a PG as stuck until it has existed for at least mon_pg_stuck_threshold.
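
 The two options could look roughly like the sketch below. It is Python, not the monitor's C++, and `PGStat`, `created`, and both function names are hypothetical; the real stats structure holds many more fields.

```python
from dataclasses import dataclass

THRESHOLD = 300.0  # stands in for mon_pg_stuck_threshold, in seconds


@dataclass
class PGStat:
    created: float            # hypothetical field: time the PG was created
    last_active: float = 0.0  # zero means "never active", as described above


def stuck_option_1(created: float, now: float) -> bool:
    """Option 1: seed last_active with the creation time instead of zero."""
    stat = PGStat(created=created, last_active=created)
    return stat.last_active < now - THRESHOLD


def stuck_option_2(stat: PGStat, now: float) -> bool:
    """Option 2: never report a PG as stuck until it has existed for the threshold."""
    if now - stat.created < THRESHOLD:
        return False
    return stat.last_active < now - THRESHOLD


now = 1_000_000.0
fresh = PGStat(created=now - 2.0)    # created two seconds ago, never active
old = PGStat(created=now - 600.0)    # created ten minutes ago, never active
```

 With either option the `fresh` PG is not reported as stuck, while the `old` PG that has never gone active still is.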

 As it is, the messages are definitely crazy, especially the "for more than 300 seconds" message on a cluster that I created two seconds ago.
