Set health to ERR when one or more PGs are stuck inactive
Based on this thread: http://article.gmane.org/gmane.comp.file-systems.ceph.user/25551
I would propose two additional settings:
mon_pg_inactive_max = 300
mon_pg_inactive_num = 1
In this case, if 1 or more PGs are stuck inactive for more than 300 seconds, the health state would escalate from WARN to ERR.
In RBD environments even a single inactive PG can stall almost all I/O, since block device workloads touch so many different PGs.
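The proposed options could be set in ceph.conf like this (a sketch of the proposal above; these option names are not part of an existing Ceph release):

```ini
[mon]
# Escalate HEALTH_WARN to HEALTH_ERR when at least
# mon_pg_inactive_num PGs have been stuck inactive for longer
# than mon_pg_inactive_max seconds.
mon_pg_inactive_max = 300
mon_pg_inactive_num = 1
```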
mon: Go into ERR state if multiple PGs are stuck inactive
If >=X PGs are stuck inactive longer than 'mon_pg_stuck_threshold'
we go into ERR state.
This is useful for situations where one or more PGs stay stuck in
the peering or undersized state due to an OSD failure.
RBD volumes can become fully unresponsive if one or more PGs are inactive.