Feature #13923

Set health to ERR when one or more PGs is stuck inactive

Added by Wido den Hollander about 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
Start date: 12/01/2015
Due date:
% Done: 0%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Based on this thread: http://article.gmane.org/gmane.comp.file-systems.ceph.user/25551

I would propose two additional settings:

mon_pg_inactive_max = 300
mon_pg_inactive_num = 1

With these values, if one or more PGs are stuck inactive for more than 300 seconds, the health state would go from WARN to ERR.

In RBD environments even one inactive PG can cause almost all I/O to stall since Block Devices hit so many different PGs.
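
The proposed rule could be sketched roughly as below. This is only an illustration in Python, not monitor code: the function name `pg_inactive_health` and the health-state strings are made up for the example, and it assumes that any inactive PG which has not yet crossed the limit is reported as WARN. The two parameters mirror the settings proposed above.

```python
from typing import Iterable

# Hypothetical state labels, used only for this illustration.
HEALTH_OK = "HEALTH_OK"
HEALTH_WARN = "HEALTH_WARN"
HEALTH_ERR = "HEALTH_ERR"


def pg_inactive_health(
    stuck_seconds: Iterable[float],
    mon_pg_inactive_max: float = 300,
    mon_pg_inactive_num: int = 1,
) -> str:
    """Return a health state for the currently inactive PGs.

    stuck_seconds holds one entry per inactive PG: how long that PG
    has been inactive, in seconds.
    """
    durations = list(stuck_seconds)
    if not durations:
        # No inactive PGs at all.
        return HEALTH_OK

    # Count PGs that have been inactive for more than the limit.
    over_limit = sum(1 for s in durations if s > mon_pg_inactive_max)

    if over_limit >= mon_pg_inactive_num:
        return HEALTH_ERR

    # Assumption: inactive PGs below the limit still warrant WARN.
    return HEALTH_WARN


if __name__ == "__main__":
    print(pg_inactive_health([120]))       # inactive, but under the limit
    print(pg_inactive_health([120, 450]))  # one PG over the limit
```

With the proposed defaults, a single PG stuck for more than 300 seconds is enough to escalate; raising `mon_pg_inactive_num` would require that many PGs to exceed the limit before the state changes.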

Associated revisions

Revision a9addc61 (diff)
Added by Wido den Hollander about 3 years ago

mon: Go into ERR state if multiple PGs are stuck inactive

If >=X PGs are stuck inactive longer than 'mon_pg_stuck_threshold'
we go into ERR state.

This is useful for situations where one or more PGs stay stuck in
peering or undersized state due to an OSD failure.

RBD volumes can become fully unresponsive if one or more PGs are inactive.

Fixes: #13923
Signed-off-by: Wido den Hollander <>

History

#1 Updated by Abhishek Lekshmanan about 3 years ago

  • Status changed from New to Need Review

#2 Updated by Wido den Hollander almost 2 years ago

  • Status changed from Need Review to Resolved
