Bug #14952

New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean

Added by John Spray about 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: other
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

(In master, on a vstart cluster but presumably happens on real clusters too)

Right after creating some pools:

     health HEALTH_ERR
            2 pgs are stuck inactive for more than 300 seconds
            5 pgs degraded
            2 pgs stuck inactive
            5 pgs stuck unclean
            5 pgs undersized

It seems like this is probably because stats like last_active are zero if a PG has never been active? The logic in PGMonitor checks these stats against (now - mon_pg_stuck_threshold), and 0 is always before that cutoff.
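
A minimal sketch of the suspected failure mode, written in Python rather than the monitor's C++. The function `is_stuck`, its parameters, and the 300-second default are illustrative assumptions taken from the health output above; this is not the actual PGMonitor code, and only `mon_pg_stuck_threshold` and `last_active` are names from the report.

```python
# Hypothetical model of the check described above; not the real PGMonitor code.
MON_PG_STUCK_THRESHOLD = 300.0  # seconds; assumed from the "300 seconds" in the health output


def is_stuck(last_active: float, now: float,
             threshold: float = MON_PG_STUCK_THRESHOLD) -> bool:
    """Report a PG as stuck if it was last active before the cutoff."""
    cutoff = now - threshold
    return last_active < cutoff


# A pool created 2 seconds ago whose PGs have never been active.
created = 1_000_000.0
now = created + 2.0

never_active = 0.0  # the stat is still at its zero default
print(is_stuck(never_active, now))  # True: 0 is before any realistic cutoff
print(is_stuck(created, now))       # False: a timestamp near "now" is inside the window
```

If the stats really do default to zero, this would explain why a two-second-old cluster reports PGs as stuck "for more than 300 seconds".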

What should our logic be here?

  • We could initialize all the last_* stats to the time of creation.
  • We could never count a PG as stuck until it has existed for at least mon_pg_stuck_threshold.

As it is, the messages are definitely crazy, especially the "for more than 300 seconds" message on a cluster that I created two seconds ago.
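
The two options proposed above can be compared in a small Python model. Everything here is a hypothetical sketch: `PGStat`, its `created` field, and the helper names are invented for illustration and do not reflect the real Ceph data structures.

```python
from dataclasses import dataclass

THRESHOLD = 300.0  # seconds; stands in for mon_pg_stuck_threshold


@dataclass
class PGStat:
    created: float            # creation time of the PG (hypothetical field)
    last_active: float = 0.0  # zero means "never active" in this model


def new_pg_option_a(created: float) -> PGStat:
    """Option 1: seed the last_* stat with the creation time."""
    return PGStat(created=created, last_active=created)


def is_stuck_option_a(pg: PGStat, now: float) -> bool:
    """The check itself stays unchanged; only the initial value differs."""
    return pg.last_active < now - THRESHOLD


def is_stuck_option_b(pg: PGStat, now: float) -> bool:
    """Option 2: keep the zero default, but ignore PGs younger than the threshold."""
    if now - pg.created < THRESHOLD:
        return False
    return pg.last_active < now - THRESHOLD


t0 = 1_000_000.0
pg_a = new_pg_option_a(t0)   # option 1: timestamps seeded at creation
pg_b = PGStat(created=t0)    # option 2: timestamps left at zero

for age in (2.0, 301.0):
    print(age, is_stuck_option_a(pg_a, t0 + age), is_stuck_option_b(pg_b, t0 + age))
```

In this model both options agree for a PG that never becomes active: it is not reported 2 seconds after creation and is reported 301 seconds after. Option 1 leaves the check untouched and changes only initialization, while option 2 needs the creation time to be available wherever the check runs.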


Related issues

Copied to Ceph - Backport #15806: jewel: New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean Resolved

Associated revisions

Revision 433afce6 (diff)
Added by Sage Weil about 8 years ago

mon: initialize last_* timestamps on new pgs to creation time

Currently, when you create a pool, until the PGs peer you generate a health
error like

8 pgs are stuck inactive for more than 300 seconds

which is inaccurate and misleading. Instead, set the timestamps to the
creation time so that warnings don't appear until it's clear they're stuck.

Fixes: #14952
Signed-off-by: Sage Weil <>

Revision 93fdd95b (diff)
Added by Sage Weil about 8 years ago

mon: initialize last_* timestamps on new pgs to creation time

Currently, when you create a pool, until the PGs peer you generate a health
error like

8 pgs are stuck inactive for more than 300 seconds

which is inaccurate and misleading. Instead, set the timestamps to the
creation time so that warnings don't appear until it's clear they're stuck.

Fixes: #14952
Signed-off-by: Sage Weil <>

History

#1 Updated by John Spray about 8 years ago

  • Description updated (diff)

#2 Updated by John Spray about 8 years ago

  • Description updated (diff)

#3 Updated by Samuel Just about 8 years ago

  • Priority changed from Normal to Urgent

#4 Updated by Samuel Just about 8 years ago

I think the PGs should be deemed stuck based on time since creation; that would make sense.

#5 Updated by Sage Weil about 8 years ago

  • Status changed from New to Fix Under Review

#6 Updated by Sage Weil about 8 years ago

  • Status changed from Fix Under Review to Resolved

#7 Updated by Sage Weil almost 8 years ago

  • Status changed from Resolved to Fix Under Review

#8 Updated by Sage Weil almost 8 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to jewel

Need to backport the final fix, 11e4242fbdb2f2f6f654d4cb3a7c95d5b38a88c2.

#9 Updated by Nathan Cutler almost 8 years ago

  • Copied to Backport #15806: jewel: New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean added

#10 Updated by Loïc Dachary over 7 years ago

  • Status changed from Pending Backport to Resolved
