Bug #14952

New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean

Added by John Spray over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Monitor
Target version: -
Start date: 03/02/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

(Seen in master on a vstart cluster, but this presumably happens on real clusters too.)

Right after creating some pools:

     health HEALTH_ERR
            2 pgs are stuck inactive for more than 300 seconds
            5 pgs degraded
            2 pgs stuck inactive
            5 pgs stuck unclean
            5 pgs undersized

It seems likely that this happens because stats like last_active are zero if a PG has never been active. The logic in PGMonitor checks these stats against (now - mon_pg_stuck_threshold), and 0 is always before that cutoff.
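To make the suspected failure mode concrete, here is a minimal Python sketch of the comparison described above. It is not the PGMonitor code: the function name `is_stuck` and the sample timestamps are hypothetical, and the 300-second threshold is taken from the health output.

```python
# Hypothetical stand-in for mon_pg_stuck_threshold (seconds), per the
# "more than 300 seconds" wording in the health output.
MON_PG_STUCK_THRESHOLD = 300.0


def is_stuck(last_active, now, threshold=MON_PG_STUCK_THRESHOLD):
    """Return True if last_active is older than (now - threshold)."""
    cutoff = now - threshold
    return last_active < cutoff


now = 1_456_900_000.0        # arbitrary wall-clock time, seconds since the epoch
never_active = 0.0           # assumed: the stat stays zero until first active
recently_active = now - 10   # a PG that was active ten seconds ago

print(is_stuck(never_active, now))     # True: 0 is always before the cutoff
print(is_stuck(recently_active, now))  # False
```

Under that assumption, a PG that has never been active is reported as stuck no matter how recently it was created.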

What should our logic be here?

  • we could initialize all the last_* stats to the time of creation
  • we could never count a PG as stuck until it has existed for at least mon_pg_stuck_threshold
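The two options can be compared with a small Python sketch. Everything here is illustrative: `PGStat`, `new_pg_with_creation_stamp`, `is_stuck_plain`, and `is_stuck_with_grace` are hypothetical names, not Ceph's data structures or API, and the threshold value is assumed to be 300 seconds.

```python
from dataclasses import dataclass

THRESHOLD = 300.0  # assumed value of mon_pg_stuck_threshold, in seconds


@dataclass
class PGStat:
    created: float            # creation time of the PG
    last_active: float = 0.0  # zero models "has never been active"


def new_pg_with_creation_stamp(created):
    """Option 1: start last_active at the creation time instead of zero."""
    return PGStat(created=created, last_active=created)


def is_stuck_plain(pg, now, threshold=THRESHOLD):
    """Unchanged check: compare last_active against the cutoff."""
    return pg.last_active < now - threshold


def is_stuck_with_grace(pg, now, threshold=THRESHOLD):
    """Option 2: a PG younger than the threshold is never counted as stuck."""
    if now - pg.created < threshold:
        return False
    return pg.last_active < now - threshold


created = 1000.0
opt1 = new_pg_with_creation_stamp(created)
opt2 = PGStat(created=created)  # last_active left at zero

# Two seconds after creation, neither option reports the PG as stuck.
print(is_stuck_plain(opt1, created + 2))       # False
print(is_stuck_with_grace(opt2, created + 2))  # False

# 400 seconds later, with the PG still never active, both report it.
print(is_stuck_plain(opt1, created + 400))       # True
print(is_stuck_with_grace(opt2, created + 400))  # True
```

In this sketch both options behave the same for a PG that never becomes active. Option 1 changes only the initial data and leaves the check alone, while option 2 leaves the data alone and adds a condition to the check.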

As it is, the messages are definitely crazy, especially the "for more than 300 seconds" message on a cluster that I created two seconds ago.


Related issues

Copied to Ceph - Backport #15806: jewel: New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean Resolved

Associated revisions

Revision 433afce6 (diff)
Added by Sage Weil over 3 years ago

mon: initialize last_* timestamps on new pgs to creation time

Currently, when you create a pool, until the PGs peer you generate a health
error like

8 pgs are stuck inactive for more than 300 seconds

which is inaccurate and misleading. Instead, set the timestamps to the
creation time so that warnings don't appear until it's clear they're stuck.

Fixes: #14952
Signed-off-by: Sage Weil <>

Revision 93fdd95b (diff)
Added by Sage Weil over 3 years ago

mon: initialize last_* timestamps on new pgs to creation time

Currently, when you create a pool, until the PGs peer you generate a health
error like

8 pgs are stuck inactive for more than 300 seconds

which is inaccurate and misleading. Instead, set the timestamps to the
creation time so that warnings don't appear until it's clear they're stuck.

Fixes: #14952
Signed-off-by: Sage Weil <>

History

#1 Updated by John Spray over 3 years ago

  • Description updated (diff)

#2 Updated by John Spray over 3 years ago

  • Description updated (diff)

#3 Updated by Samuel Just over 3 years ago

  • Priority changed from Normal to Urgent

#4 Updated by Samuel Just over 3 years ago

I think the PGs should be deemed stuck based on time since creation; that would make sense.

#5 Updated by Sage Weil over 3 years ago

  • Status changed from New to Need Review

#6 Updated by Sage Weil over 3 years ago

  • Status changed from Need Review to Resolved

#7 Updated by Sage Weil about 3 years ago

  • Status changed from Resolved to Need Review

#8 Updated by Sage Weil about 3 years ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel

Need to backport the final fix, 11e4242fbdb2f2f6f654d4cb3a7c95d5b38a88c2.

#9 Updated by Nathan Cutler about 3 years ago

  • Copied to Backport #15806: jewel: New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean added

#10 Updated by Loic Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved
