Bug #14952
New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean
Description
(Seen in master on a vstart cluster, but this presumably happens on real clusters too.)
Right after creating some pools:
    health HEALTH_ERR
           2 pgs are stuck inactive for more than 300 seconds
           5 pgs degraded
           2 pgs stuck inactive
           5 pgs stuck unclean
           5 pgs undersized
This is probably because stats like last_active are zero for a PG that has never been active. The logic in PGMonitor checks these stats against (now - mon_pg_stuck_threshold), and 0 is always before that cutoff.
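A minimal sketch of the suspected failure mode, written in Python rather than the monitor's C++. The function name `is_stuck_inactive` and the sample timestamps are hypothetical; only `last_active`, `mon_pg_stuck_threshold`, and the 300-second value come from this report.

```python
# Hypothetical model of the stuck check; not the actual PGMonitor code.
MON_PG_STUCK_THRESHOLD = 300  # seconds, matching the "300 seconds" in the health message


def is_stuck_inactive(last_active, now, threshold=MON_PG_STUCK_THRESHOLD):
    """Return True if the PG was last active before the cutoff (now - threshold)."""
    cutoff = now - threshold
    return last_active < cutoff


now = 1_457_000_000.0  # arbitrary wall-clock time, in seconds since the epoch

# A PG created two seconds ago that has never been active: last_active is still 0.
new_pg_stuck = is_stuck_inactive(last_active=0.0, now=now)

# A PG that was active two seconds ago.
healthy_pg_stuck = is_stuck_inactive(last_active=now - 2, now=now)

print(new_pg_stuck, healthy_pg_stuck)  # True False
```

Under this model, any PG whose timestamp is still zero is reported as stuck regardless of how recently it was created.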
What should our logic be here?
- we could initialize all the last_* stats to the time of creation
- we could never count something as stuck until the PG has existed for at least mon_pg_stuck_threshold
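The two proposals can be compared with a small Python model. This is a sketch under stated assumptions, not the monitor's code: the helper names (`new_pg_stats`, `stuck_by_timestamp`, `stuck_with_age_guard`) and the dictionary layout are hypothetical, and only `last_active` and the 300-second threshold come from this report.

```python
# Hypothetical model of the two proposals; not the actual monitor code.
MON_PG_STUCK_THRESHOLD = 300  # seconds


def new_pg_stats(created, init_to_creation):
    """Build stats for a freshly created PG that has never been active."""
    last_active = created if init_to_creation else 0.0
    return {"created": created, "last_active": last_active}


def stuck_by_timestamp(pg, now, threshold=MON_PG_STUCK_THRESHOLD):
    """Timestamp-only check: compare last_active against the cutoff."""
    return pg["last_active"] < now - threshold


def stuck_with_age_guard(pg, now, threshold=MON_PG_STUCK_THRESHOLD):
    """Second proposal: a PG younger than the threshold is never stuck."""
    if now - pg["created"] < threshold:
        return False
    return stuck_by_timestamp(pg, now, threshold)


created = 1_457_000_000.0  # arbitrary creation time, seconds since the epoch
results = {}
for age in (2, 301):
    now = created + age
    # Proposal 1: timestamps start at creation time, check is unchanged.
    option1 = stuck_by_timestamp(new_pg_stats(created, init_to_creation=True), now)
    # Proposal 2: timestamps start at zero, check gains an age guard.
    option2 = stuck_with_age_guard(new_pg_stats(created, init_to_creation=False), now)
    results[age] = (option1, option2)

print(results)  # {2: (False, False), 301: (True, True)}
```

In this model both proposals behave the same for a PG that never becomes active: nothing is reported at two seconds, and the PG is reported as stuck once it is older than the threshold. The first proposal leaves the check itself untouched, while the second needs the creation time to be available wherever the check runs.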
As it is, the messages are definitely crazy, especially the "for more than 300 seconds" message on a cluster that I created two seconds ago.
Associated revisions
mon: initialize last_* timestamps on new pgs to creation time
Currently, when you create a pool, until the PGs peer you generate a health
error like
8 pgs are stuck inactive for more than 300 seconds
which is inaccurate and misleading. Instead, set the timestamps to the
creation time so that warnings don't appear until it's clear they're stuck.
Fixes: #14952
Signed-off-by: Sage Weil <sage@redhat.com>
History
#1 Updated by John Spray about 8 years ago
- Description updated (diff)
#2 Updated by John Spray about 8 years ago
- Description updated (diff)
#3 Updated by Samuel Just about 8 years ago
- Priority changed from Normal to Urgent
#4 Updated by Samuel Just about 8 years ago
I think the PGs should be deemed stuck based on time since creation; that would make sense.
#5 Updated by Sage Weil about 8 years ago
- Status changed from New to Fix Under Review
#6 Updated by Sage Weil about 8 years ago
- Status changed from Fix Under Review to Resolved
#7 Updated by Sage Weil almost 8 years ago
- Status changed from Resolved to Fix Under Review
#8 Updated by Sage Weil almost 8 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to jewel
need to backport final fix, 11e4242fbdb2f2f6f654d4cb3a7c95d5b38a88c2
#9 Updated by Nathan Cutler almost 8 years ago
- Copied to Backport #15806: jewel: New pools have bogus "stuck inactive/unclean" HEALTH_ERR messages until they are first active and clean added
#10 Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved