Feature #1885
Identify the top 10 expected failures and the process to diagnose each
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:
Description
- peering failures
- unfound objects
History
#1 Updated by Sage Weil almost 12 years ago
- Position set to 4
#2 Updated by Sage Weil almost 12 years ago
- Assignee set to Anonymous
#3 Updated by Sage Weil almost 12 years ago
- Position deleted (22)
- Position set to 10
#4 Updated by Sage Weil almost 12 years ago
- Target version changed from v0.41 to v0.42
- Position deleted (17)
- Position set to 1
#5 Updated by Anonymous almost 12 years ago
OSD:
- cascading failures
- single OSD failure
- failure to complete peering/recovery
- unfound objects after recovery
- full
- slow
- fails to respond to some requests
- failure
- failure
- stops forwarding requests
#6 Updated by Anonymous almost 12 years ago
Additional issues from Carl's list:
- RGW request timeouts
- OSD file system timeouts
- OSD that is "down" but still "in"
- degraded placement groups
#7 Updated by Greg Farnum almost 12 years ago
Mark Kampe wrote:
> Additional issues from Carl's list:
> - RGW request timeouts
That's a symptom, not a cause...
> - OSD file system timeouts
What timeouts? We have a few that cause OSD suicides, but I suspect he just means OSDs being slow in the filesystem.
> - OSD that is "down" but still "in"
> - degraded placement groups
I'm not sure what either of these is about. Both show up in "ceph -s" (with more detail in "ceph osd dump" and "ceph pg dump"), and neither is a problem in and of itself.
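Greg's point that both conditions are visible in the cluster's own status output can be illustrated with a small filter. The sketch below is a hypothetical Python helper, not part of Ceph: it assumes the output of "ceph osd dump" and "ceph pg dump" has been captured as JSON (for example with `--format json`) and that the field names match the sample data, which may differ between releases.

```python
import json
from typing import Any, Dict, List

# Hypothetical sample shaped like `ceph osd dump --format json` output.
# The keys "osds", "osd", "up" and "in" are assumptions; check them
# against the JSON your Ceph release actually emits.
SAMPLE_OSD_DUMP = """
{"epoch": 42,
 "osds": [
   {"osd": 0, "up": 1, "in": 1},
   {"osd": 1, "up": 0, "in": 1},
   {"osd": 2, "up": 0, "in": 0}
 ]}
"""

# Hypothetical per-PG records; the real `ceph pg dump` JSON nests these
# differently depending on the release.
SAMPLE_PG_STATS: List[Dict[str, Any]] = [
    {"pgid": "1.0", "state": "active+clean"},
    {"pgid": "1.1", "state": "active+degraded"},
    {"pgid": "1.2", "state": "active+recovering+degraded"},
]


def down_but_in(osd_dump: Dict[str, Any]) -> List[int]:
    """Return ids of OSDs that are down yet still counted as in."""
    return [o["osd"] for o in osd_dump.get("osds", [])
            if not o.get("up") and o.get("in")]


def degraded_pgs(pg_stats: List[Dict[str, Any]]) -> List[str]:
    """Return ids of PGs whose '+'-joined state string includes 'degraded'."""
    return [pg["pgid"] for pg in pg_stats
            if "degraded" in pg.get("state", "").split("+")]


if __name__ == "__main__":
    dump = json.loads(SAMPLE_OSD_DUMP)
    print("down but in:", down_but_in(dump))               # down but in: [1]
    print("degraded PGs:", degraded_pgs(SAMPLE_PG_STATS))  # degraded PGs: ['1.1', '1.2']
```

As the comment above says, a match is not necessarily a fault: an OSD is normally down but still in for the period before the monitors mark it out, and PGs stay degraded while recovery is still in progress.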
#8 Updated by Sage Weil almost 12 years ago
- Status changed from New to Resolved
- Position deleted (16)
- Position set to 16