Feature #1885

Identify the top 10 expected failures and the process to diagnose each

Added by Sage Weil almost 12 years ago. Updated almost 12 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%

Description

- peering failures
- unfound objects

History

#1 Updated by Sage Weil almost 12 years ago

#2 Updated by Sage Weil almost 12 years ago

  • Assignee set to Anonymous

#3 Updated by Sage Weil almost 12 years ago

#4 Updated by Sage Weil almost 12 years ago

  • Target version changed from v0.41 to v0.42

#5 Updated by Anonymous almost 12 years ago

OSD:
  • cascading failures
  • single OSD failure
  • failure to complete peering/recovery
  • unfound objects after recovery
  • full
  • slow
  • fails to respond to some requests
Monitors:
  • failure
RGW:
  • failure
Load Balancer:
  • stops forwarding requests

#6 Updated by Anonymous almost 12 years ago

Additional issues from Carl's list:
  • RGW request timeouts
  • OSD file system timeouts
  • OSD that is "down" but still "in"
  • degraded placement groups

#7 Updated by Greg Farnum almost 12 years ago

Mark Kampe wrote:

> Additional issues from Carl's list:
>   • RGW request timeouts

That's a symptom, not a cause...

>   • OSD file system timeouts

What timeouts? We have a few that cause suicides, but I suspect he just means OSDs being slow in the filesystem.

>   • OSD that is "down" but still "in"
>   • degraded placement groups

I'm not sure what either of these is about. Both are revealed by "ceph -s" (with more detail in "ceph osd dump" and "ceph pg dump"), and neither is a problem in and of itself.
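
For anyone who wants to check these two conditions from a script rather than by eye, here is a minimal Python sketch. It is an illustration, not part of the ticket. It assumes a Ceph release where "ceph osd dump --format json" is available and where each entry under "osds" carries "osd", "up" and "in" fields. The helper names (down_but_in, count_pg_flags, fetch_osd_dump) and the sample data are hypothetical. PG states are taken as plain strings such as "active+degraded", however they were obtained.

```python
import json
import subprocess
from collections import Counter


def down_but_in(osd_dump):
    """Return ids of OSDs that are marked down while still marked in.

    Assumes the JSON layout of `ceph osd dump --format json`:
    a top-level "osds" list whose entries have "osd", "up" and "in".
    """
    return [
        osd["osd"]
        for osd in osd_dump.get("osds", [])
        if not osd.get("up") and osd.get("in")
    ]


def count_pg_flags(pg_states):
    """Count how often each flag (e.g. 'degraded') occurs across PG states."""
    flags = Counter()
    for state in pg_states:
        # A PG state is a '+'-joined set of flags, e.g. "active+degraded".
        flags.update(state.split("+"))
    return flags


def fetch_osd_dump():
    """Hypothetical helper: needs the ceph CLI and a reachable cluster."""
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)


# Illustrative data only; not taken from a real cluster.
sample_osd_dump = {
    "osds": [
        {"osd": 0, "up": 1, "in": 1},
        {"osd": 1, "up": 0, "in": 1},
        {"osd": 2, "up": 0, "in": 0},
    ]
}
sample_pg_states = ["active+clean", "active+degraded", "active+clean"]

print(down_but_in(sample_osd_dump))
print(count_pg_flags(sample_pg_states)["degraded"])
```

As noted above, neither result is a problem on its own: a down-but-in OSD or a degraded PG count is a starting point for diagnosis, not a diagnosis.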

#8 Updated by Sage Weil almost 12 years ago

  • Status changed from New to Resolved
