Feature #3849

Track slow PGs and times OSDs marked down

Added by Ian Colle about 7 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Support
Tags:
Backport:
Reviewed: 02/27/2013
Affected Versions:
Pull request ID:

Description

Kyle Bader:
"Over the weekend of 01/02/13 we ran into an issue that we had not seen
before. One of our cephstore nodes started having issues on a bonded
link to the Ceph public network. We discovered the issue by collecting a
list of PGs that had slow request entries in the cluster log on the monitor
(/var/log/ceph/ceph.log), sorting by frequency and correlating the primary
OSDs to the cephstore they were running on. It turns out that almost all the
PGs that were experiencing slowness pointed to a single cephstore, which
upon investigation had TCP errors on the bonded interface. Ideally, errors
like this should be detected by the switch and the problematic link should
be shut off, but in practice this didn't happen. I suspect similar situations
will arise with other customers, and it would be nice if Ceph could help
detect the problem and heal itself. Perhaps something that keeps track of
slowness and increments a counter on each occurrence; when the counter passes
a configurable threshold, an entry is sent to the OSD log and the OSD is shut
off."
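
For reference, the manual triage described above (count slow request entries, group them by host, flag anything over a threshold) could be sketched roughly as follows. This is only an illustration, not anything Ceph ships. It assumes each cluster log line names the reporting OSD as "osd.<id>" before the words "slow request", and it uses the reporting OSD as a stand-in for the PG's primary. The function names, the osd_to_host mapping and the threshold value are all hypothetical.

```python
import re
from collections import Counter

# Assumed line shape (not verified against any particular Ceph release):
#   "<date> <time> osd.<id> <addr> <seq> : [WRN] slow request ..."
SLOW_RE = re.compile(r"\bosd\.(\d+)\b.*\bslow request\b")


def slow_requests_per_osd(lines):
    """Count slow-request log entries per reporting OSD id."""
    counts = Counter()
    for line in lines:
        match = SLOW_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def slow_requests_per_host(osd_counts, osd_to_host):
    """Roll per-OSD counts up to the host each OSD runs on.

    osd_to_host is a hypothetical {osd_id: hostname} mapping supplied
    by the caller; OSDs missing from it are grouped under "unknown".
    """
    hosts = Counter()
    for osd_id, count in osd_counts.items():
        hosts[osd_to_host.get(osd_id, "unknown")] += count
    return hosts


def hosts_over_threshold(host_counts, threshold):
    """Return hosts whose count exceeds the threshold, worst first."""
    return [host for host, count in host_counts.most_common()
            if count > threshold]
```

Feeding it the lines of /var/log/ceph/ceph.log together with an OSD-to-host mapping would give a ranked list of suspect hosts. What to do with a flagged host (log an entry, mark the OSD down) is left open here, as it is in the request.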

History

#1 Updated by JuanJose Galvez about 7 years ago

  • Source set to Support

#2 Updated by Ian Colle about 7 years ago

  • Reviewed set to 02/27/2013

#3 Updated by Neil Levine almost 7 years ago

  • Status changed from New to 12

#4 Updated by Sage Weil over 5 years ago

  • Status changed from 12 to Resolved
