Bug #45388: Insufficient monitor logging to diagnose downed OSDs - RADOS - Ceph

Bug #45388

open

Added by Christian Huebner almost 4 years ago. Updated 10 months ago.

Status:

New

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Community (user)

Tags:

low-hanging-fruit

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

We just had a case in a Ceph Luminous cluster where the monitor forced newly started OSDs to commit suicide. Communication between the monitor and the OSDs was fine, but each affected OSD went down with a log message saying that the monitor had forced it to commit suicide. Only after increasing the debug level did we find that some OSDs had reported the OSD as down, which is why the monitor took action and forced the OSD process to stop.

If a monitor forces an OSD to commit suicide, the reason must be reported in the monitor log at the default log level, including which OSDs reported the OSD down.
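To illustrate the kind of message being requested, here is a minimal Python sketch. It is not Ceph code: `FailureTracker`, `format_down_reason`, the `min_reporters` threshold, and the message wording are all hypothetical names and values chosen for illustration. It only models the idea that, once enough peers have reported an OSD as failed, the decision is logged at a level visible by default, together with the reporting OSDs.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, Set

log = logging.getLogger("mon_sketch")  # hypothetical logger name


def format_down_reason(target: int, reporters: Set[int]) -> str:
    """Build the (hypothetical) log line naming the target and its reporters."""
    names = ", ".join(f"osd.{r}" for r in sorted(reporters))
    return f"marking osd.{target} down: reported failed by {names}"


@dataclass
class FailureTracker:
    """Toy model of a monitor collecting failure reports about OSDs."""
    min_reporters: int = 2  # assumed threshold, for illustration only
    reports: Dict[int, Set[int]] = field(default_factory=dict)

    def report_failure(self, target: int, reporter: int) -> bool:
        """Record one report; return True if the target is now marked down."""
        reporters = self.reports.setdefault(target, set())
        reporters.add(reporter)  # repeated reports from one OSD count once
        if len(reporters) < self.min_reporters:
            return False
        # WARNING is emitted under Python's default logging configuration,
        # standing in for "visible at the default log level".
        log.warning(format_down_reason(target, reporters))
        del self.reports[target]
        return True
```

With a line like this in the monitor log, an operator could see which peers triggered the down-marking without first raising the debug level.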

Impact: Troubleshooting on production clusters always happens under time pressure, so having the reasons reported often makes the difference between maintaining SLAs and breaking them.