Bug #45388

open

Insufficient monitor logging to diagnose downed OSDs

Added by Christian Huebner about 4 years ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: low-hanging-fruit
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We just had a case in a Ceph Luminous cluster where the monitor forced newly started OSDs to commit suicide. Communication between the monitor and the OSD was fine, but the OSD went down with a log message saying that the monitor had forced it to commit suicide. Only after increasing the debug level did we find that some OSDs had reported the OSD as down, and that the monitor had therefore taken action and forced the OSD process to stop.

If a monitor forces an OSD to commit suicide, the reason must be reported in the monitor log at the default log level, including which OSDs reported the OSD as down.
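To make the request concrete, here is a minimal Python sketch of the information such a log line could carry. It is not Ceph code and does not reflect the monitor's actual log format: the function names `format_down_reason` and `mark_down`, the logger name, and the message wording are all hypothetical, and INFO merely stands in for "visible at the default log level".

```python
import logging
from typing import Iterable

# Hypothetical logger name for illustration; not a real Ceph subsystem.
logger = logging.getLogger("mon_sketch")


def format_down_reason(target_osd: int, reporters: Iterable[int]) -> str:
    """Build a one-line explanation of why an OSD was marked down.

    The message names the affected OSD and every OSD that reported it,
    which is the information the report asks for.
    """
    reporter_ids = sorted(reporters)
    reporter_names = ", ".join(f"osd.{r}" for r in reporter_ids)
    return (
        f"osd.{target_osd} marked down: "
        f"{len(reporter_ids)} failure report(s) from {reporter_names}"
    )


def mark_down(target_osd: int, reporters: Iterable[int]) -> str:
    """Log the reason at INFO (assumed default level here) and return it."""
    message = format_down_reason(target_osd, reporters)
    logger.info(message)
    return message
```

With a line like this in the monitor log, an operator could see which peers reported the OSD without first raising the debug level.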

Impact: Troubleshooting on production clusters is always done under time pressure, so having the reason reported often makes the difference between meeting SLAs and breaking them.
