Bug #45388

open

Insufficient monitor logging to diagnose downed OSDs

Added by Christian Huebner almost 4 years ago. Updated 10 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
low-hanging-fruit
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We just had a case in a Ceph Luminous cluster where the monitor forced newly started OSDs to commit suicide. Communication between the monitor and the OSDs was fine, but each affected OSD went down with a log message saying that the monitor had forced it to commit suicide. Only after increasing the debug level did we find that some OSDs had reported the OSD as down, and that the monitor had therefore forced the OSD process to stop.

If a monitor forces an OSD to commit suicide, the reason must be reported in the monitor log at the default log level, including which OSDs reported the OSD as down.
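
As a rough illustration of the request, the sketch below shows one way a default-level message could name the reporting OSDs. It is not Ceph code: `FailureReports`, `format_down_reason`, the `min_reporters` threshold, and the message format are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FailureReports:
    """Hypothetical record of which OSDs reported a target OSD as down."""
    target: int
    reporters: Set[int] = field(default_factory=set)

    def add_report(self, reporter: int) -> None:
        # Duplicate reports from the same OSD are counted once.
        self.reporters.add(reporter)


def format_down_reason(reports: FailureReports, min_reporters: int) -> Optional[str]:
    """Return a log line naming the reporters once the (assumed) threshold
    of distinct reporters is reached; return None below the threshold."""
    if len(reports.reporters) < min_reporters:
        return None
    names = ", ".join(f"osd.{r}" for r in sorted(reports.reporters))
    return (
        f"osd.{reports.target} marked down: reported failed by "
        f"{len(reports.reporters)} peer(s) (threshold {min_reporters}): {names}"
    )
```

With a line of this kind in the monitor log at the default level, an operator could see which peers triggered the action without first raising the debug level.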

Impact: Troubleshooting on production clusters is always done under time pressure, so having the reasons reported often makes the difference between maintaining SLAs and breaking them.

Actions #1

Updated by Nathan Cutler over 3 years ago

  • Tags set to low-hanging-fruit
Actions #2

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)
Actions #3

Updated by Laura Flores almost 2 years ago

  • Tags set to low-hanging-fruit
Actions #4

Updated by Laura Flores 10 months ago

  • Tags changed from low-hanging-fruit to low-hanging-fruit, open-source-day