Bug #45388: Insufficient monitor logging to diagnose downed OSDs - RADOS - Ceph

Bug #45388

open

Insufficient monitor logging to diagnose downed OSDs

Added by Christian Huebner about 4 years ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source: Community (user)
Tags: low-hanging-fruit
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We just had a case in a Ceph Luminous cluster where the monitor forced newly started OSDs to commit suicide. Communication between the monitor and the OSD was fine, but the OSD went down with a log message saying that the monitor had forced it to commit suicide. Only after increasing the debug level did we find that some OSDs had reported the OSD down, and that the monitor had therefore forced the OSD process to stop.

If a monitor forces an OSD to commit suicide, the reason must be reported in the monitor log at the default log level, including which OSDs reported the OSD down.

Impact: Troubleshooting on production clusters always happens under time pressure, so having the reasons reported often makes the difference between maintaining SLAs and breaking them.
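To illustrate the kind of message requested here, the following is a minimal sketch of a log line that names both the affected OSD and the OSDs that reported it down. It is not Ceph code: `FailureReport`, `describe_mark_down`, and the message wording are all hypothetical and only show what information the line would need to carry.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureReport:
    """Hypothetical record of the failure reports received for one OSD."""
    target_osd: int
    reporters: List[int] = field(default_factory=list)


def describe_mark_down(report: FailureReport) -> str:
    """Build one log line naming the target OSD and every reporting OSD."""
    # Sort the reporters so the message is stable and easy to grep.
    who = ", ".join(f"osd.{r}" for r in sorted(report.reporters)) or "none"
    return (
        f"marking osd.{report.target_osd} down: "
        f"{len(report.reporters)} failure report(s) from {who}"
    )


if __name__ == "__main__":
    print(describe_mark_down(FailureReport(target_osd=12, reporters=[7, 3])))
```

A line of this shape at the default log level would let an operator see immediately which peers triggered the action, without raising the debug level first.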

Updated by Nathan Cutler over 3 years ago

Updated by Sage Weil about 3 years ago

Updated by Laura Flores almost 2 years ago

Updated by Laura Flores 11 months ago