
Bug #41943

ceph-mgr fails to report OSD status correctly

Added by Brian Andrus 6 months ago. Updated 6 months ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

After an inexplicable cluster event in which around 10% of our OSDs were falsely reported down (and shortly after marked back up), we had an OSD that was seemingly not functioning correctly, yet health was reporting HEALTH_OK, which greatly prolonged the outage.

For some time after the storage blip we experienced strange, inconsistent behavior in a portion of our libvirt guests: some would boot okay but then go into 100% IOWAIT and crash shortly after, while others had random issues that prevented the boot process from completing. Some could be remedied by copying RBD image data to new images, but those copies were often held up indefinitely at some point in the copy process with no reported failure. If a copy did complete successfully, the VM would usually boot successfully.

One of our engineers restarted a ceph-mgr for an unrelated reason, and we then had HEALTH_WARN with 119 PGs reported inactive and no other information (Bug #23049 is a partial match for this issue). `ceph health detail` showed the 119 PGs as inactive since the mgr restart, and each of those PGs had no OSDs listed in its OSD list. `ceph pg map` quickly showed that they all had the same OSD as their primary; after kicking that OSD, the cluster was restored to full functionality and any remaining VMs having issues were immediately unblocked. RBD image copies completed without hanging from then on.
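The diagnostic step described above (spotting that all the inactive PGs mapped to one primary) can be sketched as a small script. This is only an illustration: the JSON shape is an assumption modeled on `ceph pg dump_stuck inactive -f json` output on Luminous, and `common_primaries` is a hypothetical helper, not part of any Ceph tooling.

```python
# Sketch: find the primary OSD(s) shared by a set of stuck-inactive PGs.
# Input format is an assumption modeled on `ceph pg dump_stuck inactive
# -f json`, where each entry carries an "acting_primary" field.
import json
from collections import Counter

def common_primaries(dump_json):
    """Return (osd_id, count) pairs, most common primary first."""
    entries = json.loads(dump_json)
    return Counter(e["acting_primary"] for e in entries).most_common()

# Hypothetical sample resembling three stuck PGs on the same primary:
sample = json.dumps([
    {"pgid": "2.1a", "acting_primary": 17},
    {"pgid": "2.3f", "acting_primary": 17},
    {"pgid": "4.07", "acting_primary": 17},
])

if __name__ == "__main__":
    print(common_primaries(sample))
```

If one OSD dominates the output, it is the obvious candidate to restart, as in the incident above.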

My best guess is that when write requests were sent to the down PGs / misbehaving OSD, they hung without ever appearing in ceph health status. The cluster did not seem to recognize, report, or log any blocked requests.
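One way to look for silently hung requests on a suspect OSD is its admin socket's in-flight op dump. The sketch below filters long-running ops from that output; the `"ops"`/`"age"` field names are assumptions modeled on Luminous-era `ceph daemon osd.N dump_ops_in_flight` JSON, and the sample data is invented.

```python
# Sketch: flag in-flight ops older than a threshold, given JSON in the
# shape of `ceph daemon osd.N dump_ops_in_flight` (field names assumed).
import json

def long_running_ops(dump_json, max_age_s=30.0):
    """Return ops whose age (seconds) exceeds max_age_s."""
    ops = json.loads(dump_json).get("ops", [])
    return [op for op in ops if op.get("age", 0.0) > max_age_s]

# Hypothetical sample: one op stuck for ~7 minutes.
sample = json.dumps({"ops": [
    {"description": "osd_op(client.4123 2.1a ...)", "age": 412.7},
    {"description": "osd_op(client.4123 4.07 ...)", "age": 0.3},
]})
```

In the incident above, health never surfaced slow requests, so polling the suspect OSD's socket directly might have been the only way to see them.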

Other info:

  • Recently upgraded from Jewel to Luminous; 25600 PGs, 3x replication, 225 OSDs (224 up, 222 in)
  • I am aware the PG count is approximately double what should be present in the cluster.
  • Approximately 1 hour prior to the still-unexplained cluster outage, the machine hosting the OSD that was rebooted experienced a kernel oops for an unknown reason.

History

#1 Updated by Greg Farnum 6 months ago

Sounds like this OSD was somehow up enough that it responded to peer heartbeats, but was not processing any client requests.

Presumably it also wasn't sending anything to the manager, but the pg stats hadn't timed out and gone unclean yet. Not sure if there's a good way to check their last-update timestamps.
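A rough sketch of the timestamp check described above: walk the per-PG stats and flag entries that have not been refreshed recently. The `"last_fresh"` field name and timestamp format are assumptions modeled on the pg_stats section of `ceph pg dump -f json`; the threshold and sample data are invented.

```python
# Sketch: flag PGs whose stats look stale, i.e. whose last-update
# timestamp ("last_fresh", an assumed field name from `ceph pg dump
# -f json` pg_stats) is older than max_age.
from datetime import datetime, timedelta

def stale_pgs(pg_stats, now, max_age=timedelta(minutes=5)):
    """Return pgids whose last_fresh timestamp is older than max_age."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    return [
        s["pgid"] for s in pg_stats
        if now - datetime.strptime(s["last_fresh"], fmt) > max_age
    ]

# Hypothetical sample: one recently refreshed PG, one stale PG.
now = datetime(2019, 9, 20, 12, 0, 0)
stats = [
    {"pgid": "2.1a", "last_fresh": "2019-09-20 11:58:30.000000"},
    {"pgid": "2.3f", "last_fresh": "2019-09-20 11:10:00.000000"},
]
```

Stale entries for PGs mapped to an "up" OSD would be one hint that the OSD is heartbeating but not actually doing work.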

#2 Updated by Neha Ojha 6 months ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to High

Do you have any other information from that OSD while this happened?
