Bug #54434

HDD OSDs crashing after Pacific upgrade

Added by Maximilian Stinsky about 2 years ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Over the last couple of months we upgraded all of our Ceph clusters from Nautilus to Pacific.
After upgrading our last cluster, which also hosts an S3 service on an EC pool backed by HDDs, we have the problem that every day a couple of those HDD OSDs crash.

What we can observe is that most of the time the OSDs crash in roughly the same timeframe every day.

As mentioned, this is only happening in one of our 5 clusters, and only on HDD OSDs.
We upgraded from 14.2.22 to 16.2.7; the upgrade is completely finished and no open tasks are left from the manual upgrade guide.
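
For completeness, a quick way to double-check that nothing from the upgrade is outstanding is something like the following (standard Ceph commands, shown only as a sketch):

```
# All daemons should report 16.2.7, and the cluster-wide
# compatibility flag should already be set to pacific.
ceph versions
ceph osd dump | grep require_osd_release
```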

The log message that seems to point at the reason for the OSDs crashing is `1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd2a1f44700' had timed out after 15.000000954s`.
We see a lot of those messages until the cluster marks the OSD down; after that everything goes back to normal and the OSD rejoins the cluster as healthy.
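
For context, the 15 s in that message matches the default of `osd_op_thread_timeout`. The snippet below is only a sketch of how the op-thread timeouts could be inspected and, as a temporary stopgap, raised at runtime; it is not a confirmed fix for the underlying slowness:

```
# Inspect the op worker thread timeouts (defaults: 15 s and 150 s).
ceph config get osd osd_op_thread_timeout
ceph config get osd osd_op_thread_suicide_timeout

# Temporarily raise the heartbeat timeout for all OSDs at runtime.
# Stopgap only: it hides the symptom, it does not fix the cause.
ceph config set osd osd_op_thread_timeout 30
```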

I attached a small part of the log of one crashing OSD. The log shows timestamps around 11:25, but the problem for that specific OSD started at 11:08 and repeated the same pattern for a couple of minutes until the OSD joined the cluster again.
The issue we are seeing always lasts for around 10-20 minutes, causing slow ops in the cluster and affecting several OSDs in that timeframe. The OSD failures seem to happen in a serial manner.
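
While an affected OSD is in that state, the following kind of data can be captured to see which ops are stuck (a sketch; `osd.12` is a placeholder ID, and the `ceph daemon` calls must be run on the node hosting that OSD):

```
# Cluster-wide view of the slow ops warnings.
ceph health detail

# On the host of the affected OSD: ops currently stuck in flight,
# plus a history of the slowest recent ops on that OSD.
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_slow_ops
```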


Files

osd-log.csv (814 KB), Maximilian Stinsky, 03/01/2022 11:15 AM