Bug #49353

open

Random OSDs being marked as down even when there is very little activity on the cluster - 14.2.2

Added by Nokia ceph-users about 3 years ago. Updated about 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We have recently seen random OSDs being marked down, with the message below, on one of our Nautilus clusters. No heavy activity was under way at the time; there was only light, constant read and write traffic on the cluster.

ceph.log-20201212.gz:2020-12-11 21:57:34.845446 mon.cn1 (mon.0) 113050 : cluster [INF] osd.266 marked down after no beacon for 902.300583 seconds

We have noticed, however, that the affected OSD process shows high CPU usage from the time it is marked down until we restart it. After the restart, CPU usage drops back to around 10% and the OSD is marked up. The OSD logs contain no messages indicating why CPU usage is high or whether that is the cause of the issue. In fact, the OSD log contains no messages at all from the time the OSD went down until it was brought back up.

Environment: 5 nodes running Nautilus 14.2.2, with 60 OSDs each.

Logs from one such failure are attached.
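
For anyone triaging the attached cluster logs, the beacon-timeout message quoted above can be picked out with a small script. The following is a minimal sketch, not part of Ceph: the names `BeaconTimeout` and `find_beacon_timeouts` are hypothetical, and the regular expression assumes every event follows the exact wording of the one log line quoted in this report.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class BeaconTimeout(NamedTuple):
    """One 'marked down after no beacon' event (hypothetical helper type)."""
    timestamp: str   # timestamp exactly as it appears in the cluster log
    osd_id: int      # numeric OSD id, e.g. 266 for osd.266
    seconds: float   # time without a beacon, as reported by the monitor


# Assumed format, taken from the single log line quoted in this report.
_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*?"
    r"osd\.(?P<osd>\d+) marked down after no beacon for "
    r"(?P<secs>\d+(?:\.\d+)?) seconds"
)


def find_beacon_timeouts(lines: Iterable[str]) -> Iterator[BeaconTimeout]:
    """Yield one BeaconTimeout per matching line; other lines are skipped."""
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            yield BeaconTimeout(
                match["ts"], int(match["osd"]), float(match["secs"])
            )


# The log line from this report, used as sample input.
sample = (
    "ceph.log-20201212.gz:2020-12-11 21:57:34.845446 mon.cn1 (mon.0) "
    "113050 : cluster [INF] osd.266 marked down after no beacon for "
    "902.300583 seconds"
)
events = list(find_beacon_timeouts([sample]))
```

To scan a compressed log such as the attached ceph.log-20210114.gz, the same function can be fed the lines of a file opened with `gzip.open(path, "rt")`. Collecting the OSD ids and timestamps this way makes it easier to match each event against the corresponding OSD log and CPU usage data.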


Files

ceph-osd.149.log-20210115 (1).gz (28.8 KB) Nokia ceph-users, 02/18/2021 09:44 AM
ceph.log-20210114.gz (912 KB) Nokia ceph-users, 02/18/2021 09:51 AM
MCN_CN5_ceph-osd.146.log (184 KB) Nokia ceph-users, 02/23/2021 12:22 PM