Bug #49353

open

Random OSDs being marked as down even when there is very little activity on the cluster - 14.2.2

Added by Nokia ceph-users about 3 years ago. Updated almost 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We have recently seen random OSDs being marked down with the message below on one of our Nautilus clusters. No major activity was under way at the time; there was only a small, constant stream of read and write traffic on the cluster.

ceph.log-20201212.gz:2020-12-11 21:57:34.845446 mon.cn1 (mon.0) 113050 : cluster [INF] osd.266 marked down after no beacon for 902.300583 seconds

We have noticed high CPU usage by the affected OSD process from the time it is marked down until we restart it. After the restart, CPU usage drops again (to around 10%) and the OSD is marked up. Nothing in the OSD logs indicates why CPU usage is high or whether that is the cause of the issue. In fact, the OSD log contains no messages at all from the time the OSD went down until it was brought back up.

Environment: 5-node Nautilus 14.2.2 cluster with 60 OSDs per node.

Logs from one such failure are attached.
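
To list every such event in a cluster log, something like the following could be used. It is only a minimal sketch: the regular expression is derived from the single log line quoted above, and the names `BeaconTimeout` and `find_beacon_timeouts` are made up for illustration. The reported gap of just over 900 seconds would be consistent with the monitor's `mon_osd_report_timeout` option, assuming it is still at its default of 900 seconds on this cluster.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class BeaconTimeout(NamedTuple):
    """One 'marked down after no beacon' event (hypothetical helper type)."""
    timestamp: str
    osd_id: int
    seconds: float


# Pattern built from the log line quoted in this report; other Ceph
# releases may format the cluster log differently.
_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ \(mon\.\d+\) \d+ : "
    r"cluster \[INF\] osd\.(?P<osd>\d+) marked down after no beacon for "
    r"(?P<secs>[\d.]+) seconds"
)


def find_beacon_timeouts(lines: Iterable[str]) -> Iterator[BeaconTimeout]:
    """Yield one record per matching line; non-matching lines are skipped."""
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            yield BeaconTimeout(
                match["ts"], int(match["osd"]), float(match["secs"])
            )


# The sample line is the one from this report, including the file name
# prefix left by grep.
sample = [
    "ceph.log-20201212.gz:2020-12-11 21:57:34.845446 mon.cn1 (mon.0) 113050 : "
    "cluster [INF] osd.266 marked down after no beacon for 902.300583 seconds",
]
events = list(find_beacon_timeouts(sample))
```

For a compressed log, the lines can be read with `gzip.open(path, "rt")` and passed to the same function.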


Files

ceph-osd.149.log-20210115 (1).gz (28.8 KB), Nokia ceph-users, 02/18/2021 09:44 AM
ceph.log-20210114.gz (912 KB), Nokia ceph-users, 02/18/2021 09:51 AM
MCN_CN5_ceph-osd.146.log (184 KB), Nokia ceph-users, 02/23/2021 12:22 PM
Actions #1

Updated by Igor Fedotov about 3 years ago

  • Status changed from New to Need More Info

osd.149 went down at 03:25:26
2021-01-14 03:25:25.974634 mon.cn1 (mon.0) 384654 : cluster [INF] osd.149 marked down after no beacon for 902.569519 seconds

But the attached OSD log doesn't cover that point in time:

2021-01-14 03:38:02.085 7f7bf8034700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 1845016) UID: 0
...
2021-01-15 03:20:01.318 7fed57589700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 4193911) UID: 0

Please provide the relevant OSD log snippet.

Not to mention that v14.2.2 is quite outdated; tons of bugs have been fixed since then...
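
Before attaching an OSD log, it may help to check that it actually covers the time the OSD was marked down. The sketch below assumes OSD log lines start with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as in the two lines quoted above, and that the log is in chronological order; `parse_ts` and `surrounding_entries` are hypothetical helper names.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple


def parse_ts(line: str) -> Optional[datetime]:
    """Parse the leading timestamp of an OSD log line, or return None."""
    try:
        # The first 23 characters hold 'YYYY-MM-DD HH:MM:SS.mmm'.
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


def surrounding_entries(
    lines: Iterable[str], when: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return the last timestamp at or before `when` and the first after it.

    Assumes the log is in chronological order. A None in the first slot
    means the log starts after `when`.
    """
    before = after = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a leading timestamp
        if ts <= when:
            before = ts
        else:
            after = ts
            break
    return before, after


# Timestamps taken from this comment: the monitor marked osd.149 down at
# 03:25:25, but the first OSD log line quoted is from 03:38:02.
osd_log = [
    "2021-01-14 03:38:02.085 7f7bf8034700 -1 received signal: Hangup",
    "2021-01-15 03:20:01.318 7fed57589700 -1 received signal: Hangup",
]
marked_down = datetime(2021, 1, 14, 3, 25, 25, 974634)
result = surrounding_entries(osd_log, marked_down)
```

Here the first element of `result` is `None`, meaning the log excerpt begins only after the OSD was marked down.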

Actions #2

Updated by Nokia ceph-users about 3 years ago

Hi, another occurrence:

2021-02-22 09:19:43.010071 mon.cn1 (mon.0) 267937 : cluster [INF] osd.146 marked down after no beacon for 902.366627 seconds

But I don't see any relevant log messages at the time of the issue.

Actions #3

Updated by Nokia ceph-users about 3 years ago

Nokia ceph-users wrote:

Hi, another occurrence:

2021-02-22 09:19:43.010071 mon.cn1 (mon.0) 267937 : cluster [INF] osd.146 marked down after no beacon for 902.366627 seconds

But I don't see any relevant log messages in the OSD log at the time of the issue.

Actions #4

Updated by Nokia ceph-users about 3 years ago

Do you suspect that this is specific to 14.2.2 and could be solved by upgrading to a newer version?

Actions #5

Updated by Igor Fedotov about 3 years ago

Nokia ceph-users wrote:

Do you suspect that this is specific to 14.2.2 and could be solved by upgrading to a newer version?

Hard to tell since there are no evident symptoms in the available logs - the last one shows no output for the specified time period and beyond.

But IMO an upgrade makes sense anyway, as tons of issues have been fixed since then. Hopefully the one you're facing is among them... Or a fresh release would provide more details for troubleshooting.

Actions #6

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)