Bug #49353
open
Random OSDs being marked down even when there is very little activity on the cluster - 14.2.2
0%
Description
Hi,
We have recently seen random OSDs being marked down with the message below on one of our Nautilus clusters. No heavy activity was in progress at the time; only a small, constant amount of read and write traffic was ongoing in the cluster.
ceph.log-20201212.gz:2020-12-11 21:57:34.845446 mon.cn1 (mon.0) 113050 : cluster [INF] osd.266 marked down after no beacon for 902.300583 seconds
However, we have noticed high CPU usage by the respective OSD process from the time it is marked down until we restart the OSD. Once we restart the OSD, CPU usage drops again (to around 10%) and the OSD is marked up. We don't have any messages in the OSD logs that could indicate why CPU usage is high or whether that is the cause of the issue. In fact, there are no messages in the OSD log from the time the OSD went down until the time it was brought back up.
Environment: 5-node Nautilus 14.2.2 cluster with 60 OSDs per node.
Logs of one such failure attached.
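For anyone triaging similar reports, here is a minimal Python sketch (not part of the original report) that pulls the "no beacon" events out of the monitor cluster log and estimates when the monitor last heard from the OSD. The function name `parse_beacon_timeout` and the returned field names are hypothetical. The regular expression is written only against the log line quoted above, so other Ceph versions may format the entry differently. The gap of roughly 902 seconds appears consistent with the monitor's `mon_osd_report_timeout` option, which to my knowledge defaults to 900 seconds.

```python
import re
from datetime import datetime, timedelta

# Pattern derived from the single cluster-log line quoted in this report;
# it is an assumption that other entries follow the same layout.
BEACON_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"mon\.\S+ \(mon\.\d+\) \d+ : cluster \[INF\] "
    r"osd\.(?P<osd>\d+) marked down after no beacon for "
    r"(?P<gap>\d+(?:\.\d+)?) seconds"
)


def parse_beacon_timeout(line):
    """Return details of a 'no beacon' event, or None if the line does not match."""
    m = BEACON_RE.search(line)
    if m is None:
        return None
    marked_down = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    gap = float(m.group("gap"))
    return {
        "osd": int(m.group("osd")),
        "marked_down": marked_down,
        "gap_seconds": gap,
        # Approximate time of the last beacon the monitor received.
        "last_beacon": marked_down - timedelta(seconds=gap),
    }
```

Applied to the entry quoted above, this places the last beacon from osd.266 at about 21:42:32, which is where the OSD log and the CPU usage graphs would be worth inspecting.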
Files
Updated by Igor Fedotov about 3 years ago
- Status changed from New to Need More Info
osd.149 went down at 03:25:26
2021-01-14 03:25:25.974634 mon.cn1 (mon.0) 384654 : cluster [INF] osd.149 marked down after no beacon for 902.569519 seconds
But the attached OSD log doesn't cover that point in time:
2021-01-14 03:38:02.085 7f7bf8034700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 1845016) UID: 0
...
2021-01-15 03:20:01.318 7fed57589700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 4193911) UID: 0
Please provide the relevant OSD log snippet.
Not to mention that v14.2.2 is quite outdated; tons of bugs have been fixed since then...
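As a rough aid for collecting the requested snippet, the following Python sketch (an illustration, not a Ceph tool) selects OSD log lines that fall between the estimated last beacon and the moment the monitor marked the OSD down. The helper names `osd_log_time` and `lines_in_window` and the 60-second margin are hypothetical. It assumes each OSD log line starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as in the lines quoted above.

```python
from datetime import datetime, timedelta


def osd_log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' stamp of an OSD log line.

    Returns None for lines without such a stamp (for example '...').
    """
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


def lines_in_window(lines, marked_down, gap_seconds, margin_seconds=60):
    """Keep log lines from (marked_down - gap - margin) to (marked_down + margin)."""
    start = marked_down - timedelta(seconds=gap_seconds + margin_seconds)
    end = marked_down + timedelta(seconds=margin_seconds)
    selected = []
    for line in lines:
        ts = osd_log_time(line)
        if ts is not None and start <= ts <= end:
            selected.append(line)
    return selected
```

For osd.149 (marked down at 2021-01-14 03:25:25.974634 after 902.569519 seconds), the window runs from roughly 03:09:23 to 03:26:25. Neither of the two OSD log lines quoted above falls inside it, so the function returns an empty list for them, which matches the observation that the attached log does not cover the failure.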
Updated by Nokia ceph-users about 3 years ago
- File MCN_CN5_ceph-osd.146.log added
Hi, another occurrence:
2021-02-22 09:19:43.010071 mon.cn1 (mon.0) 267937 : cluster [INF] osd.146 marked down after no beacon for 902.366627 seconds
But I don't see any relevant log messages at the time of the issue.
Updated by Nokia ceph-users about 3 years ago
Nokia ceph-users wrote:
Hi, another occurrence:
2021-02-22 09:19:43.010071 mon.cn1 (mon.0) 267937 : cluster [INF] osd.146 marked down after no beacon for 902.366627 seconds
But I don't see any relevant log messages in the OSD log at the time of the issue.
Updated by Nokia ceph-users about 3 years ago
Do you suspect that this is something relevant to 14.2.2 and could be solved with a higher version?
Updated by Igor Fedotov about 3 years ago
Nokia ceph-users wrote:
Do you suspect that this is something relevant to 14.2.2 and could be solved with a higher version?
Hard to tell since there are no evident symptoms in the available logs - the last one shows no output for the specified time period and beyond.
But IMO an upgrade makes sense anyway, as tons of issues have been fixed since then. Hopefully the one you're facing is among them... Or a fresh release might provide more details for troubleshooting.
Updated by Sage Weil almost 3 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)