Bug #50950: MIMIC OSD very high CPU usage(3xx%), stop responding to other osd, causing PG stuck at peering - RADOS - Ceph

closed

Added by Bin Guo almost 3 years ago. Updated almost 3 years ago.

Status: Won't Fix

Priority: Normal

Severity: 2 - major

Affected Versions: Ceph - v13.2.4

Description

I have been running this Mimic cluster (about 530 OSDs) for over a year. Recently I found that some OSDs randomly go into a busy loop with very high CPU usage (300% to 400%, which stays within the Pod resource limit). Meanwhile, these OSDs stop responding to any messages from outside, and the cluster status shows some PGs stuck in the peering state.

All of the problems above disappear after about 3 to 4 hours, and then everything goes back to normal. I can't reproduce this, but it has happened 3 times.

Any help will be appreciated!

Updated by Bin Guo almost 3 years ago

Updated by Bin Guo almost 3 years ago

Updated by Neha Ojha almost 3 years ago