Bug #48297

OSD process using up complete available memory after pg_num change / autoscaler on

Added by Tobias Fischer over 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 2 - major

Description

We made the following change on our cluster (cephadm, Octopus 15.2.5):

ceph osd pool set one pg_num 512
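
For reference, the pool's PG counts and the autoscaler state can be checked with commands along these lines ("one" is the pool name from above):

ceph osd pool get one pg_num
ceph osd pool get one pgp_num
ceph osd pool autoscale-status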

After some time the cluster became unstable. Some analysis showed that several of our nodes had gotten stuck due to memory pressure, with the OOM killer kicking in.
We were able to stabilize the cluster after setting a memory limit on the OSD Docker containers.
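
For completeness: a container limit is a hard cap, while Ceph's own knob is osd_memory_target, which only steers the OSD's cache sizing and is not a hard limit. Something along these lines, with 4 GiB as a purely illustrative value:

ceph config set osd osd_memory_target 4294967296   # ~4 GiB, example value only
ceph config get osd osd_memory_target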

Now we have 3 OSDs left, on 3 different nodes, that are unable to start and rejoin the cluster: after a couple of minutes they hit the configured memory limit (tested up to 30 GB) and get killed.
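
One way to take load off such an OSD while it tries to come up, and to see where its memory goes, might be the following (flags and values are only an illustration; dump_mempools answers only while the daemon is actually running):

ceph osd set norebalance
ceph osd set nobackfill
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph tell osd.<id> dump_mempools   # <id> = one of the affected OSDs

The norebalance/nobackfill flags would have to be unset again afterwards (ceph osd unset norebalance / ceph osd unset nobackfill).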

Our setup:

5 nodes, orchestrated by cephadm: 4 nodes with 8 OSDs each (48 GB RAM) and 1 node with 12 OSDs (64 GB RAM).

Logs: https://storage.clyso.com/s/865ReZS4MKMnFp9

History

#1 Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
