Bug #55842

open

Upgrading to 16.2.9 with 9M strays files causes MDS OOM

Added by Arthur Outhenin-Chalandre almost 2 years ago. Updated 8 months ago.

Status: Triaged
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

Yesterday we upgraded one of our test CephFS clusters from 16.2.6 to 16.2.9. This cluster was mainly used to test and stress test snapshots in CephFS.
This cluster had about 9M strays and a few connected clients, but little activity. We have only one active MDS, and when we upgraded it, it went OOM. We colocate MDS and OSD in this case, so we stopped the OSD to free up some RAM for the MDS. The MDS then started successfully, using 70G of memory at its peak, after which the resident memory went back to about 1G.

I don't have much evidence for this, but since this cluster had a huge number of strays, and that is mainly the only specific thing about it, I suspect that PR https://github.com/ceph/ceph/pull/44342 caused this issue. Also, the number of stray files decreased to 0 after the upgrade, so I believe we shouldn't have this problem anymore...
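For anyone wanting to check the stray count on their own cluster before upgrading, here is a minimal sketch that reads it from an MDS perf dump. It assumes the dump is JSON with a `num_strays` counter under the `mds_cache` section; the function name `stray_count` and the sample input are hypothetical and are not output captured from the cluster described above.

```python
import json


def stray_count(perf_dump_json: str) -> int:
    """Return the stray counter from an MDS perf dump given as a JSON string."""
    perf = json.loads(perf_dump_json)
    # Assumption: the counter is exposed as mds_cache.num_strays.
    return int(perf["mds_cache"]["num_strays"])


# Illustrative input only, shaped like the assumed perf dump layout.
sample = '{"mds_cache": {"num_strays": 9000000}}'
print(stray_count(sample))
```

The JSON would typically come from something like `ceph daemon mds.<name> perf dump` run on the MDS host, with `<name>` replaced by the local MDS name.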
