Bug #54253


Avoid OOM exceeding 10x MDS cache limit on restart after many files were opened

Added by Niklas Hambuechen about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Today I had a multi-hour CephFS outage due to a bug that I believe was discussed in various mailing lists and posts already, but not put into the issue tracker yet.

I had started an rsync of 50M files from a single directory of a legacy Ceph 13.2.2 cluster to a Ceph 16.2.7 cluster.

The next MDS restart of the Ceph 13 cluster completely killed it: any MDS would start, stay in `up:replay` state in `ceph status` for 3 minutes, then go into `up:rejoin` state, allocate all 64 GB of the machine's memory, and get OOM-killed. This would repeat forever.

The `mds cache memory limit` was set to 10 GB; certainly less than 64 GB.
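
For context, the limit and the actual memory use of the MDS can be checked roughly like this (a sketch; `mds.<id>` is a placeholder for the actual daemon name):

ceph config get mds mds_cache_memory_limit    # configured limit, in bytes
ceph daemon mds.<id> cache status             # cache usage as reported by the daemon itself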

The same behaviour has been observed in the following threads:

As suggested in the 3rd link, I enabled swap; with this, the `up:rejoin` phase grew to 100 GB of memory usage (10x the configured cache limit, as in the 2nd link), but then got stuck there and never recovered. It would eventually print `rejoin_done` in the MDS log, but `ceph status` would not improve.

As in the 3rd link, I got `heartbeat_map is_healthy 'MDSRank' had timed out after 15` messages, and as suggested there I set `ceph config set global mds_beacon_grace 600000` (up from the default of 15 seconds), but this did not help: the messages disappeared from the log, but as before, `rejoin_done` would be reached while `ceph status` did not improve.

I eventually applied the suggestion from all the above links:

rados rm -p cephfs_metadata mds0_openfiles.0

This fixed the problem.

(This is despite the fact that the file had size 0 according to `rados stat`.)
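
For reference, the size check was done roughly like this (pool and object name as in the workaround command above):

rados -p cephfs_metadata stat mds0_openfiles.0    # reported a size of 0 here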

After that, the MDS would still go into `up:replay` state in `ceph status` for 3 minutes, but then reach `up:active`.


I am reporting this as a bug even though I found a workaround, because it cost me a large amount of downtime, and I think Ceph could improve here so that other users do not have to research equally long to find the workaround.

(I understand that the cluster on which this happened runs an older Ceph version, but the linked threads are newer, so I assume it is still relevant for newer versions.)

The main issue is that Ceph apparently exceeds the configured cache memory limit by 10x. On most systems this cannot work without triggering out-of-memory kills, so I wonder whether it makes sense for Ceph to even try.

If, as it seems, this process reliably takes roughly 10x the configured cache size, perhaps Ceph could compute whether that cannot possibly fit into RAM, and if so, warn the user in `ceph status` that this is a current problem?
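
To illustrate the kind of pre-check I mean, here is a minimal shell sketch, assuming the observed ~10x factor and that `ceph config get` prints the limit in bytes:

# Warn if ~10x the configured MDS cache limit cannot fit into this machine's RAM.
limit=$(ceph config get mds mds_cache_memory_limit)
mem_total=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)    # kB -> bytes
if [ $((limit * 10)) -gt "$mem_total" ]; then
    echo "WARNING: MDS rejoin may need ~$((limit * 10 / 1024 / 1024 / 1024)) GB, but the machine only has $((mem_total / 1024 / 1024 / 1024)) GB RAM"
fi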

Next, there is no indication in the MDS logs (at the default level) about what Ceph is doing when this happens. If Ceph is performing some operation, e.g. enumerating previously opened files (as suggested by some of the links; I cannot judge whether that is fully accurate), some form of log or progress report on it would help the debugging admin a lot.
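
As a stop-gap, an admin can at least raise the MDS debug level to get some visibility into what the daemon is doing during rejoin; a sketch (level 10 is an arbitrary choice, higher levels get very verbose):

ceph config set mds debug_mds 10    # stored in the mon config database, so it also applies to restarted MDS daemons
ceph config rm mds debug_mds        # revert once done debugging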

Finally, according to the threads, the solution of deleting `mds0_openfiles.0` is "always safe". If that is true, could Ceph do this automatically when it detects this situation?
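
Until something like that exists, here is a purely illustrative sketch of the manual cleanup the threads describe (object names follow the `mds<rank>_openfiles.<n>` pattern; I can only repeat the threads' claim that deleting them is safe):

rados -p cephfs_metadata ls | grep '_openfiles\.'    # inspect which openfiles objects exist
rados -p cephfs_metadata rm mds0_openfiles.0         # remove the one(s) for the affected rank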


Related issues 1 (1 open, 0 closed)

Related to CephFS - Bug #54271: mds/OpenFileTable.cc: 777: FAILED ceph_assert(omap_num_objs == num_objs) (Triaged, assigned to Kotresh Hiremath Ravishankar)
