Bug #54253


Avoid OOM exceeding 10x MDS cache limit on restart after many files were opened

Added by Niklas Hambuechen about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Today I had a multi-hour CephFS outage due to a bug that I believe was discussed in various mailing lists and posts already, but not put into the issue tracker yet.

I had started an rsync of 50M files from a single directory of a legacy Ceph 13.2.2 cluster to a Ceph 16.2.7 cluster.

The next MDS restart of the Ceph 13 cluster completely killed it: any MDS would start, stay in `up:replay` state in `ceph status` for 3 minutes, then go into `up:rejoin` state, allocate all 64 GB of the machine's memory, and get OOM-killed. This would repeat forever.

The `mds cache memory limit` was set to 10 GB; certainly less than 64 GB.
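
For context, the limit and the actual memory use of the MDS can be checked roughly like this (a sketch; `mds.<id>` is a placeholder for the actual daemon name):

ceph config get mds mds_cache_memory_limit    # configured limit, in bytes
ceph daemon mds.<id> cache status             # cache usage as reported by the daemon itself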

The same behaviour has been observed in the following threads:

As suggested in the 3rd link, I enabled swap; with this, the `up:rejoin` phase grew to 100 GB of memory usage (10x the configured cache limit, as in the 2nd link), but then got stuck there and never recovered. It would eventually print `rejoin_done` in the MDS log, but `ceph status` would not improve.

As in the 3rd link, I got `heartbeat_map is_healthy 'MDSRank' had timed out after 15` messages, and as suggested there I set `ceph config set global mds_beacon_grace 600000` (up from the default of 15 seconds), but this did not help: the messages disappeared from the log, but as before, `rejoin_done` would be reached while `ceph status` did not improve.

I eventually applied the suggestion from all the above links:

rados rm -p cephfs_metadata mds0_openfiles.0

This fixed the problem.

(This is despite the fact that the file had size 0 according to `rados stat`.)
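
For reference, the size check was done roughly like this (pool and object name as in the workaround command above):

rados -p cephfs_metadata stat mds0_openfiles.0    # reported a size of 0 here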

After that, the MDS would still go into `up:replay` state in `ceph status` for 3 minutes, but then reach `up:active`.


I am reporting this as a bug even though I found a workaround, because it cost me a large amount of downtime, and I think Ceph could improve here so that other users do not have to research equally long to find the workaround.

(I understand that the cluster on which this happened runs an older Ceph version, but the linked threads are newer, so I assume it is still relevant for newer versions.)

The main issue is that Ceph apparently exceeds the configured cache memory limit by 10x. On most systems this cannot work without triggering out-of-memory kills, so I wonder whether it makes sense for Ceph to even try.

If, as it seems, this process reliably takes roughly 10x the configured cache size, perhaps Ceph could compute whether that cannot possibly fit into RAM, and if so, warn the user in `ceph status` that this is a current problem?
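
To illustrate the kind of pre-check I mean, here is a minimal shell sketch, assuming the observed ~10x factor and that `ceph config get` prints the limit in bytes:

# Warn if ~10x the configured MDS cache limit cannot fit into this machine's RAM.
limit=$(ceph config get mds mds_cache_memory_limit)
mem_total=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)    # kB -> bytes
if [ $((limit * 10)) -gt "$mem_total" ]; then
    echo "WARNING: MDS rejoin may need ~$((limit * 10 / 1024 / 1024 / 1024)) GB, but the machine only has $((mem_total / 1024 / 1024 / 1024)) GB RAM"
fi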

Next, there is no indication in the MDS logs (at the default level) about what Ceph is doing when this happens. If Ceph is performing some operation, e.g. enumerating previously opened files (as suggested by some of the links; I cannot judge whether that is fully accurate), some form of log or progress report on it would help the debugging admin a lot.
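
As a stop-gap, an admin can at least raise the MDS debug level to get some visibility into what the daemon is doing during rejoin; a sketch (level 10 is an arbitrary choice, higher levels get very verbose):

ceph config set mds debug_mds 10    # stored in the mon config database, so it also applies to restarted MDS daemons
ceph config rm mds debug_mds        # revert once done debugging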

Finally, according to the threads, the solution of deleting `mds0_openfiles.0` is "always safe". If that is true, could Ceph do this automatically when it detects this situation?
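
Until something like that exists, here is a purely illustrative sketch of the manual cleanup the threads describe (object names follow the `mds<rank>_openfiles.<n>` pattern; I can only repeat the threads' claim that deleting them is safe):

rados -p cephfs_metadata ls | grep '_openfiles\.'    # inspect which openfiles objects exist
rados -p cephfs_metadata rm mds0_openfiles.0         # remove the one(s) for the affected rank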


Related issues 1 (1 open, 0 closed)

Related to CephFS - Bug #54271: mds/OpenFileTable.cc: 777: FAILED ceph_assert(omap_num_objs == num_objs) (Triaged, assigned to Kotresh Hiremath Ravishankar)
