Bug #54253

Avoid OOM exceeding 10x MDS cache limit on restart after many files were opened

Added by Niklas Hambuechen about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Today I had a multi-hour CephFS outage due to a bug that I believe was discussed in various mailing lists and posts already, but not put into the issue tracker yet.

I had started an rsync of 50M files from a single directory of a legacy Ceph 13.2.2 cluster to a Ceph 16.2.7 cluster.

The next MDS restart of the Ceph 13 cluster completely killed it: Any MDS would start, go into `up:replay` state in `ceph status` for 3 minutes, and then go into `up:rejoin` state, allocating all 64 GB memory of the machine, thus getting OOM-killed. This would repeat forever.

The `mds cache memory limit` was set to 10 GB; certainly less than 64 GB.
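For reference, a limit like that is typically set along these lines (a sketch only, values in bytes; the exact mechanism on this cluster may differ, e.g. via ceph.conf on Ceph 13):

ceph config set mds mds_cache_memory_limit 10737418240   # 10 GiB
ceph config get mds mds_cache_memory_limit               # verify the stored value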

The following threads have observed the same behaviour:

As suggested in the 3rd link, I enabled swap; with this, memory usage during `up:rejoin` grew to 100 GB (10x the configured cache limit, as in the 2nd link), but then got stuck there and never recovered. The MDS would eventually print `rejoin_done` in its log, but `ceph status` would not improve.

As in the linked threads, I got `heartbeat_map is_healthy 'MDSRank' had timed out after 15` messages, and as suggested there I set `ceph config set global mds_beacon_grace 600000` (changing away from the default of 15 seconds). This did not help: the messages disappeared from the log, but as before `rejoin_done` would be reached while `ceph status` did not improve.
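For completeness, a sketch of checking that setting and reverting it again afterwards (assuming the cluster's centralized config database is available):

ceph config get mds mds_beacon_grace            # effective value (default: 15)
ceph config set global mds_beacon_grace 600000
ceph config rm global mds_beacon_grace          # revert to the default later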

I eventually applied the suggestion from all the above links:

rados rm -p cephfs_metadata mds0_openfiles.0

This fixed the problem.

(This is despite the fact that the file had size 0 according to `rados stat`.)

After removing the object, the MDS would still go into `up:replay` state in `ceph status` for 3 minutes, but then reach `up:active`.
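The size-0 observation makes sense if the open file table lives in the object's omap rather than in its data; a sketch of how to check (assuming the metadata pool is called `cephfs_metadata` as above):

rados -p cephfs_metadata stat mds0_openfiles.0                  # reports size 0
rados -p cephfs_metadata listomapkeys mds0_openfiles.0 | wc -l  # number of tracked entries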


I am reporting this as a bug even though I found a workaround, because it cost me a large amount of downtime, and I think Ceph could improve this so that the same won't happen to other users, who might otherwise spend just as long researching this workaround.

(I understand that the cluster on which this happened to me runs an older Ceph version, but the linked threads are newer, so I assume it is still relevant for newer versions.)

The main issue is that Ceph apparently exceeds the configured cache memory limit by 10x. This cannot possibly work without out-of-memory kills happening on most systems, so I wonder whether it makes sense for Ceph to even try.

If, as it seems, this process reliably takes almost exactly 10x the cache size, perhaps Ceph could detect that this cannot possibly fit into RAM, and if so, warn the user in `ceph status` that this is a current problem?
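As a rough illustration of the kind of check I mean (a sketch only; the 10x factor is just what I observed, not a documented constant):

# compare 10x the configured MDS cache limit against the machine's RAM
limit=$(ceph config get mds mds_cache_memory_limit)
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "rejoin may need ~$((limit * 10 / 1024 / 1024 / 1024)) GiB, machine has $((mem_kb / 1024 / 1024)) GiB"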

Next, there is no indication in the MDS logs (at the default level) of what Ceph is doing when this happens. If Ceph is performing some operation, e.g. enumerating previously opened files (as suggested by some of the links; I can't judge whether that is fully accurate), some form of log or progress report on it might help the debugging admin a lot.

Finally, according to the threads, the solution to delete `mds0_openfiles.0` is "always safe". If that is true, could Ceph do this automatically if it detects this situation?


Related issues (1 open, 0 closed)

Related to CephFS - Bug #54271: mds/OpenFileTable.cc: 777: FAILED ceph_assert(omap_num_objs == num_objs) (Triaged; assignee: Kotresh Hiremath Ravishankar)

Actions #1

Updated by Xiubo Li about 2 years ago

Actions #2

Updated by Niklas Hambuechen about 2 years ago

Thanks! That sounds like it might, yes.

From that, it seems the related bugs are (I don't think I have Redmine permission to link them in):

I'll leave it to Ceph devs to decide if that PR defaulting the option to false is enough, or whether it'd be worth implementing some more detailed logging/warning as I suggested in the issue description.

If it's worth it, this issue could be a tracking issue for that; if not, feel free to close!

Actions #3

Updated by Xiubo Li about 2 years ago

Niklas Hambuechen wrote:

Today I had a multi-hour CephFS outage due to a bug that I believe was discussed in various mailing lists and posts already, but not put into the issue tracker yet.

[...]

Finally, according to the threads, the solution to delete `mds0_openfiles.0` is "always safe". If that is true, could Ceph do this automatically if it detects this situation?

Not very sure whether this would introduce bugs like https://tracker.ceph.com/issues/53504.

Actions #4

Updated by Dan van der Ster about 2 years ago

Niklas, you don't have to wait for that PR -- just do `ceph config set mds mds_oft_prefetch_dirfrags false` now.

For our CephFS clusters, it reduces the rejoin step from ~10 mins to less than 1 min, and the memory usage, which used to balloon to 2x the cache size, now stays well below that with this config.
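For anyone else landing here, a sketch of applying and verifying this (whether a running MDS picks it up without a restart may depend on the release):

ceph config set mds mds_oft_prefetch_dirfrags false
ceph config get mds mds_oft_prefetch_dirfrags   # should print "false"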

Actions #5

Updated by Niklas Hambuechen about 2 years ago

Thanks Dan, I will add it to the config of the Ceph 16 cluster.

Unfortunately I can't use it for the source cluster I'm copying from, because Ceph 13 doesn't have the patch; only Ceph >= 16 does (and there's a backport for Ceph 15): https://github.com/ceph/ceph/commit/cc19fc624b1ee4d7e3248d1dfc8f89f8879a46bf

So I guess I'll have to use

rados rm -p cephfs_metadata mds0_openfiles.0

on restarts of the MDS on the Ceph 13 cluster until it's fully replaced by the Ceph 16 cluster.
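Roughly like this before each restart (a sketch; with a single active MDS there is only rank 0's table, but I assume multi-rank setups would also have mds1_openfiles.0 etc., and large tables may spill into additional .1, .2 objects):

rados -p cephfs_metadata ls | grep '^mds0_openfiles' | while read obj; do
    rados -p cephfs_metadata rm "$obj"
done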

Actions #6

Updated by Xiubo Li about 2 years ago

  • Related to Bug #54271: mds/OpenFileTable.cc: 777: FAILED ceph_assert(omap_num_objs == num_objs) added
Actions #7

Updated by Venky Shankar about 2 years ago

Niklas,

Were you able to get things to a stable state after following your note https://tracker.ceph.com/issues/54253#note-5?

If yes, should this tracker be marked as resolved?

Cheers,
Venky

Actions #8

Updated by Niklas Hambuechen about 2 years ago

Hey Venky,

yes, the workaround fixes my Ceph 13 cluster (until the next restart).

As for whether it should be marked as resolved, see https://tracker.ceph.com/issues/54253#note-2 above:

I'll leave it to Ceph devs to decide if that PR defaulting the option to false is enough, or whether it'd be worth implementing some more detailed logging/warning as I suggested in the issue description.

If it's worth it, this issue could be a tracking issue for that; if not, feel free to close!

I think it would be helpful if:

  • the documentation of the new `mds_oft_prefetch_dirfrags` option could be extended to warn users that dirfrag prefetching can create excessive memory use (because while switching the default fixes the issue, it would still be very helpful for others to know about the big impact of that option, should they switch it on)
  • "some form of log or progress report on that might help the debugging admin a lot" could be added to the prefetching code, like I suggested in the issue description

But whether these should really be done or whether this issue is the right place to track this isn't my call to make -- my immediate problem was solved, so from my side it's OK to close as resolved.

Actions #9

Updated by Niklas Hambuechen over 1 year ago

Unfortunately I must report that I'm still hitting this issue even with Ceph 16.2.7 and

[global]
mds_oft_prefetch_dirfrags = false

Today, after moving many files around and restarting the MDS, all MDS daemons would go OOM at > 10x the configured MDS cache size.

As before, the solution was

rados rm -p cephfs_metadata mds0_openfiles.0

and restarting the MDS.

So I think this problem is not solved yet.
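In case it helps others anticipate this: before restarting an MDS, the size of the open file table can be inspected (a sketch; in my case it is presumably the omap key count of these objects that balloons during rejoin):

for obj in $(rados -p cephfs_metadata ls | grep '_openfiles'); do
    echo "$obj: $(rados -p cephfs_metadata listomapkeys "$obj" | wc -l) omap keys"
done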
