Bug #64348: mds: possible memory leak in up:rejoin when opening cap inodes (from OFT) - CephFS - Ceph

Bug #64348

open

mds: possible memory leak in up:rejoin when opening cap inodes (from OFT)

Added by Venky Shankar 3 months ago. Updated 3 months ago.

Status: Triaged
Priority: High
Assignee: Leonid Usov
Category: Performance/Resource Usage
Target version: Ceph - v19.0.0
% Done:
Source:
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This seems to happen when the OFT (open file table) contains entries for which the MDS prefetches inodes. The config option mds_oft_prefetch_dirfrags, which is disabled by default, only controls prefetching of dirfrags; the OFT still prefetches inodes, and there seems to be a memory leak somewhere in that path. Our qa suite apparently does not exercise this path; otherwise we would probably have noticed the leak in the valgrind tests.
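To confirm how the option mentioned above is set on a given cluster, something like the following sketch could be used. It assumes the `ceph` CLI is installed and authorized on the host, and it only reports the dirfrag prefetch setting; per the description, inode prefetching from the OFT is not affected by it. The helper names are hypothetical and not part of Ceph.

```python
import subprocess
from typing import List

# Option name taken from this report; the helper names below are illustrative only.
OPTION = "mds_oft_prefetch_dirfrags"


def config_get_cmd(option: str = OPTION, who: str = "mds") -> List[str]:
    """Build the argument list for `ceph config get <who> <option>`."""
    return ["ceph", "config", "get", who, option]


def parse_bool(value: str) -> bool:
    """Interpret the textual value printed by the CLI as a boolean."""
    return value.strip().lower() in ("true", "1", "yes", "on")


def prefetch_dirfrags_enabled() -> bool:
    """Query the cluster (requires a working `ceph` CLI and keyring)."""
    result = subprocess.run(
        config_get_cmd(), check=True, capture_output=True, text=True
    )
    return parse_bool(result.stdout)
```

Only `prefetch_dirfrags_enabled()` talks to the cluster; the other two helpers are pure functions and can be checked offline.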

The memory leak causes the MDS to get OOM-killed (partly also because the cache limits aren't really taken into consideration in this state). This was observed in a couple of user clusters. Unfortunately, the logs didn't provide any hints other than the MDS prefetching inodes from the OFT and the MDS RSS hitting the node memory limit.
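Since the logs gave little to go on, one way to gather more data is to sample the MDS RSS while the daemon is in up:rejoin. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is readable; the helper names and the idea of comparing against a fixed memory budget are illustrative assumptions, not something Ceph provides.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical helpers for illustration; none of these names come from Ceph.


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_fraction(rss_kib: int, limit_kib: int) -> float:
    """Fraction of the memory budget currently used by the process."""
    if limit_kib <= 0:
        raise ValueError("limit_kib must be positive")
    return rss_kib / limit_kib


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current RSS of a running process (Linux only)."""
    return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())
```

Calling `read_rss_kib()` on the ceph-mds PID at a fixed interval and logging `rss_fraction()` against the node or container memory limit would show how quickly memory grows during rejoin.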

Related issues 1 (1 open — 0 closed)

Updated by Venky Shankar 3 months ago

Status changed from New to Triaged
Assignee set to Leonid Usov

Updated by Venky Shankar 3 months ago

This was discussed in the CephFS standup yesterday. The following items should (at a minimum) be investigated:

- This issue was seen in Pacific clusters. Although Pacific is EOL, the bug might exist in supported releases (Quincy and Reef at this point in time), so there is merit in investigating it.
- Inspect our qa tests to check whether there is adequate coverage with a populated OFT, with or without valgrind.
- Also check whether adequate debug logs are in place for the up:rejoin state (the state where this issue occurs). This can be tricky, since overpopulating the logs degrades everything else.

Updated by Venky Shankar about 2 months ago

Related to Bug #64717: MDS stuck in replay/resolve use added
