Bug #64348

open

mds: possible memory leak in up:rejoin when opening cap inodes (from OFT)

Added by Venky Shankar 3 months ago. Updated 3 months ago.

Status: Triaged
Priority: High
Assignee:
Category: Performance/Resource Usage
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This seems to happen when the OFT (open file table) contains entries for which the MDS prefetches inodes. The config option mds_oft_prefetch_dirfrags, which is disabled by default, only controls prefetching of dirfrags; the OFT still prefetches inodes, and there seems to be a memory leak somewhere (this scenario isn't tested in our qa suite, otherwise we would probably have noticed it in the valgrind tests).
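
For anyone checking how the option is set on an affected cluster, here is a minimal Python sketch that wraps the `ceph config get` CLI. The helper names (`config_get_argv`, `read_option`) are hypothetical and not part of Ceph; the sketch assumes the `ceph` binary is on the PATH with a usable keyring, and returns None otherwise.

```python
import shutil
import subprocess

# Option name taken from the description above.
OPTION = "mds_oft_prefetch_dirfrags"


def config_get_argv(who="mds", option=OPTION):
    """Build the argv for `ceph config get <who> <option>` (hypothetical helper)."""
    return ["ceph", "config", "get", who, option]


def read_option(who="mds", option=OPTION):
    """Return the configured value as a string, or None if it cannot be read.

    Assumes the `ceph` CLI is installed and can reach the cluster.
    """
    if shutil.which("ceph") is None:
        return None
    try:
        result = subprocess.run(
            config_get_argv(who, option),
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

Calling `read_option()` on an admin node would report the current value. As noted above, this option only governs dirfrag prefetching, so it is not a workaround for the inode prefetch path.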

The memory leak causes the MDS to get OOM-killed (partly also because the cache limits aren't really taken into consideration in this state). This was observed in a couple of user clusters. Unfortunately, the logs didn't provide any hints beyond showing the MDS prefetching inodes from the OFT and the MDS RSS hitting the node memory limit.
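
Since the logs gave few hints, tracking the MDS resident set size while the daemon sits in up:rejoin may help confirm the growth. The following is a rough sketch, assuming a Linux host where `/proc/<pid>/status` is readable; the helper names and the 80% threshold are illustrative assumptions and not part of Ceph.

```python
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the current RSS (kB) of a process; Linux only."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kb(status_file.read())


def near_limit(rss_kb, limit_kb, threshold=0.8):
    """True when RSS has reached the given fraction of the memory limit.

    The 0.8 default is an arbitrary illustrative value.
    """
    return rss_kb >= threshold * limit_kb
```

Sampling `read_rss_kb(mds_pid)` periodically and comparing it against the node memory limit with `near_limit` would show whether RSS keeps climbing during rejoin.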


Related issues 1 (1 open, 0 closed)

Related to CephFS - Bug #64717: MDS stuck in replay/resolve use (New, Milind Changire)

Actions #1

Updated by Venky Shankar 3 months ago

  • Status changed from New to Triaged
  • Assignee set to Leonid Usov
Actions #2

Updated by Venky Shankar 3 months ago

This was discussed in the cephfs standup yesterday. At a minimum, the following items should be investigated:

- This issue was seen in pacific clusters. Although pacific is EOL, the bug might exist in supported releases (quincy and reef at this point in time), so there is merit in investigating it.
- Inspect our qa tests to check whether there is adequate coverage with a populated OFT, with or without valgrind.
- Also check whether adequate debug logs are in place for the up:rejoin state (the state where this issue occurs). This can be tricky, since overpopulating the logs degrades everything else.

Actions #3

Updated by Venky Shankar about 2 months ago

  • Related to Bug #64717: MDS stuck in replay/resolve use added