Bug #48673

High memory usage on standby replay MDS

Added by Daniel Persson about 1 year ago. Updated 15 days ago.

Status:
In Progress
Priority:
Normal
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi.

We have recently installed a Ceph cluster with about 27M objects; the filesystem holds roughly 15M files.

The MDS is configured with a 20 GB mds_cache_memory_limit. On the active node (node4), memory usage stays a bit above the limit, but not excessively so.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
2165668 ceph      20   0   27.6g  26.1g  22088 S  12.3  13.9   2081:55 ceph-mds

However, the standby-replay node (node3) has a much larger memory footprint.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
2166195 ceph      20   0   40.7g  38.2g  21000 S   0.7  20.4  86:31.18 ceph-mds 

This level has remained constant for days. The cluster warning has cleared and reappeared a couple of times, even though the memory footprint has not changed.

[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mdsnode3(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files
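
For reference, the configured limit and the daemon's own cache accounting can be cross-checked with something like the commands below (run the second one on the node hosting the daemon; the daemon name is just our standby-replay node, and this is a sketch rather than output we have captured):

ceph config get mds mds_cache_memory_limit    # configured cache limit, in bytes
ceph daemon mds.node3 cache status            # the daemon's own view of its cache usage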

The nodes also run a couple of OSDs, and we don't want them to be affected while we are away over the Christmas holidays, so I thought I would open a ticket here to see if we can get suggestions on preventive measures.

If you want any extra information, please ask.

Best regards
Daniel


Related issues

Related to CephFS - Bug #50048: mds: standby-replay only trims cache when it reaches the end of the replay log Resolved

History

#1 Updated by Patrick Donnelly about 1 year ago

Daniel Persson wrote:

Hi.

We have recently installed a Ceph cluster with about 27M objects; the filesystem holds roughly 15M files.

The MDS is configured with a 20 GB mds_cache_memory_limit. On the active node (node4), memory usage stays a bit above the limit, but not excessively so.

[...]

However, the standby-replay node (node3) has a much larger memory footprint.

[...]

This level has remained constant for days. The cluster warning has cleared and reappeared a couple of times, even though the memory footprint has not changed.

[...]

The nodes also run a couple of OSDs, and we don't want them to be affected while we are away over the Christmas holidays, so I thought I would open a ticket here to see if we can get suggestions on preventive measures.

If you want any extra information, please ask.

Please share `ceph versions` and `ceph fs dump`.

I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.

#2 Updated by Daniel Persson about 1 year ago

Patrick Donnelly wrote:

Please share `ceph versions` and `ceph fs dump`.

I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.

Hi Patrick.

Thank you for the quick reply. I thought the 'Affected Versions' field was where to supply the version we are running. I've also looked at the changelogs for 15.2.6 and 15.2.7 and did not see anything mentioning memory fixes.

To be clear: the 15 OSDs on 15.2.6 hold data; the other 14 are on slower hardware that is connected but does not carry any data at the moment.

Best regards
Daniel

{
    "mon": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 14,
        "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 15
    },
    "mds": {
        "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 3
    },
    "rgw": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
    },
    "overall": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 23,
        "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 18
    }
}
dumped fsmap epoch 20853
e20853
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   20853
flags   32
created 2020-11-02T13:38:07.192474+0100
modified        2020-12-21T15:10:24.295566+0100
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       0 (unknown)
last_failure    0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=148406}
failed
damaged
stopped 1
data_pools      [2]
metadata_pool   3
inline_data     disabled
balancer
standby_count_wanted    1
[mds.node4{0:148406} state up:active seq 39179 addr [v2:-----:6820/3504326627,v1:-----:6821/3504326627]]
[mds.node3{0:134983} state up:standby-replay seq 260130 addr [v2:-----:6820/2104241553,v1:-----:6821/2104241553]]

Standby daemons:

[mds.node2{-1:134965} state up:standby seq 2 addr [v2:-----:6820/340836297,v1:-----:6821/340836297]]
  • IP addresses replaced with -----

#3 Updated by Patrick Donnelly about 1 year ago

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

#4 Updated by Patrick Donnelly about 1 year ago

  • Status changed from New to Need More Info

#5 Updated by Daniel Persson about 1 year ago

Patrick Donnelly wrote:

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

Hi Patrick.

We have now updated the cluster and all the clients, and it's now running on v15.2.8.

Rank State           Daemon   Activity        Dentries  Inodes
0    active          node3    Reqs: 20.4 /s    6.9 M     6.9 M   
0-s  standby-replay  node4    Evts: 0 /s      12.4 M    12.4 M

node3

352481 ceph      20   0   25.3g  24.8g  20508 S   5.0  13.2 657:22.25 ceph-mds     

node4

3812398 ceph      20   0   41.8g  41.4g  20988 S   0.7  22.1  23:28.38 ceph-mds   

We still see that the standby MDS holds far more dentries and inodes, and uses more memory, than the configured limit.

=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.node4(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files

Please tell us if there is any other information we could provide for your work.

Best regards
Daniel

#6 Updated by Tom Myny 12 months ago

Hello,

We have noticed the same behavior on Ceph v15.2.3 and v15.2.8.

Note that this is not the case with all of our filesystems.

RANK      STATE             MDS            ACTIVITY     DNS    INOS
 0        active      web.ceph1.ahytos  Reqs:  139 /s  7283k  7273k
0-s   standby-replay  web.ceph2.hjydph  Evts:  123 /s  24.0M  24.0M

#7 Updated by Julian Einwag 9 months ago

Hi,
we are experiencing the same behavior, but with Ceph 14.2.18. Memory usage of the standby-replay MDS keeps growing and growing. I can easily reproduce this issue simply by running find over the whole filesystem.

#8 Updated by Patrick Donnelly 9 months ago

Daniel Persson wrote:

Patrick Donnelly wrote:

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

Hi Patrick.

We have now updated the cluster and all the clients, and it's now running on v15.2.8.

[...]

node3
[...]

node4
[...]

We still see that the standby MDS holds far more dentries and inodes, and uses more memory, than the configured limit.

[...]

Please tell us if there is any other information we could provide for your work.

Please try this command to see if that helps improve things:

ceph config set mds mds_cache_trim_threshold 256K

or even

ceph config set mds mds_cache_trim_threshold 512K
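
A quick way to confirm that the setting took effect, and to watch the standby-replay counters afterwards, would be e.g.:

ceph config get mds mds_cache_trim_threshold    # verify the new threshold is active
ceph fs status                                  # watch the standby-replay Evts/DNS/INOS columns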

#9 Updated by Daniel Persson 9 months ago

Hi Patrick.

I've tried running the cluster with both settings, for 24 hours each. It became slightly worse, but that might be because it coincided with some backup routines.

Rank       State            Daemon       Activity     Dentries       Inodes
0          active           node3     Reqs: 15.2 /s      6.8 M       6.8 M
0-s        standby-replay   node4     Evts: 0 /s        15.2 M      15.2 M

I have not seen the Evts counter go above 0 /s, which seems a bit strange; it should be at least a couple per second given the amount of activity on the active node.

I've previously tried to follow a SUSE guide on increasing trimming by 10%, but it only seems to affect the active node and not the standby-replay one.

https://www.suse.com/support/kb/doc/?id=000019740
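
For completeness, an adjustment of that sort would look roughly like the following; which tunables the guide actually means is my assumption (I have picked the MDS cap-recall throttles), and the values are purely illustrative:

ceph config get mds mds_recall_max_caps             # read the current values first
ceph config get mds mds_recall_max_decay_rate
ceph config set mds mds_recall_max_caps 5500        # illustrative: ~+10% assuming the 5000 default
ceph config set mds mds_recall_max_decay_rate 2.25  # illustrative: ~-10% assuming the 2.5 default

Since these recall settings govern client capability recall, and the standby-replay daemon serves no clients, it would not be surprising that they only affect the active node.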

Best regards
Daniel

Patrick Donnelly wrote:

Daniel Persson wrote:

Patrick Donnelly wrote:

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

Hi Patrick.

We have now updated the cluster and all the clients, and it's now running on v15.2.8.

[...]

node3
[...]

node4
[...]

We still see that the standby MDS holds far more dentries and inodes, and uses more memory, than the configured limit.

[...]

Please tell us if there is any other information we could provide for your work.

Please try this command to see if that helps improve things:

ceph config set mds mds_cache_trim_threshold 256K

or even

ceph config set mds mds_cache_trim_threshold 512K

#10 Updated by Howie C 6 months ago

We are seeing the same issue on Pacific 16.2.5 as well. It is not a big issue, but it is very annoying.

homes - 3 clients
=====
RANK      STATE             MDS                  ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      homes.ceph1m01.iakegt  Reqs: 1780 /s  2800k  2800k   359k  72.1k
 1        active      homes.ceph1m02.khomui  Reqs:    0 /s   862k   860k   114k  84.0k
0-s   standby-replay  homes.ceph1m03.waoiry  Evts: 2902 /s  14.0M  14.0M  1582k     0
1-s   standby-replay  homes.ceph1m01.rwitvl  Evts:    0 /s   862k   860k   113k     0
       POOL          TYPE     USED  AVAIL
cephfs.homes.meta  metadata  18.3G  52.8T
cephfs.homes.data    data    2807G  52.8T
MDS version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)

#11 Updated by Patrick Donnelly 6 months ago

  • Related to Bug #50048: mds: standby-replay only trims cache when it reaches the end of the replay log added

#12 Updated by Patrick Donnelly 6 months ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Patrick Donnelly
  • Target version set to v17.0.0

I've been able to reproduce this. Will try to track down the cause...

#13 Updated by Yongseok Oh 2 months ago

Patrick Donnelly wrote:

I've been able to reproduce this. Will try to track down the cause...

The same situation happens with standby-replay daemons in our cluster. It seems that dentries are rarely trimmed because the dentry's linkage inode is not set to nullptr; please refer to this line: https://github.com/ceph/ceph/blob/master/src/mds/MDCache.cc#L6688

MDCache::standby_trim_segment() tries to trim inodes and dentries and then moves them to the tail of the LRU list, but the dentry's linkage inode is still valid; CDir::unlink_inode() may not be called between the standby_trim_segment() and trim_lru() calls.
Could you briefly describe when/where a dentry's linkage inode is invalidated while replaying the journal?

It can be observed that trimming works successfully when the commit is reverted (https://github.com/ceph/ceph/pull/40963); however, that incurs recovery failures.

#14 Updated by Mykola Golub 15 days ago

Patrick, do you have any comments on the last comment from Yongseok Oh? Our customer also observes uncontrolled memory growth for an MDS in standby-replay state, and we believe the root cause is what Yongseok Oh described.
