Bug #48673

High memory usage on standby replay MDS

Added by Daniel Persson about 3 years ago. Updated 3 months ago.

Status:
Pending Backport
Priority:
High
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Community (user)
Tags:
backport_processed
Backport:
quincy,reef
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi.

We have recently installed a Ceph cluster with about 27M objects. The filesystem seems to hold about 15M files.

The MDS is configured with a 20 GB mds_cache_memory_limit. If we look at the nodes, memory stays a bit above the limit on the active node 4, but not by much.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
2165668 ceph      20   0   27.6g  26.1g  22088 S  12.3  13.9   2081:55 ceph-mds

However, we have problems with the standby-replay node 3, which has a much larger memory footprint.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
2166195 ceph      20   0   40.7g  38.2g  21000 S   0.7  20.4  86:31.18 ceph-mds 

This level has remained constant for days. The cluster warning has reset a couple of times, even though the memory footprint has not changed.

[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mdsnode3(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files
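The figures behind this warning can be inspected directly; a minimal sketch, assuming the daemon is reachable on its admin socket as `mds.node3`:

```shell
# Current cache usage as seen by the daemon itself
ceph daemon mds.node3 cache status

# The configured limit, in bytes (20 GB here)
ceph config get mds mds_cache_memory_limit
```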

The nodes also run a couple of OSDs, and we don't want them to be affected while we are away over the Xmas holidays, so I thought I'd open a ticket here and see if we can get any suggestions on preventive measures.

If you want any extra information, please ask.

Best regards
Daniel


Related issues

Related to CephFS - Bug #50048: mds: standby-replay only trims cache when it reaches the end of the replay log Resolved
Related to CephFS - Bug #40213: mds: cannot switch mds state from standby-replay to active Resolved
Related to CephFS - Bug #50246: mds: failure replaying journal (EMetaBlob) Resolved
Copied to CephFS - Backport #63675: quincy: High memory usage on standby replay MDS In Progress
Copied to CephFS - Backport #63676: reef: High memory usage on standby replay MDS In Progress

History

#1 Updated by Patrick Donnelly about 3 years ago

Daniel Persson wrote:

Hi.

We have recently installed a Ceph cluster with about 27M objects. The filesystem seems to hold about 15M files.

The MDS is configured with a 20 GB mds_cache_memory_limit. If we look at the nodes, memory stays a bit above the limit on the active node 4, but not by much.

[...]

However, we have problems with the standby-replay node 3, which has a much larger memory footprint.

[...]

This level has remained constant for days. The cluster warning has reset a couple of times, even though the memory footprint has not changed.

[...]

The nodes also run a couple of OSDs, and we don't want them to be affected while we are away over the Xmas holidays, so I thought I'd open a ticket here and see if we can get any suggestions on preventive measures.

If you want any extra information, please ask.

Please share `ceph versions` and `ceph fs dump`.

I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.

#2 Updated by Daniel Persson about 3 years ago

Patrick Donnelly wrote:

Please share `ceph versions` and `ceph fs dump`.

I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.

Hi Patrick.

Thank you for the quick reply. I thought the "Affected Versions" field was where to supply the version we are running. I've also looked at the changelogs for 15.2.6 and 15.2.7 and did not see anything mentioning memory fixes.

To be clear, the 15 OSDs on 15.2.6 have data. The other 14 are slower hardware that we have connected but don't carry any data at the moment.

Best regards
Daniel

{
    "mon": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 14,
        "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 15
    },
    "mds": {
        "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 3
    },
    "rgw": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
    },
    "overall": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 23,
        "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 18
    }
}
dumped fsmap epoch 20853
e20853
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   20853
flags   32
created 2020-11-02T13:38:07.192474+0100
modified        2020-12-21T15:10:24.295566+0100
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       0 (unknown)
last_failure    0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=148406}
failed
damaged
stopped 1
data_pools      [2]
metadata_pool   3
inline_data     disabled
balancer
standby_count_wanted    1
[mds.node4{0:148406} state up:active seq 39179 addr [v2:-----:6820/3504326627,v1:-----:6821/3504326627]]
[mds.node3{0:134983} state up:standby-replay seq 260130 addr [v2:-----:6820/2104241553,v1:-----:6821/2104241553]]

Standby daemons:

[mds.node2{-1:134965} state up:standby seq 2 addr [v2:-----:6820/340836297,v1:-----:6821/340836297]]
  • IP addresses replaced with -----

#3 Updated by Patrick Donnelly about 3 years ago

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

#4 Updated by Patrick Donnelly about 3 years ago

  • Status changed from New to Need More Info

#5 Updated by Daniel Persson about 3 years ago

Patrick Donnelly wrote:

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

Hi Patrick.

We have now updated the cluster and all the clients, and it's now running on v15.2.8.

Rank State           Daemon   Activity        Dentries  Inodes
0    active          node3    Reqs: 20.4 /s    6.9 M     6.9 M   
0-s  standby-replay  node4    Evts: 0 /s      12.4 M    12.4 M

node3

352481 ceph      20   0   25.3g  24.8g  20508 S   5.0  13.2 657:22.25 ceph-mds     

node4

3812398 ceph      20   0   41.8g  41.4g  20988 S   0.7  22.1  23:28.38 ceph-mds   

We still see that the standby MDS holds far more entries and uses more memory than configured.

=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.node4(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files

Please tell us if there is any other information we could provide for your work.

Best regards
Daniel

#6 Updated by Tom Myny about 3 years ago

Hello,

We have noticed the same behavior on ceph v15.2.3 and v15.2.8.

Note, this is not the case with all filesystems.

RANK      STATE             MDS            ACTIVITY     DNS    INOS
 0        active      web.ceph1.ahytos  Reqs:  139 /s  7283k  7273k
0-s   standby-replay  web.ceph2.hjydph  Evts:  123 /s  24.0M  24.0M

#7 Updated by Julian Einwag almost 3 years ago

Hi,
we are experiencing the same behavior, but with ceph 14.2.18. Memory usage of the standby-replay MDS keeps growing and growing. I can easily reproduce this issue by simply running `find` over the whole filesystem.

#8 Updated by Patrick Donnelly almost 3 years ago

Daniel Persson wrote:

Patrick Donnelly wrote:

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

Hi Patrick.

We have now updated the cluster and all the clients, and it's now running on v15.2.8.

[...]

node3
[...]

node4
[...]

We still see that the standby MDS holds far more entries and uses more memory than configured.

[...]

Please tell us if there is any other information we could provide for your work.

Please try this command to see if that helps improve things:

ceph config set mds mds_cache_trim_threshold 256K

or even

ceph config set mds mds_cache_trim_threshold 512K
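For completeness, a sketch of applying the suggested change and then verifying it took effect — the `mds.<name>` in the admin-socket query is a placeholder to substitute:

```shell
# Raise the trim threshold cluster-wide for all MDS daemons
ceph config set mds mds_cache_trim_threshold 512K

# Confirm the value in the cluster configuration database
ceph config get mds mds_cache_trim_threshold

# Confirm the running daemon picked it up (substitute your daemon name)
ceph daemon mds.<name> config get mds_cache_trim_threshold
```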

#9 Updated by Daniel Persson almost 3 years ago

Hi Patrick.

I've tried running the cluster with each setting for 24 hours. It became slightly worse, but that might be because it coincided with some backup routines.

Rank       State            Daemon       Activity     Dentries       Inodes
0          active           node3     Reqs: 15.2 /s      6.8 M       6.8 M
0-s        standby-replay   node4     Evts: 0 /s        15.2 M      15.2 M

I have not seen the Evts counter go over 0 /s, which seems a bit strange. It should be replaying at least a couple of events per second if there is a lot of activity on the active node.

I've previously tried to follow a SUSE guide for increasing trimming by 10%, but it only seems to affect the active node, not the standby-replay one.

https://www.suse.com/support/kb/doc/?id=000019740

Best regards
Daniel

Patrick Donnelly wrote:

Daniel Persson wrote:

Patrick Donnelly wrote:

Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.

Hi Patrick.

We have now updated the cluster and all the clients, and it's now running on v15.2.8.

[...]

node3
[...]

node4
[...]

We still see that the standby MDS holds far more entries and uses more memory than configured.

[...]

Please tell us if there is any other information we could provide for your work.

Please try this command to see if that helps improve things:

ceph config set mds mds_cache_trim_threshold 256K

or even

ceph config set mds mds_cache_trim_threshold 512K

#10 Updated by Howie C over 2 years ago

We are seeing the same issue on pacific 16.2.5 as well. Not a big issue but very annoying.

homes - 3 clients
=====
RANK      STATE             MDS                  ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      homes.ceph1m01.iakegt  Reqs: 1780 /s  2800k  2800k   359k  72.1k
 1        active      homes.ceph1m02.khomui  Reqs:    0 /s   862k   860k   114k  84.0k
0-s   standby-replay  homes.ceph1m03.waoiry  Evts: 2902 /s  14.0M  14.0M  1582k     0
1-s   standby-replay  homes.ceph1m01.rwitvl  Evts:    0 /s   862k   860k   113k     0
       POOL          TYPE     USED  AVAIL
cephfs.homes.meta  metadata  18.3G  52.8T
cephfs.homes.data    data    2807G  52.8T
MDS version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)

#11 Updated by Patrick Donnelly over 2 years ago

  • Related to Bug #50048: mds: standby-replay only trims cache when it reaches the end of the replay log added

#12 Updated by Patrick Donnelly over 2 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Patrick Donnelly
  • Target version set to v17.0.0

I've been able to reproduce this. Will try to track down the cause...

#13 Updated by Yongseok Oh over 2 years ago

Patrick Donnelly wrote:

I've been able to reproduce this. Will try to track down the cause...

The same situation happens with standby-replay daemons in our cluster. It seems that dentries are rarely trimmed because the dentry's inode linkage is not set to nullptr. Please refer to this line: https://github.com/ceph/ceph/blob/master/src/mds/MDCache.cc#L6688

MDCache::standby_trim_segment() tries to trim inodes and dentries and then moves them to the last position of the LRU list. However, the dentry's inode linkage is still valid, and CDir::unlink_inode() may not be called between the standby_trim_segment() and trim_lru() calls.
Could you briefly describe when/where a dentry's inode linkage is invalidated during journal replay?

It can be observed that trimming succeeds when this commit is reverted: https://github.com/ceph/ceph/pull/40963. However, reverting it incurs recovery failures.
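The failure mode described above can be illustrated with a simplified model (names and structure here are illustrative only, not the actual MDCache code): if a trim pass refuses to drop dentries whose inode linkage is still set, it merely rotates them to the tail of the LRU, and the cache never shrinks below the limit.

```python
from collections import OrderedDict

def standby_trim_sketch(lru, max_size):
    """Simplified model of a standby-replay trim pass: dentries whose
    inode linkage was never cleared cannot be dropped, so they are
    re-queued at the LRU tail instead of being freed."""
    trimmed = []
    attempts = len(lru)  # visit each entry at most once per pass
    while len(lru) > max_size and attempts > 0:
        name, linked = next(iter(lru.items()))  # head of the LRU
        del lru[name]
        if linked:
            lru[name] = linked  # cannot trim: rotate to the tail
        else:
            trimmed.append(name)  # linkage cleared: actually freed
        attempts -= 1
    return trimmed

# Replay never cleared any linkage -> nothing is trimmed, cache stays full
cache = OrderedDict((f"dn{i}", True) for i in range(8))
print(standby_trim_sketch(cache, 4), len(cache))  # prints: [] 8
```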

#14 Updated by Mykola Golub about 2 years ago

Patrick, do you have any comments on Yongseok Oh's last comment? Our customer also observes uncontrolled memory growth for an MDS in standby-replay state, and we believe the root cause is what Yongseok described.

#15 Updated by Venky Shankar about 2 years ago

Yongseok/Mykola - Patrick is on PTO - I'll try to make progress on this issue.

Yongseok, you mention https://github.com/ceph/ceph/pull/40963, which skips trimming inodes for standby-replay. As far as I understand, that is required to avoid failures during journal replay when an inode gets trimmed but still has a corresponding journal entry. So we would run into issues if we let a standby-replay daemon trim inodes from its cache. However, the unbounded memory usage is not acceptable either.

I'll see if there is an alternate solution.

#16 Updated by Patrick Donnelly over 1 year ago

  • Target version deleted (v17.0.0)

#17 Updated by Venky Shankar over 1 year ago

  • Priority changed from Normal to High
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Severity changed from 3 - minor to 2 - major

We seem to run into this pretty frequently and easily with a standby-replay configuration.

#18 Updated by Patrick Donnelly over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Backport changed from pacific,quincy to quincy,pacific
  • Pull request ID set to 48483

#19 Updated by Patrick Donnelly over 1 year ago

  • Related to Bug #40213: mds: cannot switch mds state from standby-replay to active added

#20 Updated by Patrick Donnelly over 1 year ago

  • Related to Bug #50246: mds: failure replaying journal (EMetaBlob) added

#21 Updated by Joshua Hoblitt 8 months ago

I believe I have observed this issue while trying to reproduce a different MDS problem. It manifests as the standby MDS cache continuously growing well beyond the configured limit. Telling the MDS to drop its cache does nothing. However, briefly disabling allow_standby_replay does flush the caches. E.g.:

 ~ $ ceph fs status auxtel
auxtel - 1 clients
======
RANK      STATE         MDS        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0        active      auxtel-c  Reqs:    0 /s   360k   355k  17.6k   858   
 1        active      auxtel-b  Reqs:    0 /s  1494k  1494k  5159    100   
 2        active      auxtel-d  Reqs:    0 /s  1051k  1050k  2665    295   
0-s   standby-replay  auxtel-e  Evts:    0 /s   139k   129k  15.5k     0   
1-s   standby-replay  auxtel-a  Evts:    0 /s  3463k  3462k  5155      0   
2-s   standby-replay  auxtel-f  Evts:    0 /s   850k   836k   981      0   
      POOL         TYPE     USED  AVAIL  
auxtel-metadata  metadata  10.5G  5927G  
  auxtel-data0     data    1968G  5927G  
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

ceph> fs set auxtel allow_standby_replay false
ceph> fs status auxtel
auxtel - 1 clients
======
RANK  STATE     MDS        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  auxtel-c  Reqs:    0 /s   360k   355k  17.6k   858   
 1    active  auxtel-b  Reqs:    0 /s  1494k  1494k  5159   1336   
 2    active  auxtel-d  Reqs:    0 /s  1051k  1050k  2665    295   
      POOL         TYPE     USED  AVAIL  
auxtel-metadata  metadata  10.4G  5925G  
  auxtel-data0     data    1968G  5925G  
STANDBY MDS  
  auxtel-e   
  auxtel-f   
  auxtel-a   
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

ceph> fs set auxtel allow_standby_replay true 
ceph> fs status auxtel
auxtel - 1 clients
======
RANK      STATE         MDS        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0        active      auxtel-c  Reqs:    0 /s   360k   355k  17.6k   858   
 1        active      auxtel-b  Reqs:    0 /s  1494k  1494k  5159   1336   
 2        active      auxtel-d  Reqs:    0 /s  1051k  1050k  2665    295   
0-s   standby-replay  auxtel-e  Evts:    0 /s     0      0      0      0   
1-s   standby-replay  auxtel-f  Evts:    0 /s     0      0      0      0   
2-s   standby-replay  auxtel-a  Evts:    0 /s     0      0      0      0   
      POOL         TYPE     USED  AVAIL  
auxtel-metadata  metadata  10.4G  5925G  
  auxtel-data0     data    1968G  5925G  
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

 ~ $ k top pods -l app.kubernetes.io/part-of=auxtel
NAME                                      CPU(cores)   MEMORY(bytes)   
rook-ceph-mds-auxtel-a-dfdfb685f-sdpmd    16m          26Mi            
rook-ceph-mds-auxtel-b-7c6875b594-xfmhq   18m          7948Mi          
rook-ceph-mds-auxtel-c-5799f48f45-c25ml   19m          3075Mi          
rook-ceph-mds-auxtel-d-864f8987cb-77z5f   20m          7246Mi          
rook-ceph-mds-auxtel-e-6989dd8b7f-gh8g7   11m          11Mi            
rook-ceph-mds-auxtel-f-76cd5f5886-68psl   11m          11Mi            
rook-ceph-nfs-auxtel-a-cfcd4cb65-t7pmc    2m           217Mi        

#22 Updated by Konstantin Shalygin 7 months ago

  • Target version changed from v18.0.0 to v19.0.0
  • Backport changed from quincy,pacific to pacific quincy reef

#23 Updated by Joshua Hoblitt 7 months ago

This issue triggered again this morning for the first time in 2 weeks. What's noteworthy is that the active MDS seems to be leaking memory as well. Note the size of mds auxtel-d, which is active:

 ~ $ ceph health detail
HEALTH_WARN 1 MDSs report oversized cache; 2 MDSs report slow requests
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.auxtel-f(mds.1): MDS cache is too large (7GB/4GB); 0 inodes in use by clients, 0 stray files
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
    mds.auxtel-c(mds.0): 3 slow requests are blocked > 30 secs
    mds.auxtel-d(mds.2): 15482432 slow requests are blocked > 30 secs

 ~ $ k top pods -l app.kubernetes.io/part-of=auxtel
NAME                                      CPU(cores)   MEMORY(bytes)   
rook-ceph-mds-auxtel-a-7757c969bc-d48nn   17m          10027Mi         
rook-ceph-mds-auxtel-b-cc44b44b9-pn2lp    13m          1349Mi          
rook-ceph-mds-auxtel-c-84f59bc477-b4zxg   21m          1352Mi          
rook-ceph-mds-auxtel-d-556fbdffdd-lkmfw   1002m        49894Mi         
rook-ceph-mds-auxtel-e-5bcfb5cbd-pzvh9    15m          228Mi           
rook-ceph-mds-auxtel-f-67444d9d4b-7bwqk   19m          11612Mi         
rook-ceph-nfs-auxtel-a-bcf8f7f67-p6cc9    1m           742Mi      

ceph> fs status auxtel
auxtel - 1 clients
======
RANK      STATE         MDS        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0        active      auxtel-c  Reqs:    0 /s   334k   334k  1740    268   
 1        active      auxtel-a  Reqs:    0 /s  1174k  1174k  5594     50   
 2        active      auxtel-d  Reqs:    0 /s   674    591    165    426   
1-s   standby-replay  auxtel-f  Evts:    0 /s  3114k  3114k  7285      0   
2-s   standby-replay  auxtel-e  Evts:    0 /s  2515    439    149      0   
0-s   standby-replay  auxtel-b  Evts:    0 /s   337k   334k   813      0   
      POOL         TYPE     USED  AVAIL  
auxtel-metadata  metadata  23.2G  4863G  
  auxtel-data0     data    1958G  4863G  
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

#24 Updated by Joshua Hoblitt 7 months ago

I've confirmed that `fs set auxtel allow_standby_replay false` does free the leaked memory in the standby MDS, but it doesn't fix the issue with the active MDS, so it seems probable that I'm seeing two different MDS memory leaks at the same time.

#25 Updated by Venky Shankar 4 months ago

  • Backport changed from pacific quincy reef to quincy,reef

#26 Updated by Venky Shankar 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#27 Updated by Backport Bot 3 months ago

  • Copied to Backport #63675: quincy: High memory usage on standby replay MDS added

#28 Updated by Backport Bot 3 months ago

  • Copied to Backport #63676: reef: High memory usage on standby replay MDS added

#29 Updated by Backport Bot 3 months ago

  • Tags set to backport_processed
