Bug #48673
High memory usage on standby replay MDS
Description
Hi.
We have recently installed a Ceph cluster with about 27M objects; the filesystem holds roughly 15M files.
The MDS is configured with a 20 GB mds_cache_memory_limit. On the active node (node 4) the memory stays a bit above the limit, but not excessively so.
PID      USER  PR  NI  VIRT   RES    SHR    S  %CPU  %MEM  TIME+    COMMAND
2165668  ceph  20  0   27.6g  26.1g  22088  S  12.3  13.9  2081:55  ceph-mds
However, we have a problem with the standby-replay node (node 3), which has a much larger memory footprint.
PID      USER  PR  NI  VIRT   RES    SHR    S  %CPU  %MEM  TIME+     COMMAND
2166195  ceph  20  0   40.7g  38.2g  21000  S  0.7   20.4  86:31.18  ceph-mds
This level has remained constant for days. The warnings from the cluster have reset a couple of times, even though the memory footprint has not changed.
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.node3(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files
The nodes also run a couple of OSDs, and we don't want them to be affected now that we are soon off for the Christmas holidays, so I thought I'd open a ticket here to see if we can get any suggestions on preventive measures.
If you want any extra information, please ask.
Best regards
Daniel
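As an aside on the numbers in the warning: the MDS_CACHE_OVERSIZED check fires once cache usage exceeds mds_health_cache_threshold (1.5 by default) times mds_cache_memory_limit, so 30GB against a 20GB limit sits right at that boundary. A minimal sketch of the check (illustrative Python, not Ceph's actual implementation):

```python
# Illustrative model of the MDS_CACHE_OVERSIZED health check; the
# threshold mirrors mds_health_cache_threshold (1.5 by default).
# This is a sketch, not Ceph's actual code.

GiB = 1024 ** 3

def cache_oversized(cache_bytes: int, limit_bytes: int, threshold: float = 1.5) -> bool:
    """Warn only once cache usage exceeds threshold * limit."""
    return cache_bytes > limit_bytes * threshold

# A 25 GB cache against a 20 GB limit is over the limit but under the
# warning threshold; 31 GB crosses it.
print(cache_oversized(25 * GiB, 20 * GiB))  # False
print(cache_oversized(31 * GiB, 20 * GiB))  # True
```

This is why the cache can sit moderately above the configured limit (as on the active node) without any warning being raised.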
Related issues
History
#1 Updated by Patrick Donnelly over 3 years ago
Daniel Persson wrote:
Hi.
We have recently installed a Ceph cluster with about 27M objects; the filesystem holds roughly 15M files.
The MDS is configured with a 20 GB mds_cache_memory_limit. On the active node (node 4) the memory stays a bit above the limit, but not excessively so.
[...]
However, we have a problem with the standby-replay node (node 3), which has a much larger memory footprint.
[...]
This level has remained constant for days. The warnings from the cluster have reset a couple of times, even though the memory footprint has not changed.
[...]
The nodes also run a couple of OSDs, and we don't want them to be affected now that we are soon off for the Christmas holidays, so I thought I'd open a ticket here to see if we can get any suggestions on preventive measures.
If you want any extra information, please ask.
Please share `ceph versions` and `ceph fs dump`.
I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.
#2 Updated by Daniel Persson over 3 years ago
Patrick Donnelly wrote:
Please share `ceph versions` and `ceph fs dump`.
I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.
Hi Patrick.
Thank you for the quick reply. I thought the "Affected Versions" field was where to supply the version we are running. I've also looked at the changelogs for 15.2.6 and 15.2.7 and did not see anything mentioning memory fixes.
To be clear, the 15 OSDs on 15.2.6 have data. The other 14 are slower hardware that we have connected but don't carry any data at the moment.
Best regards
Daniel
{
  "mon": {
    "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
  },
  "mgr": {
    "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
  },
  "osd": {
    "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 14,
    "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 15
  },
  "mds": {
    "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 3
  },
  "rgw": {
    "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
  },
  "overall": {
    "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 23,
    "ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 18
  }
}
dumped fsmap epoch 20853
e20853
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch 20853
flags 32
created 2020-11-02T13:38:07.192474+0100
modified 2020-12-21T15:10:24.295566+0100
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
min_compat_client 0 (unknown)
last_failure 0
last_failure_osd_epoch 0
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=148406}
failed
damaged
stopped 1
data_pools [2]
metadata_pool 3
inline_data disabled
balancer
standby_count_wanted 1
[mds.node4{0:148406} state up:active seq 39179 addr [v2:-----:6820/3504326627,v1:-----:6821/3504326627]]
[mds.node3{0:134983} state up:standby-replay seq 260130 addr [v2:-----:6820/2104241553,v1:-----:6821/2104241553]]
Standby daemons:
[mds.node2{-1:134965} state up:standby seq 2 addr [v2:-----:6820/340836297,v1:-----:6821/340836297]]
- IP addresses replaced with -----
#3 Updated by Patrick Donnelly over 3 years ago
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
#4 Updated by Patrick Donnelly about 3 years ago
- Status changed from New to Need More Info
#5 Updated by Daniel Persson about 3 years ago
Patrick Donnelly wrote:
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
Hi Patrick.
We have now updated the cluster and all the clients, and it's now running on v15.2.8.
Rank  State           Daemon  Activity       Dentries  Inodes
0     active          node3   Reqs: 20.4 /s  6.9 M     6.9 M
0-s   standby-replay  node4   Evts: 0 /s     12.4 M    12.4 M
node3
352481 ceph 20 0 25.3g 24.8g 20508 S 5.0 13.2 657:22.25 ceph-mds
node4
3812398 ceph 20 0 41.8g 41.4g 20988 S 0.7 22.1 23:28.38 ceph-mds
We still see that the standby MDS holds many more entries and uses more memory than configured.
=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.node4(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files
Please tell us if there is any other information we could provide for your work.
Best regards
Daniel
#6 Updated by Tom Myny about 3 years ago
Hello,
We have noticed the same behavior in ceph v15.2.3 and v15.2.8
Note, this is not the case with all filesystems.
RANK  STATE           MDS               ACTIVITY      DNS    INOS
0     active          web.ceph1.ahytos  Reqs: 139 /s  7283k  7273k
0-s   standby-replay  web.ceph2.hjydph  Evts: 123 /s  24.0M  24.0M
#7 Updated by Julian Einwag almost 3 years ago
Hi,
we are experiencing the same behavior, but with Ceph 14.2.18. Memory usage of the standby-replay MDS keeps growing and growing. I can easily reproduce this issue by simply running `find` over the whole filesystem.
#8 Updated by Patrick Donnelly almost 3 years ago
Daniel Persson wrote:
Patrick Donnelly wrote:
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
Hi Patrick.
We have now updated the cluster and all the clients, and it's now running on v15.2.8.
[...]
node3
[...]node4
[...]We still see that the standby MDS holds a lot more entries and also more memory than requested.
[...]
Please tell us if there is any other information we could provide for your work.
Please try this command to see if that helps improve things:
ceph config set mds mds_cache_trim_threshold 256K
or even
ceph config set mds mds_cache_trim_threshold 512K
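The intent of raising mds_cache_trim_threshold is to allow more items to be trimmed per tick, so trimming can keep pace with the rate at which replay adds entries to the cache. A toy model of that effect (illustrative Python; the function and numbers are hypothetical, not Ceph internals):

```python
# Toy model of mds_cache_trim_threshold: each tick the MDS trims at
# most `trim_threshold` items from its cache. If new items arrive
# faster than that, the cache grows without bound. The function and
# numbers are illustrative, not Ceph internals.

def cache_size_after(ticks: int, arrivals_per_tick: int, trim_threshold: int) -> int:
    size = 0
    for _ in range(ticks):
        size += arrivals_per_tick          # items added by replay
        size -= min(size, trim_threshold)  # bounded trim per tick
    return size

# Trimming 64K items/tick cannot keep up with 100K arrivals/tick,
# while 128K/tick can.
print(cache_size_after(100, 100_000, 64 * 1024))   # backlog keeps growing
print(cache_size_after(100, 100_000, 128 * 1024))  # stays at zero
```

In this model, any threshold below the arrival rate leaves a backlog that grows linearly with time, which is the qualitative behavior being reported.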
#9 Updated by Daniel Persson almost 3 years ago
Hi Patrick.
I've tried running the cluster with each setting for 24 hours. It became slightly worse, but that might be because it coincided with some backup routines.
Rank State Daemon Activity Dentries Inodes
0 active node3 Reqs: 15.2 /s 6.8 M 6.8 M
0-s standby-replay node4 Evts: 0 /s 15.2 M 15.2 M
I have not seen Evts go above 0 /s, which seems a bit strange; it should be at least a couple per second if there is a lot of activity on the active node.
I've previously tried to follow a SUSE guide for increasing trimming by 10%, but it only seems to affect the active node, not the standby-replay one.
https://www.suse.com/support/kb/doc/?id=000019740
Best regards
Daniel
Patrick Donnelly wrote:
Daniel Persson wrote:
Patrick Donnelly wrote:
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
Hi Patrick.
We have now updated the cluster and all the clients, and it's now running on v15.2.8.
[...]
node3
[...]node4
[...]We still see that the standby MDS holds a lot more entries and also more memory than requested.
[...]
Please tell us if there is any other information we could provide for your work.
Please try this command to see if that helps improve things:
ceph config set mds mds_cache_trim_threshold 256K
or even
ceph config set mds mds_cache_trim_threshold 512K
#10 Updated by Howie C over 2 years ago
We are seeing the same issue on pacific 16.2.5 as well. Not a big issue but very annoying.
homes - 3 clients
=====
RANK  STATE           MDS                    ACTIVITY       DNS    INOS   DIRS   CAPS
0     active          homes.ceph1m01.iakegt  Reqs: 1780 /s  2800k  2800k  359k   72.1k
1     active          homes.ceph1m02.khomui  Reqs: 0 /s     862k   860k   114k   84.0k
0-s   standby-replay  homes.ceph1m03.waoiry  Evts: 2902 /s  14.0M  14.0M  1582k  0
1-s   standby-replay  homes.ceph1m01.rwitvl  Evts: 0 /s     862k   860k   113k   0
POOL               TYPE      USED   AVAIL
cephfs.homes.meta  metadata  18.3G  52.8T
cephfs.homes.data  data      2807G  52.8T
MDS version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
#11 Updated by Patrick Donnelly over 2 years ago
- Related to Bug #50048: mds: standby-replay only trims cache when it reaches the end of the replay log added
#12 Updated by Patrick Donnelly over 2 years ago
- Status changed from Need More Info to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v17.0.0
I've been able to reproduce this. Will try to track down the cause...
#13 Updated by Yongseok Oh over 2 years ago
Patrick Donnelly wrote:
I've been able to reproduce this. Will try to track down the cause...
The same situation happens with standby-replay daemons in our cluster. It seems that dentries are rarely trimmed because the dentry's linkage inode is not set to nullptr. Please refer to this line: https://github.com/ceph/ceph/blob/master/src/mds/MDCache.cc#L6688
MDCache::standby_trim_segment() tries to trim inodes and dentries and then moves them to the last position of the LRU list, but the dentry's linkage inode is still valid. CDir::unlink_inode() may not be called between the standby_trim_segment() and trim_lru() calls.
Could you briefly describe when/where dentry's linkage inode is invalidated during replaying journals?
Trimming can be observed to complete successfully when the commit is reverted (https://github.com/ceph/ceph/pull/40963); however, that incurs recovery failures.
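The failure mode described above can be modeled as an LRU walk that can only free dentries whose inode linkage has been cleared, and merely re-queues the rest (illustrative Python; class and field names are hypothetical, not Ceph's actual types):

```python
# Simplified model of the trimming failure described above: the LRU
# walk can only free a dentry whose inode linkage has been cleared;
# dentries still linked to an inode are re-queued at the LRU tail,
# so the cache never shrinks. Names are hypothetical, not Ceph's
# actual types.

from collections import deque

class Dentry:
    def __init__(self, name, linked_inode=None):
        self.name = name
        self.linked_inode = linked_inode  # stays set if unlink_inode() never runs

def standby_trim(lru: deque, max_trim: int) -> int:
    """Walk the LRU once, freeing up to max_trim unlinked dentries."""
    freed = 0
    for _ in range(len(lru)):
        if freed >= max_trim:
            break
        d = lru.popleft()
        if d.linked_inode is None:
            freed += 1        # trimmable: drop it
        else:
            lru.append(d)     # still pinned: back to the tail
    return freed

# Every dentry still has a live linkage, so nothing is ever freed.
lru = deque(Dentry(f"f{i}", linked_inode=object()) for i in range(1000))
print(standby_trim(lru, 100), len(lru))  # 0 freed, 1000 still cached
```

In this model the walk terminates, but cache size stays constant no matter how often trimming runs, matching the observed unbounded growth as replay keeps adding entries.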
#14 Updated by Mykola Golub about 2 years ago
Patrick, do you have any comments on Yongseok Oh's last comment? Our customer also observes uncontrolled memory growth for an MDS in standby-replay state, and we believe the root cause is what Yongseok described.
#15 Updated by Venky Shankar about 2 years ago
Yongseok/Mykola - Patrick is on PTO - I'll try to make progress on this issue.
Yongseok, you mention https://github.com/ceph/ceph/pull/40963, which skips trimming inodes for standby-replay - afaiu, that's required to avoid failures during journal replay when an inode gets trimmed but still has a corresponding journal entry. So we would run into issues if we let a standby-replay daemon trim inodes from its cache. However, the unbounded memory usage is not favorable either.
I'll try to see if we could have an alternate solution for this.
#16 Updated by Patrick Donnelly over 1 year ago
- Target version deleted (v17.0.0)
#17 Updated by Venky Shankar over 1 year ago
- Priority changed from Normal to High
- Target version set to v18.0.0
- Backport set to pacific,quincy
- Severity changed from 3 - minor to 2 - major
We seem to be running into this pretty frequently and easily with standby-replay configuration.
#18 Updated by Patrick Donnelly over 1 year ago
- Status changed from In Progress to Fix Under Review
- Backport changed from pacific,quincy to quincy,pacific
- Pull request ID set to 48483
#19 Updated by Patrick Donnelly over 1 year ago
- Related to Bug #40213: mds: cannot switch mds state from standby-replay to active added
#20 Updated by Patrick Donnelly over 1 year ago
- Related to Bug #50246: mds: failure replaying journal (EMetaBlob) added
#21 Updated by Joshua Hoblitt 9 months ago
I believe that I have observed this issue while trying to reproduce a different MDS problem. It manifests as the standby MDS cache continuously growing well beyond the configured limit. Commanding the MDS to drop its cache does nothing. However, it appears that briefly disabling allow_standby_replay does flush the caches. E.g.:
~ $ ceph fs status auxtel
auxtel - 1 clients
======
RANK  STATE           MDS       ACTIVITY    DNS    INOS   DIRS   CAPS
0     active          auxtel-c  Reqs: 0 /s  360k   355k   17.6k  858
1     active          auxtel-b  Reqs: 0 /s  1494k  1494k  5159   100
2     active          auxtel-d  Reqs: 0 /s  1051k  1050k  2665   295
0-s   standby-replay  auxtel-e  Evts: 0 /s  139k   129k   15.5k  0
1-s   standby-replay  auxtel-a  Evts: 0 /s  3463k  3462k  5155   0
2-s   standby-replay  auxtel-f  Evts: 0 /s  850k   836k   981    0
POOL             TYPE      USED   AVAIL
auxtel-metadata  metadata  10.5G  5927G
auxtel-data0     data      1968G  5927G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
ceph> fs set auxtel allow_standby_replay false
ceph> fs status auxtel
auxtel - 1 clients
======
RANK  STATE   MDS       ACTIVITY    DNS    INOS   DIRS   CAPS
0     active  auxtel-c  Reqs: 0 /s  360k   355k   17.6k  858
1     active  auxtel-b  Reqs: 0 /s  1494k  1494k  5159   1336
2     active  auxtel-d  Reqs: 0 /s  1051k  1050k  2665   295
POOL             TYPE      USED   AVAIL
auxtel-metadata  metadata  10.4G  5925G
auxtel-data0     data      1968G  5925G
STANDBY MDS
auxtel-e
auxtel-f
auxtel-a
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
ceph> fs set auxtel allow_standby_replay true
ceph> fs status auxtel
auxtel - 1 clients
======
RANK  STATE           MDS       ACTIVITY    DNS    INOS   DIRS   CAPS
0     active          auxtel-c  Reqs: 0 /s  360k   355k   17.6k  858
1     active          auxtel-b  Reqs: 0 /s  1494k  1494k  5159   1336
2     active          auxtel-d  Reqs: 0 /s  1051k  1050k  2665   295
0-s   standby-replay  auxtel-e  Evts: 0 /s  0      0      0      0
1-s   standby-replay  auxtel-f  Evts: 0 /s  0      0      0      0
2-s   standby-replay  auxtel-a  Evts: 0 /s  0      0      0      0
POOL             TYPE      USED   AVAIL
auxtel-metadata  metadata  10.4G  5925G
auxtel-data0     data      1968G  5925G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
~ $ k top pods -l app.kubernetes.io/part-of=auxtel
NAME                                     CPU(cores)  MEMORY(bytes)
rook-ceph-mds-auxtel-a-dfdfb685f-sdpmd   16m         26Mi
rook-ceph-mds-auxtel-b-7c6875b594-xfmhq  18m         7948Mi
rook-ceph-mds-auxtel-c-5799f48f45-c25ml  19m         3075Mi
rook-ceph-mds-auxtel-d-864f8987cb-77z5f  20m         7246Mi
rook-ceph-mds-auxtel-e-6989dd8b7f-gh8g7  11m         11Mi
rook-ceph-mds-auxtel-f-76cd5f5886-68psl  11m         11Mi
rook-ceph-nfs-auxtel-a-cfcd4cb65-t7pmc   2m          217Mi
#22 Updated by Konstantin Shalygin 8 months ago
- Target version changed from v18.0.0 to v19.0.0
- Backport changed from quincy,pacific to pacific quincy reef
#23 Updated by Joshua Hoblitt 8 months ago
This issue triggered again this morning for the first time in two weeks. What's noteworthy is that the active MDS seems to be leaking memory as well. Note the size of auxtel-d, which is active:
~ $ ceph health detail
HEALTH_WARN 1 MDSs report oversized cache; 2 MDSs report slow requests
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.auxtel-f(mds.1): MDS cache is too large (7GB/4GB); 0 inodes in use by clients, 0 stray files
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
    mds.auxtel-c(mds.0): 3 slow requests are blocked > 30 secs
    mds.auxtel-d(mds.2): 15482432 slow requests are blocked > 30 secs
~ $ k top pods -l app.kubernetes.io/part-of=auxtel
NAME                                     CPU(cores)  MEMORY(bytes)
rook-ceph-mds-auxtel-a-7757c969bc-d48nn  17m         10027Mi
rook-ceph-mds-auxtel-b-cc44b44b9-pn2lp   13m         1349Mi
rook-ceph-mds-auxtel-c-84f59bc477-b4zxg  21m         1352Mi
rook-ceph-mds-auxtel-d-556fbdffdd-lkmfw  1002m       49894Mi
rook-ceph-mds-auxtel-e-5bcfb5cbd-pzvh9   15m         228Mi
rook-ceph-mds-auxtel-f-67444d9d4b-7bwqk  19m         11612Mi
rook-ceph-nfs-auxtel-a-bcf8f7f67-p6cc9   1m          742Mi
ceph> fs status auxtel
auxtel - 1 clients
======
RANK  STATE           MDS       ACTIVITY    DNS    INOS   DIRS  CAPS
0     active          auxtel-c  Reqs: 0 /s  334k   334k   1740  268
1     active          auxtel-a  Reqs: 0 /s  1174k  1174k  5594  50
2     active          auxtel-d  Reqs: 0 /s  674    591    165   426
1-s   standby-replay  auxtel-f  Evts: 0 /s  3114k  3114k  7285  0
2-s   standby-replay  auxtel-e  Evts: 0 /s  2515   439    149   0
0-s   standby-replay  auxtel-b  Evts: 0 /s  337k   334k   813   0
POOL             TYPE      USED   AVAIL
auxtel-metadata  metadata  23.2G  4863G
auxtel-data0     data      1958G  4863G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
#24 Updated by Joshua Hoblitt 8 months ago
I've confirmed that `fs set auxtel allow_standby_replay false` does free the leaked memory in the standby MDS but doesn't fix the issue with the active MDS, so it seems probable that I'm seeing two different MDS memory leak issues at the same time.
#25 Updated by Venky Shankar 5 months ago
- Backport changed from pacific quincy reef to quincy,reef
#26 Updated by Venky Shankar 4 months ago
- Status changed from Fix Under Review to Pending Backport
#27 Updated by Backport Bot 4 months ago
- Copied to Backport #63675: quincy: High memory usage on standby replay MDS added
#28 Updated by Backport Bot 4 months ago
- Copied to Backport #63676: reef: High memory usage on standby replay MDS added
#29 Updated by Backport Bot 4 months ago
- Tags set to backport_processed