Bug #56988

mds: memory leak suspected

Added by Ramin Torabi 4 months ago. Updated 3 months ago.

Status:
Need More Info
Priority:
Normal
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running a CephFS Pacific cluster in production:

MDS version: ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)

The six servers were installed with Rocky Linux 8.6 and configured with ceph-ansible from the "stable-6.0" branch.
The three MDS servers are running in an active/active/standby configuration:

mds_max_mds: 2

The MDS pool is replicated:

cephfs_metadata_pool:
  name: 'cephfs_201_metadata'
  pg_num: 256
  pgp_num: 256
  size: 3  # i.e. 3 fold replication
  type: 'replicated'
  rule_name: 'crush_rule_mdnvme'
  erasure_profile: ''
  # expected_num_objects:
  application: 'cephfs'
  # min_size: '{{ osd_pool_default_min_size }}'
  pg_autoscale_mode: no
  target_size_ratio: 0.200

while the data pool is using erasure coding 4+2

cephfs_data_pool:
  name: 'cephfs_201_data'
  pg_num: 1024
  pgp_num: 1024
  type: 'erasure'
  rule_name: 'crush_rule_ec42'
  erasure_profile: 'ec42_profile'
  # expected_num_objects:
  application: 'cephfs'
  size: "{{ ec42_profile.ec_config.m }}" # size==m: EC 4+2: k=4: m=2
  # min_size: '{{ osd_pool_default_min_size }}'
  pg_autoscale_mode: warn
  target_size_ratio: 0.8500

The pools are also compressed:

    bluestore_compression_algorithm: lz4
    bluestore_compression_mode: aggressive

Devices in the data CRUSH rule "crush_rule_ec42" are hard disks encrypted with LUKS, accelerated by NVMe SSDs serving as the write-ahead log (WAL). The CephFS clients use the kernel cephfs module from the SLES 11.3 kernel.

Problem:
The two active MDS servers are consuming increasingly more memory.
The configured memory values

[mds]
mds_cache_memory = 24G
mds_cache_memory_limit = 48G

are not adhered to.
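As a side note, it may be worth checking which limit the running daemon is actually enforcing. A minimal sketch (assuming the admin socket is available on the MDS host; note that "mds_cache_memory" does not appear to be a recognized option name, while mds_cache_memory_limit is stored as a byte count):

```shell
#!/bin/sh
# Sketch: check the cache limit the running MDS actually enforces.
# Assumption: "mds_cache_memory" is not a recognized option name; the
# effective knob is "mds_cache_memory_limit", stored as a byte count.

bytes_to_gib() {
    # integer conversion of a byte count to GiB
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

bytes_to_gib 25769803776    # 24G as bytes -> prints 24
bytes_to_gib 51539607552    # 48G as bytes -> prints 48

# On the MDS host (requires the daemon's admin socket):
if command -v ceph >/dev/null 2>&1; then
    ceph daemon "mds.$(hostname -s)" config get mds_cache_memory_limit
fi
```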

Active MDS servers:

root@rrdfc06:~ $ceph fs status
cephfs_201 - 43 clients
==========
RANK  STATE     MDS       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  rrdfc04  Reqs:    0 /s  1055k  1053k  73.6k  1174k  
 1    active  rrdfc06  Reqs:    0 /s  24.5k  22.9k  1121   7853   
        POOL           TYPE     USED  AVAIL  
cephfs_201_metadata  metadata  70.5G   876G  
  cephfs_201_data      data     545T   124T  
STANDBY MDS  
  rrdfc02    
MDS version: ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)

Heap stats:

root@rrdfc06:~ $ceph tell mds.$(hostname -s) heap stats
2022-07-19T08:51:26.515+0200 7f19baffd700  0 client.319645 ms_handle_reset on v2:172.17.0.86:6800/3109409290
2022-07-19T08:51:26.529+0200 7f19baffd700  0 client.309893 ms_handle_reset on v2:172.17.0.86:6800/3109409290
mds.rrdfc06 tcmalloc heap stats:------------------------------------------------
MALLOC:   269974258480 (257467.5 MiB) Bytes in use by application
MALLOC: +       180224 (    0.2 MiB) Bytes in page heap freelist
MALLOC: +   1919539832 ( 1830.6 MiB) Bytes in central cache freelist                                                            
MALLOC: +      6489600 (    6.2 MiB) Bytes in transfer cache freelist                                                           
MALLOC: +     83483736 (   79.6 MiB) Bytes in thread cache freelists                                                            
MALLOC: +   1238499328 ( 1181.1 MiB) Bytes in malloc metadata                                                                   
MALLOC:   ------------                                                                                                          
MALLOC: = 273222451200 (260565.2 MiB) Actual memory used (physical + swap)                                                      
MALLOC: +    279666688 (  266.7 MiB) Bytes released to OS (aka unmapped)                                                        
MALLOC:   ------------                                                                                                          
MALLOC: = 273502117888 (260831.9 MiB) Virtual address space used                                                                
MALLOC:                                                                                                                         
MALLOC:       18760628              Spans in use                                                                                
MALLOC:             22              Thread heaps in use                                                                         
MALLOC:           8192              Tcmalloc page size                                                                          
------------------------------------------------                                                                                
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
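The key figure in these stats is "Bytes in use by application": here roughly 257 GiB is genuinely held by the MDS, so releasing freelists would not recover much. A small sketch for extracting that figure for monitoring (the helper name is made up; `heap release` is the runtime counterpart of the ReleaseFreeMemory() hint above):

```shell
#!/bin/sh
# Sketch: pull the "Bytes in use by application" MiB figure out of the
# tcmalloc heap stats, e.g. to graph it. Helper name is made up.

app_mib() {
    awk '/Bytes in use by application/ { gsub(/[(]/, "", $3); print $3 }'
}

# Example line taken from the stats above:
echo 'MALLOC:   269974258480 (257467.5 MiB) Bytes in use by application' | app_mib
# -> 257467.5

# Asking tcmalloc to hand freelist memory back to the OS is possible at
# runtime, but here the memory is in use by the application, so this
# will not recover much:
if command -v ceph >/dev/null 2>&1; then
    ceph tell "mds.$(hostname -s)" heap release
fi
```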

Restarting the MDS takes a while:

root@rrdfc06:~ $time systemctl restart ceph-mds@rrdfc06.service

real    1m45.460s
user    0m0.005s
sys     0m0.004s

but frees a lot of memory for the time being:

root@rrdfc06:~ $ceph tell mds.$(hostname -s) heap stats
2022-07-19T08:54:29.247+0200 7f277d7fa700  0 client.329514 ms_handle_reset on v2:172.17.0.86:6800/1793786924
2022-07-19T08:54:29.261+0200 7f277d7fa700  0 client.329520 ms_handle_reset on v2:172.17.0.86:6800/1793786924
mds.rrdfc06 tcmalloc heap stats:------------------------------------------------
MALLOC:       14006472 (   13.4 MiB) Bytes in use by application
MALLOC: +       647168 (    0.6 MiB) Bytes in page heap freelist
MALLOC: +       362264 (    0.3 MiB) Bytes in central cache freelist
MALLOC: +       310272 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: +       803872 (    0.8 MiB) Bytes in thread cache freelists
MALLOC: +      2752512 (    2.6 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     18882560 (   18.0 MiB) Actual memory used (physical + swap)
MALLOC: +            0 (    0.0 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =     18882560 (   18.0 MiB) Virtual address space used
MALLOC:
MALLOC:            294              Spans in use
MALLOC:             12              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

Is there any way to analyse the root cause of this issue?
Currently I am restarting the MDS servers manually on a regular basis.

graph.php(1).png - Ganglia Memory Summary Graph (35.4 KB) Ramin Torabi, 08/01/2022 07:02 AM

MDS_memory_arrow.jpg (92.2 KB) Ramin Torabi, 08/02/2022 06:46 AM

56988_2022-08-26_ceph_mds_perf.logs.tar.bz2 (564 KB) Ramin Torabi, 08/26/2022 06:07 AM

History

#1 Updated by Ramin Torabi 4 months ago

#2 Updated by Ramin Torabi 4 months ago

I have automated restarting a single MDS server when its memory consumption reaches 80 GB (roughly twice the configured mds_cache_memory_limit) and the cluster is healthy.
This appears to be a working workaround. When an MDS server is restarted, the standby MDS becomes active.
The main disadvantage I observed is a temporary (about one minute) health warning during the restart:
[WARN] FS_DEGRADED: 1 filesystem is degraded
fs cephfs... is degraded
[WARN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
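The watchdog described above could be sketched roughly as follows (the 80 GiB threshold and all helper names are assumptions, not the exact script used):

```shell
#!/bin/sh
# Rough sketch of the restart watchdog described above. The 80 GiB
# threshold and helper names are assumptions, not the exact script used.

THRESHOLD_KB=$(( 80 * 1024 * 1024 ))   # 80 GiB in KiB (ps reports RSS in KiB)

mds_rss_kb() {
    # summed resident set size of local ceph-mds processes, in KiB
    ps -o rss= -C ceph-mds | awk '{ sum += $1 } END { print sum + 0 }'
}

over_threshold() {
    # args: <rss_kb> <threshold_kb>; succeeds if rss exceeds the threshold
    [ "$1" -gt "$2" ]
}

if command -v ceph >/dev/null 2>&1; then
    if over_threshold "$(mds_rss_kb)" "$THRESHOLD_KB" \
            && [ "$(ceph health)" = "HEALTH_OK" ]; then
        systemctl restart "ceph-mds@$(hostname -s).service"
    fi
fi
```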

#3 Updated by Ramin Torabi 4 months ago

Here is a graph of the memory summary without and with the automated restart.

#4 Updated by Venky Shankar 4 months ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Component(FS) MDS added
  • Labels (FS) multimds added

#5 Updated by Patrick Donnelly 4 months ago

  • Subject changed from memory leak in MDS suspected to mds: memory leak suspected
  • Description updated (diff)
  • Tags deleted (mds memory)

#6 Updated by Patrick Donnelly 4 months ago

If you could capture

ceph tell mds.<fs_name>:0 perf dump

at regular intervals (e.g. once every five minutes), especially while the event is in progress, that would help.
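A capture script along these lines could be run from cron (filesystem name, interval, and file naming scheme are assumptions):

```shell
#!/bin/sh
# Sketch: capture one "perf dump" snapshot from rank 0, meant to be run
# from cron every five minutes. Filesystem name and file naming scheme
# are assumptions.
#   */5 * * * * /usr/local/bin/mds_perf_snapshot.sh

FS_NAME=cephfs_201

perf_log_name() {
    # timestamped file name for one snapshot; arg: a date string
    echo "ceph_mds_perf_$1.txt"
}

stamp=$(date +%Y-%m-%d_%H%M%S)
perf_log_name "$stamp"

if command -v ceph >/dev/null 2>&1; then
    ceph tell "mds.$FS_NAME:0" perf dump > "$(perf_log_name "$stamp")"
fi
```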

#7 Updated by Patrick Donnelly 4 months ago

  • Status changed from Triaged to Need More Info

#8 Updated by Ramin Torabi 3 months ago

I have attached the current "ceph tell mds.<fs_name>:0 perf dump" log files.
Files named
ceph_mds_perf_<date>.txt
were created by the snapshot taken every five minutes.
Files named
ceph_mds_perf_watchdog_<date>.txt
were created by the script that restarts the MDS when its memory consumption exceeds 80 GB; the dump was taken just before the restart.

You can extract the directory using:
tar -xvjpf 56988_2022-08-26_ceph_mds_perf.logs.tar.bz2

I hope that helps.
