
Bug #56988


We are running a CephFS Pacific cluster in production:

   MDS version: ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) 

 The six servers were installed using Rocky Linux 8.6 and configured using ceph-ansible from the "stable-6.0" branch. 
 The three MDS servers are running in an active/active/standby configuration: 

   mds_max_mds: 2 
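 For reference, this corresponds to the following runtime command on our filesystem (the setting can also be changed on the fly):

 <pre>
 # set the number of active MDS ranks for the filesystem
 ceph fs set cephfs_201 max_mds 2
 </pre>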

 The CephFS metadata pool is replicated:

 <pre> 
 cephfs_metadata_pool: 
   name: 'cephfs_201_metadata' 
   pg_num: 256 
   pgp_num: 256 
   size: 3    # i.e. 3 fold replication 
   type: 'replicated' 
   rule_name: 'crush_rule_mdnvme' 
   erasure_profile: '' 
   # expected_num_objects: 
   application: 'cephfs' 
   # min_size: '{{ osd_pool_default_min_size }}' 
   pg_autoscale_mode: no 
   target_size_ratio: 0.200 
 </pre> 
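 The effective settings can be cross-checked against this definition (a quick sanity check):

 <pre>
 # verify replication factor and PG count of the metadata pool
 ceph osd pool get cephfs_201_metadata size
 ceph osd pool get cephfs_201_metadata pg_num
 </pre>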

 The data pool uses 4+2 erasure coding:

 <pre> 
 cephfs_data_pool: 
   name: 'cephfs_201_data' 
   pg_num: 1024 
   pgp_num: 1024 
   type: 'erasure' 
   rule_name: 'crush_rule_ec42' 
   erasure_profile: 'ec42_profile' 
   # expected_num_objects: 
   application: 'cephfs' 
   size: "{{ ec42_profile.ec_config.m }}" # size==m: EC 4+2: k=4: m=2 
   # min_size: '{{ osd_pool_default_min_size }}' 
   pg_autoscale_mode: warn 
   target_size_ratio: 0.8500 
 </pre> 
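 For completeness: the erasure-code profile "ec42_profile" is not shown here; it would have been created roughly as follows (a sketch, the failure domain and device class are assumptions, not copied from our config):

 <pre>
 # create/inspect the 4+2 erasure-code profile
 # (crush-failure-domain and crush-device-class are assumed values)
 ceph osd erasure-code-profile set ec42_profile k=4 m=2 crush-failure-domain=host crush-device-class=hdd
 ceph osd erasure-code-profile get ec42_profile
 </pre>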

 The pools are also compressed: 

 <pre>
     bluestore_compression_algorithm: lz4
     bluestore_compression_mode: aggressive
 </pre>
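 Whether the running daemons actually apply these values can be checked via the admin socket on an OSD host (osd.0 below is just an example ID):

 <pre>
 # query the compression settings from a running OSD
 ceph daemon osd.0 config get bluestore_compression_algorithm
 ceph daemon osd.0 config get bluestore_compression_mode
 </pre>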

 Devices in the data CRUSH rule "crush_rule_ec42" are hard disks encrypted with LUKS; they are accelerated by NVMe SSDs serving as the write-ahead log (WAL). The CephFS clients use the cephfs kernel module from the SLES 11.3 kernel.

 Problem:
 The two active MDS servers consume more and more memory over time.
 The configured memory values

 <pre> 
 [mds] 
 mds_cache_memory = 24G 
 mds_cache_memory_limit = 48G 
 </pre> 

 are not adhered to. 
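 The limit the running daemon actually evaluates can be queried directly; mds_cache_memory_limit is the option the MDS honours:

 <pre>
 # show the cache limit the running MDS actually applies
 ceph tell mds.$(hostname -s) config get mds_cache_memory_limit
 </pre>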

 Active MDS servers: 

 <pre> 
 root@rrdfc06:~ $ceph fs status 
 cephfs_201 - 43 clients 
 ========== 
 RANK    STATE       MDS         ACTIVITY       DNS      INOS     DIRS     CAPS   
  0      active    rrdfc04    Reqs:      0 /s    1055k    1053k    73.6k    1174k   
  1      active    rrdfc06    Reqs:      0 /s    24.5k    22.9k    1121     7853    
         POOL             TYPE       USED    AVAIL   
 cephfs_201_metadata    metadata    70.5G     876G   
   cephfs_201_data        data       545T     124T   
 STANDBY MDS   
   rrdfc02     
 MDS version: ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) 
 </pre> 

 Heap stats: 

 <pre> 
 root@rrdfc06:~ $ceph tell mds.$(hostname -s) heap stats 
 2022-07-19T08:51:26.515+0200 7f19baffd700    0 client.319645 ms_handle_reset on v2:172.17.0.86:6800/3109409290 
 2022-07-19T08:51:26.529+0200 7f19baffd700    0 client.309893 ms_handle_reset on v2:172.17.0.86:6800/3109409290 
 mds.rrdfc06 tcmalloc heap stats:------------------------------------------------ 
 MALLOC:     269974258480 (257467.5 MiB) Bytes in use by application
 MALLOC: +         180224 (     0.2 MiB) Bytes in page heap freelist
 MALLOC: +     1919539832 (  1830.6 MiB) Bytes in central cache freelist
 MALLOC: +        6489600 (     6.2 MiB) Bytes in transfer cache freelist
 MALLOC: +       83483736 (    79.6 MiB) Bytes in thread cache freelists
 MALLOC: +     1238499328 (  1181.1 MiB) Bytes in malloc metadata
 MALLOC:   ------------
 MALLOC: =   273222451200 (260565.2 MiB) Actual memory used (physical + swap)
 MALLOC: +      279666688 (   266.7 MiB) Bytes released to OS (aka unmapped)
 MALLOC:   ------------
 MALLOC: =   273502117888 (260831.9 MiB) Virtual address space used
 MALLOC:
 MALLOC:         18760628              Spans in use
 MALLOC:               22              Thread heaps in use
 MALLOC:             8192              Tcmalloc page size
 ------------------------------------------------                                                                                 
 Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). 
 Bytes released to the OS take up virtual address space but no physical memory. 
 </pre> 
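 As the footer of the heap stats suggests, freelist memory can be handed back to the OS without a restart; note this only releases unmapped pages and does not shrink the bytes in use by the application:

 <pre>
 # ask tcmalloc to release freelist memory back to the OS
 ceph tell mds.$(hostname -s) heap release
 </pre>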

 Restarting the MDS takes a while:

 <pre> 
 root@rrdfc06:~ $time systemctl restart ceph-mds@rrdfc06.service 

 real      1m45.460s 
 user      0m0.005s 
 sys       0m0.004s 
 </pre> 

 but frees a lot of memory for the time being: 

 <pre> 
 root@rrdfc06:~ $ceph tell mds.$(hostname -s) heap stats 
 2022-07-19T08:54:29.247+0200 7f277d7fa700    0 client.329514 ms_handle_reset on v2:172.17.0.86:6800/1793786924 
 2022-07-19T08:54:29.261+0200 7f277d7fa700    0 client.329520 ms_handle_reset on v2:172.17.0.86:6800/1793786924 
 mds.rrdfc06 tcmalloc heap stats:------------------------------------------------ 
 MALLOC:         14006472 (     13.4 MiB) Bytes in use by application 
 MALLOC: +         647168 (      0.6 MiB) Bytes in page heap freelist 
 MALLOC: +         362264 (      0.3 MiB) Bytes in central cache freelist 
 MALLOC: +         310272 (      0.3 MiB) Bytes in transfer cache freelist 
 MALLOC: +         803872 (      0.8 MiB) Bytes in thread cache freelists 
 MALLOC: +        2752512 (      2.6 MiB) Bytes in malloc metadata 
 MALLOC:     ------------ 
 MALLOC: =       18882560 (     18.0 MiB) Actual memory used (physical + swap) 
 MALLOC: +              0 (      0.0 MiB) Bytes released to OS (aka unmapped) 
 MALLOC:     ------------ 
 MALLOC: =       18882560 (     18.0 MiB) Virtual address space used 
 MALLOC: 
 MALLOC:              294                Spans in use 
 MALLOC:               12                Thread heaps in use 
 MALLOC:             8192                Tcmalloc page size 
 ------------------------------------------------ 
 Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). 
 Bytes released to the OS take up virtual address space but no physical memory. 
 </pre> 

 Is there any way to analyse the root cause of this issue? 
 Currently I am restarting the MDS servers manually on a regular basis.
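 One avenue would be tcmalloc's built-in heap profiler (a sketch; the profile path and the pprof binary name vary by distribution):

 <pre>
 # profile MDS allocations while memory grows, then inspect with pprof
 ceph tell mds.$(hostname -s) heap start_profiler
 # ... wait while memory consumption grows ...
 ceph tell mds.$(hostname -s) heap dump
 ceph tell mds.$(hostname -s) heap stop_profiler
 # profiles are written next to the MDS log
 pprof --text /usr/bin/ceph-mds /var/log/ceph/mds.rrdfc06.profile.*.heap
 </pre>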
