Bug #61493

open

rbd:"enable tcmalloc with 2M hugepage" causes iops of 4K random write degradation from 300K to 30K

Added by Zhiqiang Liu 11 months ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

rbd:"enable tcmalloc with 2M hugepage" causes iops of 4K random write degradation from 330K to 30K

version: ceph v14.2.8.

To improve 4K random write IOPS, I enabled tcmalloc with 2M hugepages. At first, IOPS increased from 270K to 330K.
After one hour, IOPS began to decline rapidly, down to 20K. 'ceph -s' reported "64 slow ops, daemons [osd.3,osd.36,osd.19] have slow ops". I had set "osd_memory_target = 10G" to limit the memory used by each OSD. At 20K IOPS, most of the memory in osd.19 was in the page heap freelist. It seems that the memory in the page heap freelist can no longer be used by the OSD, while total memory has already reached osd_memory_target.

'ceph tell osd.19 heap stats' shows the following:
```
[root@ceph1 ceph]# ceph tell osd.19 heap stats
osd.19 tcmalloc heap stats:------------------------------------------------
MALLOC: 556469648 ( 530.7 MiB) Bytes in use by application
MALLOC: + 10349543424 ( 9870.1 MiB) Bytes in page heap freelist
MALLOC: + 813390616 ( 775.7 MiB) Bytes in central cache freelist
MALLOC: + 8120576 ( 7.7 MiB) Bytes in transfer cache freelist
MALLOC: + 84684376 ( 80.8 MiB) Bytes in thread cache freelists
MALLOC: + 59113472 ( 56.4 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 11871322112 (11321.4 MiB) Actual memory used (physical + swap)
MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 11871322112 (11321.4 MiB) Virtual address space used
MALLOC:
MALLOC: 231468 Spans in use
MALLOC: 114 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
```
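
As a rough cross-check of the figures above, the following sketch adds up the reported byte counts and compares them with the memory target. It is only an illustration: the struct and function names are hypothetical, it assumes "10G" means 10 GiB, and it assumes the mapped heap is the sum of the in-use and freelist lines (malloc metadata excluded).

```cpp
#include <cassert>
#include <cstdint>

// Byte counts copied from the 'heap stats' output for osd.19.
// The struct and all names here are hypothetical, for illustration only.
struct HeapStats {
  std::uint64_t in_use;          // Bytes in use by application
  std::uint64_t page_heap_free;  // Bytes in page heap freelist
  std::uint64_t central_free;    // Bytes in central cache freelist
  std::uint64_t transfer_free;   // Bytes in transfer cache freelist
  std::uint64_t thread_free;     // Bytes in thread cache freelists
};

inline HeapStats osd19_stats() {
  return HeapStats{556469648ULL, 10349543424ULL, 813390616ULL,
                   8120576ULL, 84684376ULL};
}

// Assumption: osd_memory_target = 10G is interpreted as 10 GiB.
constexpr std::uint64_t kTargetBytes = 10ULL << 30;

// Memory held in tcmalloc free lists, i.e. mapped but not used by the OSD.
inline std::uint64_t freelist_bytes(const HeapStats& s) {
  return s.page_heap_free + s.central_free + s.transfer_free + s.thread_free;
}

// Assumption: mapped heap = in-use bytes + freelist bytes (metadata excluded,
// and nothing is unmapped, as the output above reports 0 bytes released).
inline std::uint64_t mapped_bytes(const HeapStats& s) {
  return s.in_use + freelist_bytes(s);
}

// Share of the mapped heap that sits in free lists, in percent.
inline double freelist_percent(const HeapStats& s) {
  const std::uint64_t mapped = mapped_bytes(s);
  assert(mapped > 0);
  return 100.0 * static_cast<double>(freelist_bytes(s)) /
         static_cast<double>(mapped);
}
```

Under these assumptions the mapped heap is 11812208640 bytes, which is above the 10737418240-byte target, while only 556469648 bytes are in use by the application.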

In addition, 'perf top -p <slow-osd>' shows:
```
Samples: 277K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 132899867608 lost: 0/0
Overhead Shared Object Symbol
57.57% ceph-osd [.] rocksdb::crc32c::ExtendImpl<&rocksdb::crc32c::Slow_CRC32>
18.66% [kernel] [k] __arch_copy_to_user
1.39% [kernel] [k] generic_file_buffered_read
1.36% [kernel] [k] find_get_entry
0.83% ceph-osd [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindGreaterOrEqual
0.70% libtcmalloc.so.4.5.6 [.] operator delete[]
0.61% libtcmalloc.so.4.5.6 [.] operator new[]
0.47% libc-2.33.so [.] memcmp
0.47% [kernel] [k] __radix_tree_lookup
0.43% [kernel] [k] __softirqentry_text_start
0.42% [kernel] [k] __wake_up_common_lock
```

Steps used to enable tcmalloc with 2M hugepages:
1. Enable 2M hugepages in grub.cfg.
2. Set the hugepage mount point in /etc/fstab: "nodev /mnt/huge hugetlbfs defaults,pagesize=2M 0 0"
3. Enable 2M hugepages for tcmalloc by setting "Environment=TCMALLOC_MEMFS_MALLOC_PATH=/mnt/huge/osd%i" in "/usr/lib/systemd/system/ceph-osd@.service".

OSD configuration in ceph.conf:
```
[OSD]
bluestore_rocksdb_options = use_direct_reads=true,use_direct_io_for_flush_and_compaction=true,compression=kNoCompression,max_write_buffer_number=128,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4M,target_file_size_base=4M,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=0
mon_osd_full_ratio = 0.97
mon_osd_nearfull_ratio = 0.95
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
bluestore_cache_meta_ratio = 0.49
bluestore_cache_kv_ratio = 0.49
bluestore_cache_size_ssd = 6G
osd_memory_target = 10G

```

Actions #1

Updated by Zhiqiang Liu 11 months ago

Could someone review this issue?

Actions #2

Updated by Zhiqiang Liu 11 months ago

I find that:

Memory allocated through hugetlb cannot be freed by tcmalloc, so the memory allocated to an OSD never decreases. In the bstore_mempool thread:

BlueStore::MempoolThread::entry()
->Manager::tune_memory()  // tune
  -> ceph_heap_release_free_memory();
  -> ceph_heap_get_numeric_property("generic.heap_size", &heap_size);
  -> ceph_heap_get_numeric_property("tcmalloc.pageheap_unmapped_bytes", &unmapped);
  -> mapped = heap_size - unmapped;
  -> if mapped < target_mem: new_size += ratio * (max_mem - new_size);
  -> if mapped >= target_mem: new_size -= ratio * (new_size - min_mem); // while mapped memory stays at or above target_mem, tuned_mem keeps shrinking toward a very low value
->_trim_shards(interval_stats_trim)  // trim based on tuned_mem

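
The effect of the second branch can be illustrated with a small simulation. This is a minimal sketch, not the Ceph implementation: the function names are hypothetical, and `ratio` is assumed to be the relative distance between `mapped` and `target_mem`, since the trace above does not define it.

```cpp
#include <cassert>
#include <cstdint>

// One tuning step, following the trace above.
// Assumption: ratio = 1 - mapped/target when below target,
//             ratio = 1 - target/mapped when at or above target.
inline std::uint64_t tune_step(std::uint64_t cur, std::uint64_t mapped,
                               std::uint64_t target, std::uint64_t min_mem,
                               std::uint64_t max_mem) {
  assert(mapped > 0 && target > 0);
  assert(min_mem <= cur && cur <= max_mem);
  if (mapped < target) {
    // Below target: grow toward max_mem.
    const double ratio = 1.0 - static_cast<double>(mapped) / target;
    return cur + static_cast<std::uint64_t>(ratio * (max_mem - cur));
  }
  // At or above target: shrink toward min_mem.
  const double ratio = 1.0 - static_cast<double>(target) / mapped;
  return cur - static_cast<std::uint64_t>(ratio * (cur - min_mem));
}

// Repeats the step with 'mapped' held constant, modelling the case where
// freelist memory cannot be returned to the OS and so 'mapped' never drops.
inline std::uint64_t tune_repeatedly(std::uint64_t cur, std::uint64_t mapped,
                                     std::uint64_t target,
                                     std::uint64_t min_mem,
                                     std::uint64_t max_mem, int steps) {
  for (int i = 0; i < steps; ++i) {
    cur = tune_step(cur, mapped, target, min_mem, max_mem);
  }
  return cur;
}
```

With `mapped` stuck above `target_mem`, every step removes a fixed fraction of the remaining headroom, so the tuned size decays geometrically toward `min_mem` regardless of how little memory the OSD actually uses.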
Actions #3

Updated by Zhiqiang Liu 11 months ago

Zhiqiang Liu wrote:

I find that:

Memory allocated through hugetlb cannot be freed by tcmalloc, so the memory allocated to an OSD never decreases. In the bstore_mempool thread:

[...]

When tcmalloc is enabled with 2M hugepages, ceph_heap_release_free_memory() in the bstore_mempool thread cannot release the memory held in the free lists (page heap freelist, central freelist, thread freelist). Once heap_size exceeds target_mem, tuned_mem keeps decreasing until it reaches a very small value, even though a large amount of memory is already sitting in the free lists, where the OSD cannot use it.

tuned_mem should be adjusted according to the memory actually in use by the OSD. However, `"generic.heap_size"-"tcmalloc.pageheap_unmapped_bytes"` includes not only the memory used by the OSD but also the free memory held in the free lists.

We can use ceph_heap_get_numeric_property("generic.current_allocated_bytes", &current_allocated_byte) to obtain how much memory the OSD is actually using. This value does not include the free lists (page heap freelist, central freelist, thread freelist). tcmalloc computes this property as follows:

    if (strcmp(name, "generic.current_allocated_bytes") == 0) {
      TCMallocStats stats;
      ExtractStats(&stats, NULL, NULL, NULL);
      *value = stats.pageheap.system_bytes
               - stats.thread_bytes
               - stats.central_bytes
               - stats.transfer_bytes
               - stats.pageheap.free_bytes
               - stats.pageheap.unmapped_bytes;
      return true;
    }
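
To show how the choice of metric changes the tuning direction, here is a minimal sketch using the osd.19 figures from the description. The struct, enum and function names are hypothetical, the `system_bytes` value is derived by assuming it equals "Actual memory used" minus malloc metadata, and the target is assumed to be 10 GiB.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the tcmalloc stats fields used in the snippet above.
struct HeapCounters {
  std::uint64_t system_bytes;
  std::uint64_t thread_bytes;
  std::uint64_t central_bytes;
  std::uint64_t transfer_bytes;
  std::uint64_t free_bytes;      // page heap freelist
  std::uint64_t unmapped_bytes;
};

// osd.19 figures; system_bytes is an assumption (11871322112 - 59113472).
inline HeapCounters osd19_counters() {
  return HeapCounters{11812208640ULL, 84684376ULL, 813390616ULL,
                      8120576ULL, 10349543424ULL, 0ULL};
}

// Metric currently compared against the target: heap_size - unmapped.
inline std::uint64_t mapped_bytes(const HeapCounters& c) {
  assert(c.unmapped_bytes <= c.system_bytes);
  return c.system_bytes - c.unmapped_bytes;
}

// Proposed metric: same arithmetic as "generic.current_allocated_bytes".
inline std::uint64_t current_allocated_bytes(const HeapCounters& c) {
  const std::uint64_t idle = c.thread_bytes + c.central_bytes +
                             c.transfer_bytes + c.free_bytes +
                             c.unmapped_bytes;
  assert(idle <= c.system_bytes);  // guard against unsigned underflow
  return c.system_bytes - idle;
}

enum class TuneDirection { Grow, Shrink };

// Same comparison as in tune_memory(): below target grows, otherwise shrinks.
inline TuneDirection decide(std::uint64_t measured, std::uint64_t target) {
  return measured < target ? TuneDirection::Grow : TuneDirection::Shrink;
}
```

For these figures the mapped metric is above the target and keeps shrinking the caches, whereas the allocated metric (556469648 bytes) is far below it and would let them grow. Whether growing is safe when the free lists cannot be reused is a separate question that this sketch does not address.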
