Bug #41037 (closed)

Containerized cluster failure due to osd_memory_target not being set to ratio of cgroup_limit per osd_memory_target_cgroup_limit_ratio

Added by Dustin Black over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Under heavy I/O workload (generated against multiple postgres databases, backed by Ceph RBD, via the pgbench utility), we have experienced a condition where the memory limit for the OSD pods is reached, OOM kills occur, and the cluster then becomes unresponsive. This is a critical failure, apparently caused by the osd_memory_target parameter not being sized with some headroom below the cgroup limit. Mark Nelson has repeatedly said that 10-20% of headroom is needed to ensure that the OSD can trim its memory before the OOM killer is triggered.

Note that because BlueStore computes osd_memory_target for us based on the cgroup limit, there is no way to override this, so this is a high-priority problem!

Deploying containerized Ceph via Rook.io, we apply a memory limit to the OSDs via the cluster.yaml, but not a memory request.

  resources:
    osd:
      requests:
        cpu: "2" 
      limits:
        cpu: "2" 
        memory: "6Gi" 

Per fc3bdad [1], our expectation is that, without a memory request, the value of osd_memory_target should default to cgroup_limit * osd_memory_target_cgroup_limit_ratio (0.8 default). However, the deployed cluster shows ...

sh-4.2# ceph daemon /var/lib/rook/osd0/ceph-osd.0.asok config show | grep osd_memory
    "osd_memory_base": "805306368",
    "osd_memory_cache_min": "134217728",
    "osd_memory_cache_resize_interval": "1.000000",
    "osd_memory_expected_fragmentation": "0.150000",
    "osd_memory_target": "6442450944",  <-- Target = limit instead of = (limit*0.8)
    "osd_memory_target_cgroup_limit_ratio": "0.800000",  <-- Ratio looks correct

The resulting cluster state (ceph status output):

  cluster:
    id:     ffa396e4-7874-472d-95fa-692d754f5e6e
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            2 osds down
            15/76461 objects unfound (0.020%)
            Reduced data availability: 309 pgs inactive, 309 pgs peering, 103 pgs stale
            Possible data damage: 10 pgs recovery_unfound
            Degraded data redundancy: 73406/229339 objects degraded (32.008%), 669 pgs degraded, 644 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 2w)
    mgr: a(active, since 2d)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 12 osds: 6 up (since 24m), 8 in (since 23m); 104 remapped pgs

  data:
    pools:   3 pools, 1224 pgs
    objects: 76.46k objects, 296 GiB
    usage:   638 GiB used, 11 TiB / 12 TiB avail
    pgs:     25.245% pgs not active
             73406/229339 objects degraded (32.008%)
             15/76461 objects unfound (0.020%)
             548 active+undersized+degraded
             253 peering
             145 active+clean
             100 stale+active+clean
             56  remapped+peering
             40  active+recovery_wait+undersized+degraded
             36  active+undersized+degraded+remapped+backfill_wait
             23  active+recovery_wait+degraded
             9   active+recovery_wait+undersized+degraded+remapped
             6   active+recovery_unfound+undersized+degraded+remapped
             4   active+recovery_unfound+undersized+degraded
             2   stale+active+recovery_wait+degraded
             1   stale+active+undersized+degraded
             1   active+recovery_wait

[1] https://github.com/ceph/ceph/commit/fc3bdad87597066a813a3734b2a79e803340be36#diff-a9faffcf40600fd57aea5451cef5abe9


Related issues 1 (0 open, 1 closed)

Copied to bluestore - Backport #41273: nautilus: Containerized cluster failure due to osd_memory_target not being set to ratio of cgroup_limit per osd_memory_target_cgroup_limit_ratio (Resolved, Sage Weil)
Actions #1

Updated by Neha Ojha over 4 years ago

Which version of Ceph are you running?

Actions #2

Updated by Neha Ojha over 4 years ago

Can you enable debug_osd=10 and see what this line (https://github.com/ceph/ceph/commit/fc3bdad87597066a813a3734b2a79e803340be36#diff-a9faffcf40600fd57aea5451cef5abe9R4214) reports in the log?

Actions #3

Updated by Ben England over 4 years ago

  • Target version set to v14.2.3

version you asked for:

ceph-base-14.2.2-0.el7.x86_64

from the Ceph container image ceph/ceph:v14.2.2-20190722

Actions #4

Updated by Ben England over 4 years ago

from Joe T on his system:
  ceph version
    ceph version 14.2.2-218-g734b519 (734b5199dc45d3d36c8d8d066d6249cc304d0e0e) nautilus (stable)
Actions #5

Updated by Neha Ojha over 4 years ago

Neha Ojha wrote:

    Can you enable debug_osd=10 and see what this line (https://github.com/ceph/ceph/commit/fc3bdad87597066a813a3734b2a79e803340be36#diff-a9faffcf40600fd57aea5451cef5abe9R4214) reports in the log?

my bad, it should be debug_bluestore=10

Actions #6

Updated by Josh Durgin over 4 years ago

  • Priority changed from Normal to Urgent

Joe Talerico reproduced this and found that the POD_LIMIT was getting set, but not the system-wide limit, so the current OSD code, which reads only the system-wide limit, has no effect.

The info available in the OSD container was:

cat /proc/80946/cgroup 
12:pids:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
11:perf_event:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
10:devices:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
9:hugetlb:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
8:freezer:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
7:blkio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
6:cpu,cpuacct:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
5:memory:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
4:rdma:/
3:net_cls,net_prio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
2:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
1:name=systemd:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope

cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/memory.limit_in_bytes 
6442450944

cat /sys/fs/cgroup/memory/memory.limit_in_bytes
9223372036854771712
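
A minimal sketch of the idea (my illustration, not the actual Ceph fix): resolve the process's own memory cgroup from /proc/self/cgroup and read that cgroup's memory.limit_in_bytes, rather than the system-wide /sys/fs/cgroup/memory/memory.limit_in_bytes:

    # Illustrative sketch (cgroup v1): find this process's memory cgroup and read
    # its limit, instead of the root cgroup's effectively-unlimited value.
    def read_own_memory_limit():
        cgroup_path = "/"
        with open("/proc/self/cgroup") as f:
            for line in f:
                # v1 entries look like "5:memory:/kubepods.slice/.../crio-<id>.scope"
                _, controllers, path = line.strip().split(":", 2)
                if "memory" in controllers.split(","):
                    cgroup_path = path
                    break
        with open("/sys/fs/cgroup/memory" + cgroup_path + "/memory.limit_in_bytes") as f:
            return int(f.read().strip())

    print(read_own_memory_limit())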

Actions #7

Updated by Ben England over 4 years ago

My guess would be that if the cgroup limit is X, then 0.95 * X - 0.5 GB should be fine for osd_memory_target. That would give the OSDs time to detect that they are over their limit and purge the cache to bring it back down, while wasting as little memory as possible (see the worked example below).

The worst case here is really intense I/O to NVM devices, where the cache size can increase rapidly before the OSD can detect that there is a problem. In the past, HDDs limited the IOPS rate.
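
As a worked example of that rule of thumb (my arithmetic only, not a tested recommendation), with the 6 GiB pod limit from this report:

    # Worked example of the suggested heuristic: 0.95 * limit - 0.5 GiB.
    GiB = 1024**3
    cgroup_limit = 6 * GiB                          # 6442450944 bytes
    suggested_target = int(0.95 * cgroup_limit - 0.5 * GiB)
    print(suggested_target)                         # 5583457484 bytes, ~5.2 GiB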

Actions #8

Updated by Mark Nelson over 4 years ago

@Ben England that's probably a semi-reasonable assumption in a lot of cases, though I've noticed that the kernel doesn't always reclaim unmapped memory right away, which can make this really tricky. CentOS and RHEL seem to do better than Ubuntu does, though, and I have no idea how containers interact with it.

Typically I just say to give them an extra 20%, but I could be convinced that something like 5% + X MB could also work.

Actions #9

Updated by Sage Weil over 4 years ago

  • Status changed from New to Fix Under Review
  • Backport set to nautilus
Actions #10

Updated by Sage Weil over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #11

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #41273: nautilus: Containerized cluster failure due to osd_memory_target not being set to ratio of cgroup_limit per osd_memory_target_cgroup_limit_ratio added
Actions #12

Updated by Josh Durgin over 4 years ago

  • Status changed from Pending Backport to Resolved