Bug #41037

Containerized cluster failure due to osd_memory_target not being set to ratio of cgroup_limit per osd_memory_target_cgroup_limit_ratio

Added by Dustin Black 19 days ago. Updated 11 days ago.

Status: Pending Backport
Priority: Urgent
Assignee: -
Target version:
Start date: 07/31/2019
Due date:
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Under a heavy I/O workload (generated against multiple PostgreSQL databases backed by Ceph RBD, using the pgbench utility), we have experienced a condition where the memory limit for the OSD pods is reached, OOM kills happen, and the cluster then becomes unresponsive. This is a critical failure, apparently caused by the osd_memory_target parameter not being sized with headroom below the cgroup_limit. Mark Nelson has repeatedly said that 10-20% headroom is needed to ensure that the OSD can trim its memory before the OOM killer is triggered.

Note that because BlueStore computes osd_memory_target for us based on the cgroup limit, there is no way to override this, so this is a high-priority problem!

Deploying containerized Ceph via Rook.io, we apply a memory limit to the OSDs via cluster.yaml, but not a memory request:

  resources:
    osd:
      requests:
        cpu: "2" 
      limits:
        cpu: "2" 
        memory: "6Gi" 
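A quick way to double-check which resource limits actually landed on the OSD pods (a hedged example; the rook-ceph namespace and the app=rook-ceph-osd label are assumptions matching a typical Rook deployment and may differ in other setups):

  kubectl -n rook-ceph get pod -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'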

Per fc3bdad [1], our expectation is that, without a memory request, osd_memory_target should default to cgroup_limit * osd_memory_target_cgroup_limit_ratio (0.8 by default). However, the deployed cluster shows:

sh-4.2# ceph daemon /var/lib/rook/osd0/ceph-osd.0.asok config show | grep osd_memory
    "osd_memory_base": "805306368",
    "osd_memory_cache_min": "134217728",
    "osd_memory_cache_resize_interval": "1.000000",
    "osd_memory_expected_fragmentation": "0.150000",
    "osd_memory_target": "6442450944",  <-- Target = limit instead of = (limit*0.8)
    "osd_memory_target_cgroup_limit_ratio": "0.800000",  <-- Ratio looks correct
  cluster:
    id:     ffa396e4-7874-472d-95fa-692d754f5e6e
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            2 osds down
            15/76461 objects unfound (0.020%)
            Reduced data availability: 309 pgs inactive, 309 pgs peering, 103 pgs stale
            Possible data damage: 10 pgs recovery_unfound
            Degraded data redundancy: 73406/229339 objects degraded (32.008%), 669 pgs degraded, 644 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 2w)
    mgr: a(active, since 2d)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 12 osds: 6 up (since 24m), 8 in (since 23m); 104 remapped pgs

  data:
    pools:   3 pools, 1224 pgs
    objects: 76.46k objects, 296 GiB
    usage:   638 GiB used, 11 TiB / 12 TiB avail
    pgs:     25.245% pgs not active
             73406/229339 objects degraded (32.008%)
             15/76461 objects unfound (0.020%)
             548 active+undersized+degraded
             253 peering
             145 active+clean
             100 stale+active+clean
             56  remapped+peering
             40  active+recovery_wait+undersized+degraded
             36  active+undersized+degraded+remapped+backfill_wait
             23  active+recovery_wait+degraded
             9   active+recovery_wait+undersized+degraded+remapped
             6   active+recovery_unfound+undersized+degraded+remapped
             4   active+recovery_unfound+undersized+degraded
             2   stale+active+recovery_wait+degraded
             1   stale+active+undersized+degraded
             1   active+recovery_wait

[1] https://github.com/ceph/ceph/commit/fc3bdad87597066a813a3734b2a79e803340be36#diff-a9faffcf40600fd57aea5451cef5abe9


Related issues

Copied to bluestore - Backport #41273: nautilus: Containerized cluster failure due to osd_memory_target not being set to ratio of cgroup_limit per osd_memory_target_cgroup_limit_ratio New

History

#1 Updated by Neha Ojha 19 days ago

Which version of Ceph are you running?

#2 Updated by Neha Ojha 19 days ago

#3 Updated by Ben England 18 days ago

  • Target version set to v14.2.3

version you asked for:

ceph-base-14.2.2-0.el7.x86_64

from the Ceph container image ceph/ceph:v14.2.2-20190722

#4 Updated by Ben England 18 days ago

from Joe T on his system:

  ceph version
  ceph version 14.2.2-218-g734b519 (734b5199dc45d3d36c8d8d066d6249cc304d0e0e) nautilus (stable)

#5 Updated by Neha Ojha 18 days ago

Neha Ojha wrote:

Can you enable debug_osd=10 and see what this line (https://github.com/ceph/ceph/commit/fc3bdad87597066a813a3734b2a79e803340be36#diff-a9faffcf40600fd57aea5451cef5abe9R4214) reports in the log?

my bad, it should be debug_bluestore=10
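
For anyone reproducing this, one way to bump the log level on a running OSD is via its admin socket, using the same asok path shown in the description:

  sh-4.2# ceph daemon /var/lib/rook/osd0/ceph-osd.0.asok config set debug_bluestore 10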

#6 Updated by Josh Durgin 18 days ago

  • Priority changed from Normal to Urgent

Joe Talerico reproduced this and found that POD_LIMIT was getting set, but not the system-wide limit, so the current OSD code, which reads only the system-wide limit, would have no effect; a sketch of reading the pod-scoped limit follows the output below.

The information available in the OSD container was:

cat /proc/80946/cgroup 
12:pids:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
11:perf_event:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
10:devices:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
9:hugetlb:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
8:freezer:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
7:blkio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
6:cpu,cpuacct:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
5:memory:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
4:rdma:/
3:net_cls,net_prio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
2:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope
1:name=systemd:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/crio-7d8c95780ecfe701043003a914b4ab2cb410c0139478e46db8f07008ba8e733e.scope

cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod688470ce_b491_11e9_b8b1_98039b616b98.slice/memory.limit_in_bytes 
6442450944

cat /sys/fs/cgroup/memory/memory.limit_in_bytes
9223372036854771712
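
A minimal sketch of the idea (not the actual Ceph change): resolve the process's own memory cgroup from /proc/self/cgroup and read that path's limit instead of the root memory.limit_in_bytes, assuming the host cgroup hierarchy is visible under /sys/fs/cgroup/memory as in the output above:

  # Resolve the memory cgroup this process belongs to.
  cgpath=$(awk -F: '$2 == "memory" { print $3 }' /proc/self/cgroup)
  # Read the limit for that cgroup rather than for the root hierarchy.
  cat "/sys/fs/cgroup/memory${cgpath}/memory.limit_in_bytes"

Note that in the output above the 6 Gi limit was read at the pod slice, one level above the crio scope, so a real fix may need to walk up the hierarchy to find the effective limit.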

#7 Updated by Ben England 14 days ago

My guess would be that if the cgroup limit is X, then 0.95*X - 0.5 GB should be fine for osd_memory_target. That would give the OSDs time to detect that they are over their limit and purge the cache to bring it back down, while wasting as little memory as possible.

The worst case here is really intense I/O to NVM devices, where the cache size can increase rapidly before the OSD can detect that there is a problem. In the past, HDDs limited the IOPS rate.

#8 Updated by Mark Nelson 14 days ago

@ben that's probably a semi-reasonable assumption in a lot of cases, though I've noticed that the kernel doesn't always reclaim unmapped memory right away, which can make this really tricky. CentOS and RHEL seem to do better than Ubuntu does, though, and I have no idea how containers interact with it.

Typically I just say to give them an extra 20%, but I could be convinced that something like 5% + X MB could also work.
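
Applying the two suggestions to the 6 Gi pod limit from this report (just arithmetic for comparison, not a recommendation):

  limit=6442450944
  # Ben's suggestion: 0.95 * limit - 0.5 GiB
  awk -v l="$limit" 'BEGIN { printf "%.0f\n", 0.95 * l - 536870912 }'   # 5583457485 (~5.2 Gi)
  # 20% headroom, i.e. the current default ratio of 0.8
  awk -v l="$limit" 'BEGIN { printf "%.0f\n", 0.80 * l }'               # 5153960755 (~4.8 Gi)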

#9 Updated by Sage Weil 13 days ago

  • Status changed from New to Need Review
  • Backport set to nautilus

#10 Updated by Sage Weil 11 days ago

  • Status changed from Need Review to Pending Backport

#11 Updated by Nathan Cutler 5 days ago

  • Copied to Backport #41273: nautilus: Containerized cluster failure due to osd_memory_target not being set to ratio of cgroup_limit per osd_memory_target_cgroup_limit_ratio added
