Bug #41037
Containerized cluster failure due to osd_memory_target not being set to ratio of cgroup_limit per osd_memory_target_cgroup_limit_ratio
Status: Closed
Description
Under heavy I/O workload (generated against multiple Postgres databases, backed by Ceph RBD, via the pgbench utility), we have experienced a condition where the memory limit for the OSD pods is reached, OOM kills occur, and the cluster then becomes unresponsive. This is a critical failure, apparently caused by the osd_memory_target parameter not being sized with headroom below the cgroup_limit. Mark Nelson has repeatedly said that 10-20% of headroom is needed to ensure that the OSD can trim its memory before the OOM killer is triggered.

Note that because Bluestore computes osd_memory_target for us based on the cgroup limit, there is no way to override this, so this is a high-priority problem!
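For illustration, the sizing rule we expected works out as in this minimal sketch (the expected_osd_memory_target function is our own stand-in for the behavior, not actual Ceph code; 0.8 is the default ratio discussed below):

# Minimal sketch of the expected sizing rule; not actual Ceph source.
def expected_osd_memory_target(cgroup_limit_bytes: int, ratio: float = 0.8) -> int:
    # Leave (1 - ratio) headroom below the cgroup limit so the OSD
    # can trim its caches before the kernel OOM killer fires.
    return int(cgroup_limit_bytes * ratio)

limit = 6 * 1024 ** 3                     # 6Gi pod memory limit = 6442450944 bytes
print(expected_osd_memory_target(limit))  # 5153960755 (~4.8Gi), not 6442450944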
Deploying containerized Ceph via Rook.io, we apply a memory limit to the OSDs via cluster.yaml, but not a memory request:
resources:
  osd:
    requests:
      cpu: "2"
    limits:
      cpu: "2"
      memory: "6Gi"
Per fc3bdad [1], our expectation is that, without a memory request, the value of osd_memory_target should default to cgroup_limit * osd_memory_target_cgroup_limit_ratio (0.8 by default). However, the deployed cluster shows:
sh-4.2# ceph daemon /var/lib/rook/osd0/ceph-osd.0.asok config show | grep osd_memory
    "osd_memory_base": "805306368",
    "osd_memory_cache_min": "134217728",
    "osd_memory_cache_resize_interval": "1.000000",
    "osd_memory_expected_fragmentation": "0.150000",
    "osd_memory_target": "6442450944",                    <-- Target = limit instead of (limit * 0.8)
    "osd_memory_target_cgroup_limit_ratio": "0.800000",   <-- Ratio looks correct
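For reference, 6Gi = 6442450944 bytes, so with the 0.8 ratio the target should have been int(6442450944 * 0.8) = 5153960755 bytes (~4.8Gi), matching the sketch above; instead the target equals the full cgroup limit.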
The cluster status during the failure:

  cluster:
    id:     ffa396e4-7874-472d-95fa-692d754f5e6e
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            2 osds down
            15/76461 objects unfound (0.020%)
            Reduced data availability: 309 pgs inactive, 309 pgs peering, 103 pgs stale
            Possible data damage: 10 pgs recovery_unfound
            Degraded data redundancy: 73406/229339 objects degraded (32.008%), 669 pgs degraded, 644 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 2w)
    mgr: a(active, since 2d)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 12 osds: 6 up (since 24m), 8 in (since 23m); 104 remapped pgs

  data:
    pools:   3 pools, 1224 pgs
    objects: 76.46k objects, 296 GiB
    usage:   638 GiB used, 11 TiB / 12 TiB avail
    pgs:     25.245% pgs not active
             73406/229339 objects degraded (32.008%)
             15/76461 objects unfound (0.020%)
             548 active+undersized+degraded
             253 peering
             145 active+clean
             100 stale+active+clean
             56  remapped+peering
             40  active+recovery_wait+undersized+degraded
             36  active+undersized+degraded+remapped+backfill_wait
             23  active+recovery_wait+degraded
             9   active+recovery_wait+undersized+degraded+remapped
             6   active+recovery_unfound+undersized+degraded+remapped
             4   active+recovery_unfound+undersized+degraded
             2   stale+active+recovery_wait+degraded
             1   stale+active+undersized+degraded
             1   active+recovery_wait