Bug #54296

OSDs using too much memory

Added by Ruben Kerkhof about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of our customers upgraded from Nautilus to Octopus, and now many of his OSDs are using far more RAM than allowed by osd_memory_target.
Here's one:
$ sudo ceph daemon osd.75 config get osd_memory_target
{
    "osd_memory_target": "4294967296"
}
$ sudo ceph daemon osd.75 dump_mempools | jq '.mempool.total'
{
  "items": 51529102,
  "bytes": 4589310658
}
$ ps -o rss -p $(pgrep -f '/usr/bin/ceph-osd -f --cluster ceph --id 75 --setuser ceph --setgroup ceph')
  RSS
9134604
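
For reference, a small sketch (not from the original report) of how to run the same three checks across every OSD on a host. It assumes the default admin socket naming under /var/run/ceph and the jq paths visible in the output above; the pgrep pattern is illustrative.

for sock in /var/run/ceph/ceph-osd.*.asok; do
    # derive the OSD id from the socket name, e.g. ceph-osd.75.asok -> 75
    id=${sock##*/}; id=${id#ceph-osd.}; id=${id%.asok}
    target=$(sudo ceph daemon osd.$id config get osd_memory_target | jq -r '.osd_memory_target')
    mempool=$(sudo ceph daemon osd.$id dump_mempools | jq -r '.mempool.total.bytes')
    rss=$(ps -o rss= -p $(pgrep -f "id $id --setuser") 2>/dev/null)
    echo "osd.$id target=$target mempool_bytes=$mempool rss_kb=$rss"
done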

Some more details about his cluster:
All nodes are running 15.2.15.
He has HDD nodes in the default CRUSH root and SSD nodes in a separate CRUSH tree.
The HDD OSDs use approximately 4 GB of RAM; only the SSD OSDs use about double that. He runs nightly snap trims only on the SSD-backed pools.

Please let me know what additional details I can provide.


Files

mempools.txt (3.21 KB) - dump_mempools output - Ruben Kerkhof, 02/16/2022 12:19 PM
pg-dump.txt.gz (224 KB) - Ruben Kerkhof, 02/17/2022 09:46 AM

Related issues: 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #53729: ceph-osd takes all memory before oom on boot (Resolved, Nitzan Mordechai)

Actions #1

Updated by Igor Fedotov about 2 years ago

Hi Ruben,
please share full dump_mempools output.

Actions #3

Updated by Ruben Kerkhof about 2 years ago

Hi Igor,

See attachment.

One thing I tried was to set osd_max_pg_log_entries to 500 instead of the default of 10000, but this didn't help.

Actions #4

Updated by Dan van der Ster about 2 years ago

Ruben Kerkhof wrote:

One thing I tried was to set osd_max_pg_log_entries to 500 instead of the default of 10000, but this didn't help.

What exactly did you do to try this?

In our clusters, when the osd_pglog mempool exploded, we set osd_max_pg_log_entries and osd_min_pg_log_entries to 500 in the ceph.conf [osd] section and restarted the OSDs.
The active OSD for a PG needs to have these settings for the trim to take effect.
You can also check the LOG column in `ceph pg dump` to see the current number of entries per PG.
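
For illustration, a sketch of the two steps described above: the ceph.conf override plus one way to pull the per-PG log sizes. The JSON field names (pg_stats, log_size) are an assumption based on the Octopus pg_stats schema; the plain-text LOG column of `ceph pg dump pgs` shows the same numbers.

[osd]
osd_max_pg_log_entries = 500
osd_min_pg_log_entries = 500

# list per-PG log sizes, largest first (restart the OSDs after changing ceph.conf)
ceph pg dump pgs --format json 2>/dev/null \
    | jq -r '.pg_stats[] | "\(.pgid) \(.log_size)"' \
    | sort -k2 -nr | head -20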

Actions #5

Updated by Ruben Kerkhof about 2 years ago

Hi Dan,

Thanks for your response.
I only adjusted osd_max_pg_log_entries and left osd_min_pg_log_entries alone. All OSDs that use SSDs were restarted.
I've attached the output of `ceph pg dump pgs`; the average log size is between 500 and 600 entries for the SSD-backed pool that has the issue (pool 11 in the dump).

Actions #6

Updated by Neha Ojha about 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #7

Updated by Dan van der Ster about 2 years ago

  • Related to Bug #53729: ceph-osd takes all memory before oom on boot added
Actions #8

Updated by Dan van der Ster about 2 years ago

Hi Ruben, did you make any more progress on this?

I'm going through all the osd pglog memory usage tickets, and it looks like most or all of them are happening on SSD-backed OSDs.
I'm guessing here, but one notable difference is that SSDs have 2 op threads per shard by default, HDDs have 1.

Does the memory usage look more reasonable if you configure the SSD osds to use the _hdd defaults for shards/threads?

[osd]
osd_op_num_shards = 5
osd_op_num_threads_per_shard = 1
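
For reference, one way to check what a running SSD OSD is actually using (option names as in Octopus, where 0 for the generic option means the matching _hdd/_ssd variant applies; osd.75 is just the example OSD from the description):

sudo ceph daemon osd.75 config get osd_op_num_shards                 # 0 = the hdd/ssd variant below is used
sudo ceph daemon osd.75 config get osd_op_num_shards_ssd
sudo ceph daemon osd.75 config get osd_op_num_threads_per_shard_ssd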

Actions #9

Updated by Ruben Kerkhof about 2 years ago

Dan van der Ster wrote:

Hi Ruben, did you make any more progress on this?

Hi Dan, sorry, I missed your update. Not yet. The customer did upgrade a few hosts to Ubuntu 20.04, but a few of those OSDs also show high memory usage, so tcmalloc seems to be a red herring and the OS doesn't affect this.

I'm going through all the osd pglog memory usage tickets, and it looks like most or all of them are happening on SSD-backed OSDs.
I'm guessing here, but one notable difference is that SSDs have 2 op threads per shard by default, HDDs have 1.

Does the memory usage look more reasonable if you configure the SSD osds to use the _hdd defaults for shards/threads?

[osd]
osd_op_num_shards = 5
osd_op_num_threads_per_shard = 1

Excellent idea! I'll ask the customer and get back with the results.

Actions #10

Updated by Ruben Kerkhof about 2 years ago

Excellent idea! I'll ask the customer and get back with the results.

We restarted the OSDs on a single node with these settings, and left them running for a few days, but no change.
The OSDs still use between 3 GiB and 4 GiB for the pglog mempool, and around 8 GB RSS.

Actions #11

Updated by Dan van der Ster about 2 years ago

Ruben Kerkhof wrote:

Excellent idea! I'll ask the customer and get back with the results.

We restarted the OSDs on a single node with these settings, and left them running for a few days, but no change.
The OSDs still use between 3 GiB and 4 GiB for the pglog mempool, and around 8 GB RSS.

Okay, it is likely this: https://tracker.ceph.com/issues/53729 (see the last week or so of comments)

You can confirm by stopping the OSD and checking the number of dups in the PGs (it is normally under 3k per PG; based on your memory usage, you may have something like 100k to a million in a few PGs):

PGS=$(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op list-pgs)
for p in $PGS; do echo $p; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op log --pgid $p | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; echo; done
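
A sketch of a variant of the same loop (same tool invocations, same illustrative osd.0 path) that prints one line per PG and sorts so the PGs with the most dups show up first:

for p in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op list-pgs); do
    # print "pgid log_entries dup_entries" for each PG
    counts=$(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op log --pgid "$p" \
        | jq -r '"\(.pg_log_t.log|length) \(.pg_log_t.dups|length)"')
    echo "$p $counts"
done | sort -k3 -nr | head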

There's nothing you can do until that fix is available.

Actions #12

Updated by Ruben Kerkhof about 2 years ago

Dan van der Ster wrote:

Ruben Kerkhof wrote:

Excellent idea! I'll ask the customer and get back with the results.

We restarted the OSDs on a single node with these settings, and left them running for a few days, but no change.
The OSDs still use between 3 GiB and 4 GiB for the pglog mempool, and around 8 GB RSS.

Okay, it is likely this: https://tracker.ceph.com/issues/53729 (see the last week or so of comments)

You can confirm by stopping the OSD and checking the number of dups in the PGs (it is normally under 3k per PG; based on your memory usage, you may have something like 100k to a million in a few PGs):

[...]

There's nothing you can do until that fix is available.

Hi Dan,

Great catch, that seems to be it. Most PGs show:
11.e4   log=554   dups=2499
11.b4   log=579   dups=2499
11.97   log=599   dups=2499
11.8b   log=544   dups=2499

But there are certainly outliers:

11.3b1  log=520   dups=1589213
11.37a  log=519   dups=602198
11.2c   log=595   dups=2794697
11.3f   log=511   dups=998223

Actions #13

Updated by Radoslaw Zarzynski about 2 years ago

  • Related to deleted (Bug #53729: ceph-osd takes all memory before oom on boot)
Actions #14

Updated by Radoslaw Zarzynski about 2 years ago

  • Is duplicate of Bug #53729: ceph-osd takes all memory before oom on boot added
Actions #15

Updated by Neha Ojha almost 2 years ago

  • Status changed from New to Resolved