Bug #54296

OSDs using too much memory

Added by Ruben Kerkhof about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of our customers upgraded from Nautilus to Octopus, and now many of his OSDs are using far more RAM than allowed by osd_memory_target.
Here's one:
$ sudo ceph daemon osd.75 config get osd_memory_target
{
    "osd_memory_target": "4294967296"
}
$ sudo ceph daemon osd.75 dump_mempools | jq '.mempool.total'
{
  "items": 51529102,
  "bytes": 4589310658
}
$ ps -o rss -p $(pgrep -f '/usr/bin/ceph-osd -f --cluster ceph --id 75 --setuser ceph --setgroup ceph')
  RSS
9134604
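
For reference, a small sketch (not from the original report) of how to run the same three checks across every OSD on a host. It assumes the default admin socket naming under /var/run/ceph and the jq paths visible in the output above; the pgrep pattern is illustrative.

for sock in /var/run/ceph/ceph-osd.*.asok; do
    # derive the OSD id from the socket name, e.g. ceph-osd.75.asok -> 75
    id=${sock##*/}; id=${id#ceph-osd.}; id=${id%.asok}
    target=$(sudo ceph daemon osd.$id config get osd_memory_target | jq -r '.osd_memory_target')
    mempool=$(sudo ceph daemon osd.$id dump_mempools | jq -r '.mempool.total.bytes')
    rss=$(ps -o rss= -p $(pgrep -f "id $id --setuser") 2>/dev/null)
    echo "osd.$id target=$target mempool_bytes=$mempool rss_kb=$rss"
done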

Some more details about his cluster:
All nodes are running 15.2.15.
He has HDD nodes in the default CRUSH root and SSD nodes in a separate CRUSH tree.
The HDD OSDs use approximately 4 GB of RAM; only the SSD OSDs use about double that. He runs nightly snap trims only on the SSD-backed pools.

Please let me know what additional details I can provide.


Files

mempools.txt (3.21 KB) - dump_mempools output - Ruben Kerkhof, 02/16/2022 12:19 PM
pg-dump.txt.gz (224 KB) - Ruben Kerkhof, 02/17/2022 09:46 AM

Related issues: 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #53729: ceph-osd takes all memory before oom on boot (Resolved, Nitzan Mordechai)

Actions #1

Updated by Igor Fedotov about 2 years ago

Hi Ruben,
please share full dump_mempools output.

Actions #3

Updated by Ruben Kerkhof about 2 years ago

Hi Igor,

See attachment.

One thing I tried was to set osd_max_pg_log_entries to 500 instead of the default of 10000, but this didn't help.

Actions #4

Updated by Dan van der Ster about 2 years ago

Ruben Kerkhof wrote:

One thing I tried was to set osd_max_pg_log_entries to 500 instead of the default of 10000, but this didn't help.

What exactly did you do to try this?

In our clusters, when the osd_pglog mempool exploded, we set osd_max_pg_log_entries and osd_min_pg_log_entries to 500 in the ceph.conf [osd] section and restarted the OSDs.
The active OSD for a PG needs to have these settings for the trim to take effect.
You can also check the LOG column in `ceph pg dump` to see the current number of entries per PG.
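
For illustration, a sketch of the two steps described above: the ceph.conf override plus one way to pull the per-PG log sizes. The JSON field names (pg_stats, log_size) are an assumption based on the Octopus pg_stats schema; the plain-text LOG column of `ceph pg dump pgs` shows the same numbers.

[osd]
osd_max_pg_log_entries = 500
osd_min_pg_log_entries = 500

# list per-PG log sizes, largest first (restart the OSDs after changing ceph.conf)
ceph pg dump pgs --format json 2>/dev/null \
    | jq -r '.pg_stats[] | "\(.pgid) \(.log_size)"' \
    | sort -k2 -nr | head -20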

Actions #5

Updated by Ruben Kerkhof about 2 years ago

Hi Dan,

Thanks for your response.
I only adjusted osd_max_pg_log_entries and left osd_min_pg_log_entries alone. All OSDs that use SSDs were restarted.
I've attached the output of `ceph pg dump pgs`; the average log size is between 500 and 600 entries for the SSD-backed pool that has the issue (pool 11 in the dump).

Actions #6

Updated by Neha Ojha about 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #7

Updated by Dan van der Ster about 2 years ago

  • Related to Bug #53729: ceph-osd takes all memory before oom on boot added
Actions #8

Updated by Dan van der Ster about 2 years ago

Hi Ruben, did you make any more progress on this?

I'm going through all the osd pglog memory usage tickets, and it looks like most or all of them are happening on SSD-backed OSDs.
I'm guessing here, but one notable difference is that SSDs have 2 op threads per shard by default, HDDs have 1.

Does the memory usage look more reasonable if you configure the SSD osds to use the _hdd defaults for shards/threads?

[osd]
osd_op_num_shards = 5
osd_op_num_threads_per_shard = 1
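
For reference, one way to check what a running SSD OSD is actually using (option names as in Octopus, where 0 for the generic option means the matching _hdd/_ssd variant applies; osd.75 is just the example OSD from the description):

sudo ceph daemon osd.75 config get osd_op_num_shards                 # 0 = the hdd/ssd variant below is used
sudo ceph daemon osd.75 config get osd_op_num_shards_ssd
sudo ceph daemon osd.75 config get osd_op_num_threads_per_shard_ssd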

Actions #9

Updated by Ruben Kerkhof about 2 years ago

Dan van der Ster wrote:

Hi Ruben, did you make any more progress on this?

Hi Dan, sorry, I missed your update. Not yet. The customer did upgrade a few hosts to Ubuntu 20.04, but a few of those OSDs also show high memory usage, so tcmalloc seems to be a red herring and the OS doesn't affect this.

I'm going through all the osd pglog memory usage tickets, and it looks like most or all of them are happening on SSD-backed OSDs.
I'm guessing here, but one notable difference is that SSDs have 2 op threads per shard by default, HDDs have 1.

Does the memory usage look more reasonable if you configure the SSD osds to use the _hdd defaults for shards/threads?

[osd]
osd_op_num_shards = 5
osd_op_num_threads_per_shard = 1

Excellent idea! I'll ask the customer and get back with the results.

Actions #10

Updated by Ruben Kerkhof about 2 years ago

Excellent idea! I'll ask the customer and get back with the results.

We restarted the OSDs on a single node with these settings, and left them running for a few days, but no change.
The OSDs still use between 3 GiB and 4 GiB for the pglog mempool, and around 8 GB RSS.

Actions #11

Updated by Dan van der Ster about 2 years ago

Ruben Kerkhof wrote:

Excellent idea! I'll ask the customer and get back with the results.

We restarted the OSDs on a single node with these settings, and left them running for a few days, but no change.
The OSDs still use between 3 GiB and 4 GiB for the pglog mempool, and around 8 GB RSS.

Okay, it is likely this: https://tracker.ceph.com/issues/53729 (see the last week or so of comments)

You can confirm by stopping the OSD and checking the number of dups in the PGs (it is normally under 3k per PG; based on your memory usage, you may have something like 100k to a million in a few PGs):

PGS=$(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op list-pgs)
for p in $PGS; do echo $p; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op log --pgid $p | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; echo; done
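
A sketch of a variant of the same loop (same tool invocations, same illustrative osd.0 path) that prints one line per PG and sorts so the PGs with the most dups show up first:

for p in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op list-pgs); do
    # print "pgid log_entries dup_entries" for each PG
    counts=$(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op log --pgid "$p" \
        | jq -r '"\(.pg_log_t.log|length) \(.pg_log_t.dups|length)"')
    echo "$p $counts"
done | sort -k3 -nr | head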

There's nothing you can do until that fix is available.

Actions #12

Updated by Ruben Kerkhof about 2 years ago

Dan van der Ster wrote:

Ruben Kerkhof wrote:

Excellent idea! I'll ask the customer and get back with the results.

We restarted the OSDs on a single node with these settings, and left them running for a few days, but no change.
The OSDs still use between 3 GiB and 4 GiB for the pglog mempool, and around 8 GB RSS.

Okay, it is likely this: https://tracker.ceph.com/issues/53729 (see the last week or so of comments)

You can confirm by stopping the OSD and checking the number of dups in the PGs (it is normally under 3k per PG; based on your memory usage, you may have something like 100k to a million in a few PGs):

[...]

There's nothing you can do until that fix is available.

Hi Dan,

Great catch, that seems to be it. Most PGs show:
11.e4   log=554   dups=2499
11.b4   log=579   dups=2499
11.97   log=599   dups=2499
11.8b   log=544   dups=2499

But there are certainly outliers:

11.3b1  log=520   dups=1589213
11.37a  log=519   dups=602198
11.2c   log=595   dups=2794697
11.3f   log=511   dups=998223

Actions #13

Updated by Radoslaw Zarzynski about 2 years ago

  • Related to deleted (Bug #53729: ceph-osd takes all memory before oom on boot)
Actions #14

Updated by Radoslaw Zarzynski about 2 years ago

  • Is duplicate of Bug #53729: ceph-osd takes all memory before oom on boot added
Actions #15

Updated by Neha Ojha almost 2 years ago

  • Status changed from New to Resolved