Project

General

Profile

Bug #53729

ceph-osd takes all memory before oom on boot

Added by Gonzalo Aguilar Delgado 12 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Urgent
Category:
Performance/Resource Usage
Target version:
% Done:

100%

Source:
Tags:
Backport:
octopus, pacific, quincy
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, I cannot boot half of my OSD all of them die by oom killed.

It seems they are taking all the memory. Everything goes right until logs "logging to monitors" in the osd LOG then it takes available memory + swap.
I added a big swapfile to let it run:

Seems to be similar to this bug: https://tracker.ceph.com/issues/51609 but I don't see splitting stuff.

MiB Mem :  15905,5 total,    203,3 libre,  15511,0 usado,    191,2 búfer/caché
MiB Intercambio:  35630,0 total,  17065,5 libre,  18564,5 usado.    130,2 dispon Mem 

    PID USUARIO   PR  NI    VIRT    RES    SHR S  %CPU  %MEM     HORA+ ORDEN                                            
 222720 ceph      20   0   34,5g  13,9g      0 S   2,9  89,7   2:31.37 ceph-osd    

Last lines of the log are:

2021-12-25T22:09:11.078+0100 7fdc2b4a7640  1 osd.6 pg_epoch: 1492087 pg[9.e( v 1492033'2437243 lc 0'0 (1491849'2435049,1492033'2437243] local-lis/les=1389428/1389429 n=521 ec=417/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1492087) [] r=-1 lpr=1492087 pi=[1389428,1492087)/1 crt=1492033'2437243 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [13] -> [], acting [13] -> [], acting_primary 13 -> -1, up_primary 13 -> -1, role -1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2021-12-25T22:09:11.146+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:b31b205f:::osdmap.1492088:0# 0~11375/11375
2021-12-25T22:09:11.234+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:b99b205f:::osdmap.1492089:0# 0~11375/11375
2021-12-25T22:09:11.358+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:bf7b205f:::osdmap.1492090:0# 0~11375/11375
2021-12-25T22:09:11.442+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:b1fb205f:::osdmap.1492091:0# 0~11375/11375
2021-12-25T22:09:11.498+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:bafb205f:::osdmap.1492092:0# 0~11375/11375
2021-12-25T22:09:11.606+0100 7fdc2b4a7640  1 osd.6 pg_epoch: 1492093 pg[9.e( v 1492033'2437243 lc 0'0 (1491849'2435049,1492033'2437243] local-lis/les=1389428/1389429 n=521 ec=417/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1492087) [] r=-1 lpr=1492087 pi=[1389428,1492087)/1 crt=1492033'2437243 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2021-12-25T22:09:11.634+0100 7fdc2b4a7640  5 filestore(/var/lib/ceph/osd/ceph-6) queue_transactions(2324): osr 0x5640b8bb1b50 osr(9.e_head)
2021-12-25T22:21:12.083+0100 7fdc69ffb640 10 filestore(/var/lib/ceph/osd/ceph-6) sync_entry(4271): commit took 1.194381356s, interval was 1001.232788086s


Then it start to be much slow and it finally fails...

When it starts to take memory I can see in the cosole the log_to_monitors line...

2021-12-25T22:04:28.407+0100 7fdc843ed2c0 -1 Falling back to public interface
2021-12-25T22:04:30.943+0100 7fdc843ed2c0 -1 journal do_read_entry(2006315008): bad header magic
2021-12-25T22:04:30.943+0100 7fdc843ed2c0 -1 journal do_read_entry(2006315008): bad header magic
2021-12-25T22:05:26.490+0100 7fdc843ed2c0 -1 osd.6 1492093 log_to_monitors {default=true}


Version here is:
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)

Mempools...
...

{{{

            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing_deferred": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing": {
                "items": 0,
                "bytes": 0
            },
            "bluefs": {
                "items": 0,
                "bytes": 0
            },
            "bluefs_file_reader": {
                "items": 0,
                "bytes": 0
            },
            "bluefs_file_writer": {
                "items": 0,
                "bytes": 0
            },
            "buffer_anon": {
                "items": 6790059,
                "bytes": 27660504904
            },
            "buffer_meta": {
                "items": 8906,
                "bytes": 783728
            },
            "osd": {
                "items": 188,
                "bytes": 2126656
            },
            "osd_mapbl": {
                "items": 52,
                "bytes": 631590
            },
            "osd_pglog": {
                "items": 23889940,
                "bytes": 2541361472
            },
            "osdmap": {
                "items": 8479,
                "bytes": 726184
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 30697624,
            "bytes": 30206134534
        }
    }
}

The osd's that are failing are the ones with filestore + leveldb. That's weird... Maybe it's related to that backend...

ceph-osd.3.log View (368 KB) Gonzalo Aguilar Delgado, 01/10/2022 09:34 PM

massif.out.23180 (271 KB) Gonzalo Aguilar Delgado, 01/10/2022 09:34 PM

pg-dump.log View (265 KB) Gonzalo Aguilar Delgado, 01/12/2022 08:30 PM

osd.3.mempools.log View (3.16 KB) Gonzalo Aguilar Delgado, 01/12/2022 08:37 PM

ceph-osd.3-99.log View - With setting indicated... (691 KB) Gonzalo Aguilar Delgado, 01/12/2022 08:40 PM

logs.aa (900 KB) Guillaume Fenollar, 03/10/2022 05:15 AM

logs.ab (900 KB) Guillaume Fenollar, 03/10/2022 05:15 AM

logs.ac (900 KB) Guillaume Fenollar, 03/10/2022 05:15 AM

logs.ad (793 KB) Guillaume Fenollar, 03/10/2022 05:15 AM

logs.bz2 (40.3 KB) Guillaume Fenollar, 03/11/2022 07:07 AM

logs.bz2 (46.1 KB) Guillaume Fenollar, 03/11/2022 01:01 PM

pg_log_dups.txt View (2.5 KB) Mykola Golub, 03/16/2022 02:20 PM

CephBug.png View - All memory out (336 KB) Gonzalo Aguilar Delgado, 04/20/2022 10:50 AM


Related issues

Related to RADOS - Bug #48468: ceph-osd crash before being up again Need More Info
Related to RADOS - Support #51609: OSD refuses to start (OOMK) due to pg split New
Related to RADOS - Feature #54115: Log pglog entry size in OSD log if it exceeds certain size limit In Progress
Related to RADOS - Bug #52261: OSD takes all memory and crashes, after pg_num increase Need More Info
Related to RADOS - Bug #51626: OSD uses all host memory (80g) on startup due to pg_split New
Duplicated by RADOS - Bug #54296: OSDs using too much memory Resolved
Copied to RADOS - Backport #55631: pacific: ceph-osd takes all memory before oom on boot Resolved
Copied to RADOS - Backport #55632: quincy: ceph-osd takes all memory before oom on boot Resolved
Copied to RADOS - Backport #55633: octopus: ceph-osd takes all memory before oom on boot Rejected

History

#1 Updated by Igor Fedotov 12 months ago

Could you please set debug-osd to 5/20 and share relevant OSD startup log?

#2 Updated by Sebastian Wagner 11 months ago

  • Related to Bug #48468: ceph-osd crash before being up again added

#3 Updated by Sebastian Wagner 11 months ago

  • Description updated (diff)

#4 Updated by Igor Fedotov 11 months ago

  • Related to Support #51609: OSD refuses to start (OOMK) due to pg split added

#5 Updated by Igor Fedotov 11 months ago

  • Priority changed from Normal to High

#7 Updated by Neha Ojha 11 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#8 Updated by Gonzalo Aguilar Delgado 11 months ago

I did something better. I added a new OSD with bluestore to see if it's a problem of the filestore backend.

Then I added one of the problematic pgs. That I can share if you want but it's huge. And booted the OSD. It failed same way that old filestore OSDs.

So I ran valgrind with it. It's incredible but one of runs made the OSD to go ahead. I don't know if valgrind gave enough time to the OSD to start compacting so it worked. Or it was because it was done at the very begining of the system boot and the system was free of io or any other resources... The result is that it worked.

So I inserted another one. In this case it was pg.9.c and this time I captured it. The log and the valgrind... Yes!

I attach it.

Command was: valgrind --tool=massif /usr/bin/ceph-osd -d --cluster ceph --id 3 --setuser ceph --setgroup ceph --debug_osd 5 --debug_mon 2 --debug_filestore 7 --debug_monc 99 --debug_ms 5 --debug_auth 5 2>&1 | tee ceph-osd.3.log

I see:
3,1 GiB: ceph::buffer::v15_2_0::list::refill_append_space(unsigned int) (in /usr/bin/ceph-osd)

To be the offending function.

Drilling down I see PGPeeringITem:run to start thread with clone and this kills the system.

Please inspect it. But to me it's clear that the problem is in the split. So if you can provide a workaround it will be appreciated. To just not split or something else.

#9 Updated by Gonzalo Aguilar Delgado 11 months ago

I was taking a look to:
3,1 GiB: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (in /usr/bin/ceph-osd)

Since it's a thread wait for available workers it seems there's something wrong there. I don't have the time to inspect... But I did two things

Set shards to a low number

[osd.3]
osd_op_num_shards_hdd = 0
osd_op_num_shards_ssd = 0

Crashes the OSD on boot.

[osd.3]
osd_op_num_shards_hdd = 1
osd_op_num_shards_ssd = 1

Takes memory in the very early initialization...

[osd.3]
osd_op_num_shards_hdd = 20
osd_op_num_shards_ssd = 20

Makes the OSD to work... Takes about 8Gb but it seems it goes ahead. I will try with another one to confirm. I will report results.

#10 Updated by Gonzalo Aguilar Delgado 11 months ago

Forget about previous comment.

The stack trace is just the opposite, seems that the call to encode in PGog::_write_log_and_missing is doing something wrong...

#11 Updated by Neha Ojha 11 months ago

Like the other case reported in the mailing list ([ceph-users] OSDs use 200GB RAM and crash) and https://tracker.ceph.com/issues/51609, do you see crashes in _write_log_and_missing during split_pgs? If so, can you provide debug logs from that osd with debug_osd=20,debug_ms=1, including the output "ceph pg dump" and "ceph daemon osd.x dump_mempools"?

#12 Updated by Gonzalo Aguilar Delgado 11 months ago

Hi,

The logs I've already provided had:
--debug_osd 90 --debug_mon 2 --debug_filestore 7 --debug_monc 99 --debug_ms 5 --debug_auth 5

I've restricted last run to 8gb because don't want to hurt other running osds...

#13 Updated by Gonzalo Aguilar Delgado 11 months ago

Neha Ojha wrote:

Like the other case reported in the mailing list ([ceph-users] OSDs use 200GB RAM and crash) and https://tracker.ceph.com/issues/51609, do you see crashes in _write_log_and_missing during split_pgs? If so, can you provide debug logs from that osd with debug_osd=20,debug_ms=1, including the output "ceph pg dump" and "ceph daemon osd.x dump_mempools"?

Did you see the Valgrind output? Is just that the problem... You have the calltrace if you want to investigate with the memory consumption along time.

#14 Updated by Gonzalo Aguilar Delgado 11 months ago

Neha Ojha wrote:

Like the other case reported in the mailing list ([ceph-users] OSDs use 200GB RAM and crash) and https://tracker.ceph.com/issues/51609, do you see crashes in _write_log_and_missing during split_pgs? If so, can you provide debug logs from that osd with debug_osd=20,debug_ms=1, including the output "ceph pg dump" and "ceph daemon osd.x dump_mempools"?

Attach the log again

#15 Updated by Neha Ojha 11 months ago

Gonzalo Aguilar Delgado wrote:

Hi,

The logs I've already provided had:
--debug_osd 90 --debug_mon 2 --debug_filestore 7 --debug_monc 99 --debug_ms 5 --debug_auth 5

This did not have debug_osd=20.

I've restricted last run to 8gb because don't want to hurt other running osds...

#16 Updated by Neha Ojha 11 months ago

Gonzalo Aguilar Delgado wrote:

Neha Ojha wrote:

Like the other case reported in the mailing list ([ceph-users] OSDs use 200GB RAM and crash) and https://tracker.ceph.com/issues/51609, do you see crashes in _write_log_and_missing during split_pgs? If so, can you provide debug logs from that osd with debug_osd=20,debug_ms=1, including the output "ceph pg dump" and "ceph daemon osd.x dump_mempools"?

Did you see the Valgrind output? Is just that the problem... You have the calltrace if you want to investigate with the memory consumption along time.

Yes, we have seen it, however we would like to understand what the state of the PGs are and what is leading to the memory usage in _write_log_and_missing. The latest logs have the desired log level, thanks! let us get back to you.

#18 Updated by Gonzalo Aguilar Delgado 11 months ago

Igor Fedotov wrote:

One more case:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/FQXV452YLHBJW6Y2UK7WUZP7HO5PVIA5/

There's a difference. Our OSD's don't die. They eat up ram forever, and they abort because heartbeat or OOM. But there's no evident software error there... Apart of the bug we want to catch.

#19 Updated by Gonzalo Aguilar Delgado 11 months ago

Neha Ojha wrote:

Gonzalo Aguilar Delgado wrote:

Neha Ojha wrote:

Like the other case reported in the mailing list ([ceph-users] OSDs use 200GB RAM and crash) and https://tracker.ceph.com/issues/51609, do you see crashes in _write_log_and_missing during split_pgs? If so, can you provide debug logs from that osd with debug_osd=20,debug_ms=1, including the output "ceph pg dump" and "ceph daemon osd.x dump_mempools"?

Did you see the Valgrind output? Is just that the problem... You have the calltrace if you want to investigate with the memory consumption along time.

Yes, we have seen it, however we would like to understand what the state of the PGs are and what is leading to the memory usage in _write_log_and_missing. The latest logs have the desired log level, thanks! let us get back to you.

I can post one of the pgs... Have several. Just tell me if you need it.

#20 Updated by Gonzalo Aguilar Delgado 11 months ago

Do you need something else to find a workaround or the full solution?

Is there anything I can do?

#21 Updated by Neha Ojha 11 months ago

  • Priority changed from High to Urgent

#22 Updated by Gonzalo Aguilar Delgado 11 months ago

How can I help to accelerate a bugfix or workaround?

If comment your investigations I can builld a docker image to help debug the problem. Or even debug it... But I don't know where to start...

#23 Updated by Mark Nelson 10 months ago

Hi Gonzalo,

I'm not an expert regarding this code so please take my reply here with a grain of salt (and others please feel free to jump in and correct any mistakes). Neha, Josh, and I investigated this issue for several hours the week before last. Unfortunately with the quincy freeze everything is a little time-crunched right now, but I think we generally agreed that the lack of rate limiting for PG spliting may be the culprit here. Based on the massif output provided we saw high memory usage via pg_log_dup_t::encode which seems to be coming from this block of code:

https://github.com/ceph/ceph/blob/master/src/osd/PGLog.cc#L859-L875

basically we are creating bufferlists, encoding into them, and then moving them into a map of bufferlists where it disappears for queuing and eventual processing as part of a transaction. The thought right now is that we don't do anything to prevent or slow that down so we could end up with tons of queued data sitting around in memory before it is eventually freed somewhere else. We aren't 100% sure that's the full extent of it, but this is our theory right now. To effect a fix we'll need to figure out how to best limit how often this code is called (there may be different approaches). We're planning on meeting again later this week to discuss further.

#24 Updated by Mark Nelson 10 months ago

In the mean time, Neha mentioned that you might be able to prevent the pgs from splitting by turning off the autoscaler:

ceph osd pool set foo pg_autoscale_mode off

Right now you have to do that for each pool, but you'll be able to do it globally when this PR merges:

https://github.com/ceph/ceph/pull/44540

#25 Updated by Gonzalo Aguilar Delgado 10 months ago

Mark Nelson wrote:

Hi Gonzalo,

I'm not an expert regarding this code so please take my reply here with a grain of salt (and others please feel free to jump in and correct any mistakes). Neha, Josh, and I investigated this issue for several hours the week before last. Unfortunately with the quincy freeze everything is a little time-crunched right now, but I think we generally agreed that the lack of rate limiting for PG spliting may be the culprit here. Based on the massif output provided we saw high memory usage via pg_log_dup_t::encode which seems to be coming from this block of code:

https://github.com/ceph/ceph/blob/master/src/osd/PGLog.cc#L859-L875

basically we are creating bufferlists, encoding into them, and then moving them into a map of bufferlists where it disappears for queuing and eventual processing as part of a transaction. The thought right now is that we don't do anything to prevent or slow that down so we could end up with tons of queued data sitting around in memory before it is eventually freed somewhere else. We aren't 100% sure that's the full extent of it, but this is our theory right now. To effect a fix we'll need to figure out how to best limit how often this code is called (there may be different approaches). We're planning on meeting again later this week to discuss further.

Hi Mark,

Looks like a promising path. Indeed it seems the structure is filled of data and filling goes faster than freeing. I will try to give it a shoot this weekend to see if I can support you in any way.

Thank you for feedback.

#26 Updated by Gonzalo Aguilar Delgado 10 months ago

Mark Nelson wrote:

In the mean time, Neha mentioned that you might be able to prevent the pgs from splitting by turning off the autoscaler:

ceph osd pool set foo pg_autoscale_mode off

Right now you have to do that for each pool, but you'll be able to do it globally when this PR merges:

https://github.com/ceph/ceph/pull/44540

Hi,

Nothing a script can't do:

ceph osd pool ls | xargs -n1 -istr ceph osd pool set str pg_autoscale_mode off

But anyway it doesn't improve the situation. The OSD is starving the node and finally dies... So sadly, no improvement observed after disabling autoscaling...

#27 Updated by Mark Nelson 10 months ago

Gonzalo Aguilar Delgado wrote:

Hi,

Nothing a script can't do:

ceph osd pool ls | xargs -n1 -istr ceph osd pool set str pg_autoscale_mode off

But anyway it doesn't improve the situation. The OSD is starving the node and finally dies... So sadly, no improvement observed after disabling autoscaling...

Did you continue to observe attempts to split PGs? I don't know much about the autoscaling code, but perhaps there is some permanence here that disabling the autoscaling doesn't prevent (ie perhaps existing splits still want to proceed even after disabling autoscaling).

#28 Updated by Neha Ojha 10 months ago

  • Status changed from New to Need More Info

#29 Updated by Guillaume Fenollar 9 months ago

Hi everyone,
I'm having this issue as well for several weeks. Something situations stabilizes by themselves, sometimes not. I face it on two OSD which take dozens of GB of RAM on startup (during recovery) and crashe on OoM in a loop. This situation was (unless it's only coincidence) following a autoscaling operation on a pool right after a massive clean (40% of used data deleted in less than 12 hours). Autoscaling was seen in "ceph status" and less than one hour after (in middle of the night, no other activity on disks), two OSDs began to crash.
I'm using vanilla Ceph (no rook here) on Kubernetes. I saw here and there that some parameters could help recovery so I tried following parameters, with no visible effect (verified this was taken into account with ceph daemon osd.9 config get xxx):

osd_min_pg_log_entries=500
osd_max_pg_log_entries=500
bluefs_buffered_io=false

Since it's crashing on a recovery step only and that RAM usage on normal condition is between 4Gi and 7Gi, I tried adding swap to see if I can pass through it, so I added 24GB of swap and it ended on OoM too :

[8573944.896426] Out of memory: Killed process 1035418 (ceph-osd) total-vm:51048328kB, anon-rss:28411524kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:98792kB oom_score_adj:745
[8573951.556912] oom_reaper: reaped process 1035418 (ceph-osd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Here's a dump_mempool result on this OSD seconds before it gets killed. But the total.bytes value is roughly 50% of total RSS process value at the same time, I find it kind of weird but I guess it's normal :

{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 4672028,
                "bytes": 97828432
            },
            "bluestore_cache_data": {
                "items": 565,
                "bytes": 2118119
            },
            "bluestore_cache_onode": {
                "items": 395,
                "bytes": 243320
            },
            "bluestore_cache_meta": {
                "items": 18080,
                "bytes": 101032
            },
            "bluestore_cache_other": {
                "items": 808,
                "bytes": 25704
            },
            "bluestore_Buffer": {
                "items": 246,
                "bytes": 23616
            },
            "bluestore_Extent": {
                "items": 247,
                "bytes": 11856
            },
            "bluestore_Blob": {
                "items": 247,
                "bytes": 25688
            },
            "bluestore_SharedBlob": {
                "items": 247,
                "bytes": 27664
            },
            "bluestore_inline_bl": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing_deferred": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing": {
                "items": 0,
                "bytes": 0
            },
            "bluefs": {
                "items": 4464,
                "bytes": 95720
            },
            "bluefs_file_reader": {
                "items": 302,
                "bytes": 25996160
            },
            "bluefs_file_writer": {
                "items": 3,
                "bytes": 672
            },
            "buffer_anon": {
                "items": 3933478,
                "bytes": 15735259905
            },
            "buffer_meta": {
                "items": 328,
                "bytes": 28864
            },
            "osd": {
                "items": 146,
                "bytes": 1888656
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 57648622,
                "bytes": 6087215888
            },
            "osdmap": {
                "items": 9969,
                "bytes": 481056
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 66290175,
            "bytes": 21951372352
        }
    }
}

Here the machine configuration :

CPU : 40x E5-2630 v4 @ 2.20GHz
MEM : 32Gi
Debian 10 with Linux 5.10.0-0.bpo.9-amd64

ceph status

  cluster:
    id:     6d981e8d-e93a-42f3-aee3-6862eb2faea8
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests

  services:
    mon: 3 daemons, quorum ceph-4,ceph-5,ceph-3 (age 20h)
    mgr: b(active, since 34h), standbys: a
    mds: default:1 {0=a=up:active} 1 up:standby
    osd: 6 osds: 5 up (since 19h), 5 in (since 23h)

  data:
    pools:   5 pools, 209 pgs
    objects: 1.77M objects, 3.0 TiB
    usage:   8.5 TiB used, 9.7 TiB / 18 TiB avail
    pgs:     209 active+clean

  io:
    client:   3.2 MiB/s rd, 2.1 MiB/s wr, 121 op/s rd, 98 op/s wr

Ceph.conf

[global]
fsid = 6d981e8d-e93a-42f3-aee3-6862eb2faea8
mon_host = [v2:10.100.0.5:3300/0,v1:10.100.0.5:6789/0] [v2:10.100.0.10:3300/0,v1:10.100.0.10:6789/0] [v2:10.100.0.11:3300/0,v1:10.100.0.11:6789/0] 
public_network = 10.100.0.0/16

mon_max_pg_per_osd = 500 # useless on this cluster as we don't have that much PGs
osd_max_pg_per_osd_hard_ratio = 5

ceph df

--- RAW STORAGE ---
CLASS  SIZE    AVAIL    USED     RAW USED  %RAW USED
nvme   18 TiB  9.7 TiB  8.4 TiB   8.5 TiB      46.65
TOTAL  18 TiB  9.7 TiB  8.4 TiB   8.5 TiB      46.65

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
volumes                17  128  2.0 TiB  557.21k  6.0 TiB  61.33    1.3 TiB
cephfs_data            50   64  1.3 TiB  884.82k  4.0 TiB  51.29    1.3 TiB
cephfs_metadata        51    8  772 MiB  329.96k  2.3 GiB   0.06    1.3 TiB
device_health_metrics  55    1   10 MiB        7   30 MiB      0    1.3 TiB
vms                    63    8      0 B        0      0 B      0    1.3 TiB

ceph osd df

ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME      
 -1         21.81943         -   18 TiB  8.5 TiB  8.4 TiB  1.6 GiB   36 GiB  9.7 TiB  46.65  1.00    -          root default   
 -4          7.27278         -  3.6 TiB  2.8 TiB  2.8 TiB  476 MiB  8.5 GiB  833 GiB  77.62  1.66    -              host ceph-3
  1   nvme   3.63379   1.00000  3.6 TiB  2.8 TiB  2.8 TiB  476 MiB  8.5 GiB  833 GiB  77.62  1.66  209      up          osd.1  
  9   nvme   3.63899         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.9  
 -9          7.27768         -  7.3 TiB  2.8 TiB  2.8 TiB  588 MiB   11 GiB  4.4 TiB  38.85  0.83    -              host ceph-4
  4   nvme   3.63869   1.00000  3.6 TiB  1.3 TiB  1.3 TiB  184 MiB  3.0 GiB  2.4 TiB  34.97  0.75   96      up          osd.4  
 12   nvme   3.63899   1.00000  3.6 TiB  1.6 TiB  1.5 TiB  403 MiB  7.7 GiB  2.1 TiB  42.73  0.92  113      up          osd.12 
-16          7.26897         -  7.3 TiB  2.8 TiB  2.8 TiB  528 MiB   17 GiB  4.4 TiB  38.98  0.84    -              host ceph-5
  0   nvme   3.63898   1.00000  3.6 TiB  1.6 TiB  1.6 TiB  299 MiB  9.2 GiB  2.1 TiB  42.92  0.92  113      up          osd.0  
 11   nvme   3.62999   1.00000  3.6 TiB  1.3 TiB  1.3 TiB  229 MiB  8.0 GiB  2.4 TiB  35.05  0.75   96      up          osd.11 
                         TOTAL   18 TiB  8.5 TiB  8.4 TiB  1.6 GiB   36 GiB  9.7 TiB  46.65                                    
MIN/MAX VAR: 0.75/1.66  STDDEV: 15.87

rados df

POOL_NAME                USED OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED     RD_OPS      RD      WR_OPS      WR USED COMPR UNDER COMPR 
cephfs_data           4.0 TiB  884824      0 2654472                  0       0        0    7430961 1.3 TiB     4061453 654 GiB        0 B         0 B 
cephfs_metadata       1.1 GiB  329964      0  989892                  0       0        0     629361 1.8 GiB    12463605  55 GiB        0 B         0 B 
device_health_metrics 5.1 MiB       7      0      21                  0       0        0       2518 8.0 MiB        3839  14 MiB        0 B         0 B 
vms                       0 B       0      0       0                  0       0        0          0     0 B           0     0 B        0 B         0 B 
volumes               6.0 TiB  557213      0 1671639                  0       0        0 2861847949 264 TiB 20752716004 1.6 PiB        0 B         0 B 

total_objects    1772008
total_used       8.5 TiB
total_avail      9.7 TiB
total_space      18 TiB

Only way I found to make the OSD start again is to zap/recreate it. It's really time and ressource consuming so I'd like to avoid doing that. At the moment, I still keep a failing OSD so if there's a way I can help you troubleshoot it, please tell me :)

#30 Updated by Guillaume Fenollar 9 months ago

BTW I'm using Ceph 15.2.16

#31 Updated by Mykola Golub 9 months ago

We seem to observe a similar issue (16.2.7). On a pool with autoscale disabled pg num was changed from 256 to 1024. After this the osd started to consume a lot of memory and was OOM killed.
And now when trying to start an osd, after loading pgs it seems to start doing pg split and eating all memory.

Is there a way (may be with some patch?) to avoid this pg split? Or any other workaround?

#32 Updated by Neha Ojha 9 months ago

Mykola Golub wrote:

We seem to observe a similar issue (16.2.7). On a pool with autoscale disabled pg num was changed from 256 to 1024. After this the osd started to consume a lot of memory and was OOM killed.

Can you share the output of "ceph osd dump"? I suspect that though you may have disabled the autoscaler, the pg num target has not been met yet. You can also set target_max_misplaced_ratio to a negative value to immediately stop any further changes to pg num.

And now when trying to start an osd, after loading pgs it seems to start doing pg split and eating all memory.

Is there a way (may be with some patch?) to avoid this pg split? Or any other workaround?

PGs should no longer split after pg num/target pg num reaches steady state, once the autoscaler has been turned off,

#33 Updated by Neha Ojha 9 months ago

Can anyone provide osd logs with debug_osd=20,debug_ms=1 for OSDs that are hitting OOM?

#34 Updated by Mykola Golub 9 months ago

Neha Ojha wrote:

Can anyone provide osd logs with debug_osd=20,debug_ms=1 for OSDs that are hitting OOM?

I uploaded the log to teuthology frontend in my home directory: /home/trociny/ceph-osd.7.log.gz

After

2022-03-09T19:16:21.422+0000 7f8dd1b4f080  0 osd.7 9085 load_pgs opened 164 pgs

we see
2022-03-09T19:16:21.422+0000 7f8dd1b4f080 20 osd.7 9085 identify_splits_and_merges 2.378 e9085 to e9085 pg_nums {7938=256,9044=1024}
2022-03-09T19:16:21.422+0000 7f8dd1b4f080 20 osd.7 9085 identify_splits_and_merges 2.3f8 e9045 to e9085 pg_nums {7938=256,9044=1024}
2022-03-09T19:16:21.422+0000 7f8dd1b4f080 20 osd.7 9085 identify_splits_and_merges 2.3f0 e9085 to e9085 pg_nums {7938=256,9044=1024}
2022-03-09T19:16:21.422+0000 7f8dd1b4f080 20 osd.7 9085 identify_splits_and_merges 2.278 e9085 to e9085 pg_nums {7938=256,9044=1024}
2022-03-09T19:16:21.422+0000 7f8dd1b4f080 20 osd.7 9085 identify_splits_and_merges 2.290 e9044 to e9085 pg_nums {7938=256,9044=1024}
2022-03-09T19:16:21.422+0000 7f8dd1b4f080 20 osd.7 9085 identify_splits_and_merges 2.178 e9085 to e9085 pg_nums {7938=256,9044=1024}

And here where fast memory grows starts. This time I was not waiting for the osd to be oom killed, stopped when it startes swapping.

#35 Updated by Mykola Golub 9 months ago

Neha Ojha wrote:

Can you share the output of "ceph osd dump"? I suspect that though you may have disabled the autoscaler, the pg num target has not been met yet. You can also set target_max_misplaced_ratio to a negative value to immediately stop any further changes to pg num.

root@ceph6:~# ceph osd dump
epoch 9245
fsid b57907d0-eeea-11eb-b1d1-6b548c3b0e9c
created 2021-07-27T14:56:25.191938+0000
modified 2022-03-09T19:16:22.350691+0000
flags noout,noscrub,nodeep-scrub,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 402
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release pacific
stretch_mode_enabled false
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 8990 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'ssd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 256 pg_num_target 256 autoscale_mode off last_change 9231 lfor 0/7938/9044 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 9 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 5070 lfor 0/5070/5068 flags hashpspool stripe_width 0 application rgw
pool 10 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 4915 lfor 0/4915/4913 flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 5082 lfor 0/5082/5080 flags hashpspool stripe_width 0 application rgw
pool 12 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 5098 lfor 0/5098/5096 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
max_osd 15
osd.0 down out weight 0 up_from 8965 up_thru 9044 down_at 9051 last_clean_interval [8830,8957) [v2:10.82.12.13:6805/3377940602,v1:10.82.12.13:6807/3377940602] [v2:10.82.13.13:6806/3377940602,v1:10.82.13.13:6807/3377940602] autoout,exists c12bbaa1-249e-4e92-81ee-c168653608fc
osd.1 down out weight 0 up_from 8950 up_thru 9044 down_at 9052 last_clean_interval [8789,8937) [v2:10.82.12.12:6802/2776047067,v1:10.82.12.12:6804/2776047067] [v2:10.82.13.12:6803/2776047067,v1:10.82.13.12:6805/2776047067] autoout,exists 41154034-1e49-45fd-8d40-67c07de9df63
osd.2 down out weight 0 up_from 8988 up_thru 9044 down_at 9047 last_clean_interval [8883,8971) [v2:10.82.12.14:6816/3193772986,v1:10.82.12.14:6817/3193772986] [v2:10.82.13.14:6816/3193772986,v1:10.82.13.14:6817/3193772986] autoout,exists 2c1cd6b5-4b49-49be-9ac6-9416adade39b
osd.3 down in  weight 1 up_from 8961 up_thru 9044 down_at 9214 last_clean_interval [8821,8957) [v2:10.82.12.13:6800/3541614691,v1:10.82.12.13:6801/3541614691] [v2:10.82.13.13:6800/3541614691,v1:10.82.13.13:6801/3541614691] exists 395f4f8f-b049-47bf-acc5-960c8a72ace6
osd.4 down out weight 0 up_from 8947 up_thru 9044 down_at 9050 last_clean_interval [8784,8937) [v2:10.82.12.12:6812/1186063140,v1:10.82.12.12:6813/1186063140] [v2:10.82.13.12:6812/1186063140,v1:10.82.13.12:6813/1186063140] autoout,exists 413bd1e9-a289-4793-a987-3e067c068842
osd.5 down in  weight 1 up_from 8975 up_thru 9044 down_at 9053 last_clean_interval [8865,8971) [v2:10.82.12.14:6801/1377249007,v1:10.82.12.14:6803/1377249007] [v2:10.82.13.14:6802/1377249007,v1:10.82.13.14:6803/1377249007] exists f055fdb6-5aec-4bc0-a643-ed8b08b4f1dd
osd.6 down in  weight 1 up_from 8980 up_thru 9044 down_at 9049 last_clean_interval [8864,8971) [v2:10.82.12.14:6812/1029481558,v1:10.82.12.14:6813/1029481558] [v2:10.82.13.14:6812/1029481558,v1:10.82.13.14:6813/1029481558] exists 40076794-4a59-4156-9236-aa009606c1b7
osd.7 down in  weight 1 up_from 8955 up_thru 9044 down_at 9046 last_clean_interval [8806,8937) [v2:10.82.12.12:6800/4073071257,v1:10.82.12.12:6801/4073071257] [v2:10.82.13.12:6800/4073071257,v1:10.82.13.12:6801/4073071257] exists c768e776-aa51-43ed-b804-7c03b4583aa1
osd.8 down in  weight 1 up_from 8970 up_thru 9044 down_at 9047 last_clean_interval [8847,8957) [v2:10.82.12.13:6812/3651907982,v1:10.82.12.13:6813/3651907982] [v2:10.82.13.13:6812/3651907982,v1:10.82.13.13:6813/3651907982] exists 3d933460-d2f2-4215-8cef-8cdc825992f3
osd.9 down in  weight 1 up_from 8943 up_thru 9044 down_at 9048 last_clean_interval [8777,8937) [v2:10.82.12.12:6808/393928663,v1:10.82.12.12:6809/393928663] [v2:10.82.13.12:6808/393928663,v1:10.82.13.12:6809/393928663] exists 44f57041-5d4b-4ecd-828a-19f6964f6adb
osd.10 down in  weight 1 up_from 8965 up_thru 9044 down_at 9048 last_clean_interval [8825,8957) [v2:10.82.12.13:6804/3004065658,v1:10.82.12.13:6806/3004065658] [v2:10.82.13.13:6804/3004065658,v1:10.82.13.13:6805/3004065658] exists 100ca34e-522f-42d3-9213-355c1a995149
osd.11 down in  weight 1 up_from 8979 up_thru 9044 down_at 9051 last_clean_interval [8859,8971) [v2:10.82.12.14:6800/3864882448,v1:10.82.12.14:6802/3864882448] [v2:10.82.13.14:6800/3864882448,v1:10.82.13.14:6801/3864882448] exists 13a699b3-ac86-4240-8fe0-292a1873c6a2
osd.12 down in  weight 1 up_from 8951 up_thru 9044 down_at 9213 last_clean_interval [8797,8937) [v2:10.82.12.12:6816/355852202,v1:10.82.12.12:6817/355852202] [v2:10.82.13.12:6816/355852202,v1:10.82.13.12:6817/355852202] exists 59052c28-fae4-4d21-8fae-e44dadc35d26
osd.13 down in  weight 1 up_from 8965 up_thru 9044 down_at 9215 last_clean_interval [8840,8957) [v2:10.82.12.13:6814/2097888550,v1:10.82.12.13:6815/2097888550] [v2:10.82.13.13:6814/2097888550,v1:10.82.13.13:6815/2097888550] exists 41090596-c993-4b96-b58c-df6631e3d05b
osd.14 down in  weight 1 up_from 8984 up_thru 9044 down_at 9216 last_clean_interval [8874,8971) [v2:10.82.12.14:6804/90195589,v1:10.82.12.14:6807/90195589] [v2:10.82.13.14:6805/90195589,v1:10.82.13.14:6809/90195589] exists 6ec4028b-bbec-4484-a957-109fa765494f
pg_upmap_items 2.84 [13,3]
pg_upmap_items 2.f6 [13,3]
pg_upmap_items 2.fb [13,3]

The history is that cephfs pools were deleted on this cluster. Then it was decided to increase pg_num for the ssd pool. It was changed to 1024. Some time later after this the issue was noticed. Starting any osd did not help. Late the pg_num was returned to 256 (that is why we have now pg_num 1024 pgp_num 256 pg_num_target 256) hoping it would help to avoid pg split, but it did not help.

#36 Updated by Guillaume Fenollar 9 months ago

Neha Ojha wrote:

Can anyone provide osd logs with debug_osd=20,debug_ms=1 for OSDs that are hitting OOM?

I just started a stuck OSD with the options you mentioned. No swap on the system so it could crash "quickly" to avoid millions of lines of logs. I ended up with a 184k lines' log. You'll find it attached in bzip2 format splited to several files. Just in case, following command will restore the original log file :

cat logs.* | bunzip2 > logs

Good luck with the debug, I keep the OSD in the failing state in case you want me to test something

#37 Updated by Mykola Golub 9 months ago

Mykola Golub wrote:

pool 2 'ssd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 256 pg_num_target 256 autoscale_mode off last_change 9231 lfor 0/7938/9044 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

Still thinking about workaround, how to disable this splitting process. The previous suggestions (autoscale off or negative target_max_misplaced_ratio) did not work.

As you may see from the osd dump output, after trying to return the old pg_num value with `ceph osd pool set ssd pg_num 256` command we still have pg_num 1024 (and pg_num_target 256). Do you think it might help if we forced pg_num to 256? There seem to be "pg_num_actual" param that could be used to force setting pg_num (bypassing pg_num_target) but it does not work:

root@ceph8:~# ceph osd pool set ssd pg_num_actual 256
Invalid command: pg_num_actual not in size|min_size|pg_num|pgp_num|pgp_num_actual|crush_rule|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority|compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size|csum_type|csum_min_block|csum_max_block|allow_ec_overwrites|fingerprint_algorithm|pg_autoscale_mode|pg_autoscale_bias|pg_num_min|target_size_bytes|target_size_ratio|dedup_tier|dedup_chunk_algorithm|dedup_cdc_chunk_size

It looks like a bug in MonCommands.h [1] and pg_num_actual is just missing in the list of allowed values.

Do you think it makes sense to try to patch the ceph-mon and force the pg_num? I am asking before trying before patching is troublesome, as it will need to rebuild the container image.

https://github.com/ceph/ceph/blob/40610e8fb7b7bea24bddde6d50c8a65f7a52d701/src/mon/MonCommands.h#L1093

#38 Updated by Dan van der Ster 9 months ago

Guillaume Fenollar wrote:

Neha Ojha wrote:

Can anyone provide osd logs with debug_osd=20,debug_ms=1 for OSDs that are hitting OOM?

I just started a stuck OSD with the options you mentioned. No swap on the system so it could crash "quickly" to avoid millions of lines of logs. I ended up with a 184k lines' log. You'll find it attached in bzip2 format splited to several files. Just in case, following command will restore the original log file :
[...]

Good luck with the debug, I keep the OSD in the failing state in case you want me to test something

Guillaume can you tell approx or exactly when in the logs that the memory usage explodes? Can you somehow annotate the usage over time in the log?

#39 Updated by Dan van der Ster 9 months ago

Can you somehow annotate the usage over time in the log?

Could you please also set debug_prioritycache=5 -- this will add a debug line with memory usage every second.
And also try osd_memory_cache_resize_interval=0.2 -- this will increase the frequency and perhaps make things even more clear.

#40 Updated by Dan van der Ster 9 months ago

  • Related to Bug #54296: OSDs using too much memory added

#41 Updated by Dan van der Ster 9 months ago

  • Related to Feature #54115: Log pglog entry size in OSD log if it exceeds certain size limit added

#42 Updated by Dan van der Ster 9 months ago

  • Related to Bug #52261: OSD takes all memory and crashes, after pg_num increase added

#43 Updated by Dan van der Ster 9 months ago

  • Related to Bug #51626: OSD uses all host memory (80g) on startup due to pg_split added

#44 Updated by Neha Ojha 9 months ago

Mykola Golub wrote:

Mykola Golub wrote:

pool 2 'ssd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 256 pg_num_target 256 autoscale_mode off last_change 9231 lfor 0/7938/9044 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

Still thinking about workaround, how to disable this splitting process. The previous suggestions (autoscale off or negative target_max_misplaced_ratio) did not work.

These would only work if the autoscaler was scaling the PGs.

As you may see from the osd dump output, after trying to return the old pg_num value with `ceph osd pool set ssd pg_num 256` command we still have pg_num 1024 (and pg_num_target 256). Do you think it might help if we forced pg_num to 256? There seem to be "pg_num_actual" param that could be used to force setting pg_num (bypassing pg_num_target) but it does not work:
[...]

It looks like a bug in MonCommands.h [1] and pg_num_actual is just missing in the list of allowed values.

Do you think it makes sense to try to patch the ceph-mon and force the pg_num? I am asking before trying before patching is troublesome, as it will need to rebuild the container image.

Looking at the history, this seems intentional, https://github.com/ceph/ceph/commit/4dd62dff4c5da4de3d707300ebff32eb8f85f1c9 explains why.

https://github.com/ceph/ceph/blob/40610e8fb7b7bea24bddde6d50c8a65f7a52d701/src/mon/MonCommands.h#L1093

I am not seeing any throttles around controlling pg splitting :( but we need to still to look at the logs to see what's causing the memory growth.

#45 Updated by Guillaume Fenollar 9 months ago

Dan van der Ster wrote:

Can you somehow annotate the usage over time in the log?

Could you please also set debug_prioritycache=5 -- this will add a debug line with memory usage every second.
And also try osd_memory_cache_resize_interval=0.2 -- this will increase the frequency and perhaps make things even more clear.

If I apply this configuration, I only get roughly a hundred lines of logs with only memory usage and nothing else. The other problem I have now is that I added timestamp and memory usage by myself in the log but I consider memory usage "explodes" although no other line of log have been written yet (note that I run binary with parameters --debug_osd 20 --debug_ms 1).
The lines appear to flush into logs when the main process is being OoMed, so I'm not sure it can be of any help. Logs are then full of "ObjectCleanRegions clean_offsets: [0~2834432,2883584~18446744073706668031], clean_omap: 1, new_object: 0" for instance.

So basically I just print in the logs once per second the RSS value of process, and it gives :

13:56:44 VmRSS:     1451648 kB
13:56:45 VmRSS:     1535320 kB
13:56:46 VmRSS:     1615036 kB
13:57:17 VmRSS:     4239124 kB
13:57:18 VmRSS:     4311888 kB
13:57:19 VmRSS:     4396396 kB
13:57:20 VmRSS:     4492472 kB
13:57:21 VmRSS:     4576412 kB
13:57:22 VmRSS:     4497972 kB
13:57:23 VmRSS:     4527944 kB
13:57:24 VmRSS:     4568116 kB
13:57:25 VmRSS:     4609732 kB
13:57:26 VmRSS:     4652232 kB
13:57:27 VmRSS:     4313628 kB
13:57:28 VmRSS:     4357548 kB
13:57:29 VmRSS:     4405764 kB
13:57:30 VmRSS:     4455440 kB
13:57:31 VmRSS:     4504956 kB
13:57:32 VmRSS:     4318140 kB
13:57:33 VmRSS:     4361380 kB
13:57:34 VmRSS:     4405624 kB
13:57:35 VmRSS:     4449592 kB
13:57:36 VmRSS:     4497748 kB
13:57:37 VmRSS:     4544444 kB
13:57:38 VmRSS:     4409140 kB
13:57:39 VmRSS:     4456968 kB
13:57:40 VmRSS:     4508624 kB
13:57:41 VmRSS:     4560516 kB
13:57:42 VmRSS:     4816460 kB
13:57:43 VmRSS:     5691652 kB
13:57:44 VmRSS:     6531160 kB
13:57:45 VmRSS:     7244576 kB
13:57:46 VmRSS:     7804844 kB
13:57:47 VmRSS:     8025284 kB
13:57:48 VmRSS:     7287588 kB
13:57:49 VmRSS:     7519104 kB
13:57:50 VmRSS:     7560144 kB
13:57:51 VmRSS:     7667816 kB
13:57:52 VmRSS:     7838688 kB
13:57:53 VmRSS:     8171496 kB
13:57:54 VmRSS:     8171496 kB
13:57:55 VmRSS:     8171496 kB
13:57:56 VmRSS:     8256980 kB
13:57:57 VmRSS:     8281440 kB
13:57:58 VmRSS:     8425916 kB
13:57:59 VmRSS:     8526752 kB
13:58:00 VmRSS:     8656732 kB
13:58:01 VmRSS:     8787820 kB
13:58:02 VmRSS:     9004496 kB
13:58:03 VmRSS:     9776176 kB
13:58:04 VmRSS:    10567808 kB
13:58:05 VmRSS:    11357568 kB
13:58:06 VmRSS:    12087468 kB
13:58:07 VmRSS:    12194916 kB
13:58:08 VmRSS:    12357540 kB
13:58:09 VmRSS:    12394236 kB
13:58:10 VmRSS:    12530724 kB
13:58:11 VmRSS:    12689736 kB
13:58:12 VmRSS:    12701172 kB
13:58:13 VmRSS:    12870132 kB
13:58:14 VmRSS:    13015860 kB
13:58:15 VmRSS:    13365624 kB
13:58:16 VmRSS:    13365624 kB
13:58:17 VmRSS:    13365624 kB
13:58:18 VmRSS:    13404828 kB
13:58:19 VmRSS:    13584876 kB
13:58:20 VmRSS:    13738260 kB
13:58:21 VmRSS:    13874220 kB
13:58:22 VmRSS:    14003052 kB
13:58:23 VmRSS:    14076316 kB
13:58:24 VmRSS:    14076316 kB
13:58:25 VmRSS:    14076316 kB
13:58:26 VmRSS:    14076316 kB
13:58:27 VmRSS:    14076316 kB
13:58:28 VmRSS:    14073716 kB
13:58:29 VmRSS:    11772916 kB
13:58:30 VmRSS:     9941460 kB
2022-03-11T13:58:31.344645178+11:00 stderr F 2022-03-11T02:58:31.343+0000 7f241800ff00 10 read_log_and_missing done
13:58:31 VmRSS:     8802356 kB
13:58:32 VmRSS:     8648468 kB
2022-03-11T13:58:33.580776774+11:00 stderr F 2022-03-11T02:58:33.578+0000 7f241800ff00 10 osd.9 pg_epoch: 460744 pg[17.5e( v 460741'483110202 (460503'483107764,460741'483110202] local-lis/les=460743/460744 n=4212 ec=143206/213 lis/c=460743/460743 les/c/f=460744/460744/103371 sis=460743) [4,11,9] r=2 lpr=0 crt=460741'483110202 lcod 0'0 mlcod 0'0 unknown mbc={}] handle_initialize
2022-03-11T13:58:33.580835935+11:00 stderr F 2022-03-11T02:58:33.578+0000 7f241800ff00  5 osd.9 pg_epoch: 460744 pg[17.5e( v 460741'483110202 (460503'483107764,460741'483110202] local-lis/les=460743/460744 n=4212 ec=143206/213 lis/c=460743/460743 les/c/f=460744/460744/103371 sis=460743) [4,11,9] r=2 lpr=0 crt=460741'483110202 lcod 0'0 mlcod 0'0 unknown mbc={}] exit Initial 117.026168 0 0.000000
2022-03-11T13:58:33.580855747+11:00 stderr F 2022-03-11T02:58:33.578+0000 7f241800ff00  5 osd.9 pg_epoch: 460744 pg[17.5e( v 460741'483110202 (460503'483107764,460741'483110202] local-lis/les=460743/460744 n=4212 ec=143206/213 lis/c=460743/460743 les/c/f=460744/460744/103371 sis=460743) [4,11,9] r=2 lpr=0 crt=460741'483110202 lcod 0'0 mlcod 0'0 unknown mbc={}] enter Reset
(...50k lines of logs written in a couple of seconds...)

See that it reaches 14GB of RAM in 90 seconds approx and starts writing while crashing (too late). Can you advise about logs parameters I should use ? Should I remove "--debug_ms 1" from arguments ?

#46 Updated by Dan van der Ster 9 months ago

Guillaume Fenollar wrote:

See that it reaches 14GB of RAM in 90 seconds approx and starts writing while crashing (too late). Can you advise about logs parameters I should use ? Should I remove "--debug_ms 1" from arguments ?

Perhaps your VmRss patch is breaking the ceph log flushing?

Could you revert that and try running

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5

#47 Updated by Guillaume Fenollar 9 months ago

Dan van der Ster wrote:

Could you revert that and try running

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5

Ok, here my new test attached. I added 12G of swap so we could have as much as information as necessary. RAM usage then is growing steadily. It was kind of released minutes after a fresh start then it never stopped growing until OoM with 28GB of RSS used by ceph-osd process. In normal conditions, usage is between 3G to 7G.

You can review my cluster's info above if needed (https://tracker.ceph.com/issues/53729#note-29)

#48 Updated by Dan van der Ster 9 months ago

Guillaume Fenollar wrote:

Dan van der Ster wrote:

Could you revert that and try running

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5

Ok, here my new test attached. I added 12G of swap so we could have as much as information as necessary. RAM usage then is growing steadily. It was kind of released minutes after a fresh start then it never stopped growing until OoM with 28GB of RSS used by ceph-osd process. In normal conditions, usage is between 3G to 7G.

You can review my cluster's info above if needed (https://tracker.ceph.com/issues/53729#note-29)

Thanks! Now we're getting somewhere.
It's interesting that the memory fluctuates so much while nothing is going on:

2022-03-11T07:01:37.926+0000 7f5d5fdab700 10 osd.9 461124 tick
2022-03-11T07:01:37.926+0000 7f5d5fdab700 10 osd.9 461124 do_waiters -- start
2022-03-11T07:01:37.926+0000 7f5d5fdab700 10 osd.9 461124 do_waiters -- finish
2022-03-11T07:01:37.926+0000 7f5d5fdab700 20 osd.9 461124 tick last_purged_snaps_scrub 2022-03-11T06:56:46.656591+0000 next 2022-03-12T06:56:46.656591+0000
2022-03-11T07:01:38.150+0000 7f5d5e537700 10 osd.9 461124 tick_without_osd_lock
2022-03-11T07:01:38.830+0000 7f5d57b24700  5 prioritycache tune_memory target: 4294967296 mapped: 22429286400 unmapped: 294912 heap: 22429581312 old mem: 134217728 new mem: 134217728
2022-03-11T07:01:38.922+0000 7f5d5fdab700 10 osd.9 461124 tick
2022-03-11T07:01:38.926+0000 7f5d5fdab700 10 osd.9 461124 do_waiters -- start
2022-03-11T07:01:38.926+0000 7f5d5fdab700 10 osd.9 461124 do_waiters -- finish
2022-03-11T07:01:38.926+0000 7f5d5fdab700 20 osd.9 461124 tick last_purged_snaps_scrub 2022-03-11T06:56:46.656591+0000 next 2022-03-12T06:56:46.656591+0000
2022-03-11T07:01:39.194+0000 7f5d5e537700 10 osd.9 461124 tick_without_osd_lock
2022-03-11T07:01:39.834+0000 7f5d57b24700  5 prioritycache tune_memory target: 4294967296 mapped: 23677075456 unmapped: 311296 heap: 23677386752 old mem: 134217728 new mem: 134217728
2022-03-11T07:01:39.878+0000 7f5d5fdab700 10 osd.9 461124 tick
2022-03-11T07:01:39.878+0000 7f5d5fdab700 10 osd.9 461124 do_waiters -- start
2022-03-11T07:01:39.878+0000 7f5d5fdab700 10 osd.9 461124 do_waiters -- finish
2022-03-11T07:01:39.878+0000 7f5d5fdab700 20 osd.9 461124 tick last_purged_snaps_scrub 2022-03-11T06:56:46.656591+0000 next 2022-03-12T06:56:46.656591+0000
2022-03-11T07:01:40.202+0000 7f5d5e537700 10 osd.9 461124 tick_without_osd_lock
2022-03-11T07:01:40.834+0000 7f5d57b24700  5 prioritycache tune_memory target: 4294967296 mapped: 24458362880 unmapped: 212992 heap: 24458575872 old mem: 134217728 new mem: 134217728
2022-03-11T07:01:40.894+0000 7f5d5fdab700 10 osd.9 461124 tick

So we'll need more debug levels. Can you please run again with

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5 --debug_bluestore=10

Also, could you share the entire log from when the OSD starts, so we can see the pglog reading?
The log is going to be large, so you probably need to use dropbox or something like that.

#49 Updated by Guillaume Fenollar 9 months ago

Dan van der Ster wrote:

Guillaume Fenollar wrote:

Dan van der Ster wrote:

Could you revert that and try running

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5

Ok, here my new test attached. I added 12G of swap so we could have as much as information as necessary. RAM usage then is growing steadily. It was kind of released minutes after a fresh start then it never stopped growing until OoM with 28GB of RSS used by ceph-osd process. In normal conditions, usage is between 3G to 7G.

You can review my cluster's info above if needed (https://tracker.ceph.com/issues/53729#note-29)

Thanks! Now we're getting somewhere.
It's interesting that the memory fluctuates so much while nothing is going on:
[...]

So we'll need more debug levels. Can you please run again with

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5 --debug_bluestore=10

Also, could you share the entire log from when the OSD starts, so we can see the pglog reading?
The log is going to be large, so you probably need to use dropbox or something like that.

I'm using Kubernetes and during several minutes after startup, container's log is being rotated every 10 seconds because it exceeds 200MB...yes, every 10 seconds. I rarely see that amount of logs being written by an app ;)
Also, during 5 minutes, RAM usage is pretty low (150MB/200MB), I guess it's because of log writes having a substantial impact on performances. But memory consumption then rises as expected and fortunately log grow pretty slowly after that.
So bad news is I don't have startup log for you, but good news is that attached logs should features interesting things, when RAM is growing at a steady pace with no apparent reason.

Hope it helps

#50 Updated by Mykola Golub 9 months ago

FYI, in our case I described in [1] and following comments (15 osd cluster, after changing pg_num from 526 to 1026 osds the osd consuming all available memory 64G ram and OOM killed), we recovered our data (rbd images) with ceph-objectstore-tool, rebuilt images and exported in another cluster. But we still keep this cluster for investigation. Unfortunately I don't have much time to troubleshoot this now, but if someone has idea what should I try or what data provide I would be happy to do.

I ran an osd in gdb and confirmed that at the moment when it is "eating" memory it in pg split. Here is a typical "tp_osd_tp" thread backtrace when pressing ctr-c (the upper frames , above split_out_child, differ):

Thread 432 (Thread 0x7fff7cf95640 (LWP 1104054) "tp_osd_tp"):
#0 0x0000555555c3ac25 in std::__detail::_Hashtable_base<osd_reqid_t, std::pair<osd_reqid_t const, pg_log_dup_t*>, std::__detail::_Select1st, std::equal_to<osd_reqid_t>, std::hash<osd_reqid_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Hashtable_traits<true, false, true> >::_S_equals (_n=..., __c=<optimized out>) at /usr/include/c++/11/bits/hashtable_policy.h:1596
#1 std::
_detail::_Hashtable_base<osd_reqid_t, std::pair<osd_reqid_t const, pg_log_dup_t*>, std::__detail::_Select1st, std::equal_to<osd_reqid_t>, std::hash<osd_reqid_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Hashtable_traits<true, false, true> >::_M_equals (_n=..., __c=<optimized out>, __k=..., this=<optimized out>) at /usr/include/c++/11/bits/hashtable_policy.h:1617
#2 std::_Hashtable<osd_reqid_t, std::pair<osd_reqid_t const, pg_log_dup_t*>, std::allocator<std::pair<osd_reqid_t const, pg_log_dup_t*> >, std::
_detail::_Select1st, std::equal_to<osd_reqid_t>, std::hash<osd_reqid_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (this=this@entry=0x7fff7cf916e8, _bkt=_bkt@entry=562494, _k=..., __code=_code@entry=11252949) at /usr/include/c++/11/bits/hashtable.h:1840
#3 0x0000555555c3ae18 in std::_Hashtable<osd_reqid_t, std::pair<osd_reqid_t const, pg_log_dup_t*>, std::allocator<std::pair<osd_reqid_t const, pg_log_dup_t*> >, std::__detail::_Select1st, std::equal_to<osd_reqid_t>, std::hash<osd_reqid_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_node (_c=11252949, __key=..., __bkt=562494, this=0x7fff7cf916e8) at /usr/include/c++/11/bits/hashtable.h:810
#4 std::
_detail::_Map_base<osd_reqid_t, std::pair<osd_reqid_t const, pg_log_dup_t*>, std::allocator<std::pair<osd_reqid_t const, pg_log_dup_t*> >, std::__detail::_Select1st, std::equal_to<osd_reqid_t>, std::hash<osd_reqid_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[] (this=this@entry=0x7fff7cf916e8, _k=...) at /usr/include/c++/11/bits/hashtable_policy.h:704
#5 0x0000555555c3b85f in std::unordered_map<osd_reqid_t, pg_log_dup_t*, std::hash<osd_reqid_t>, std::equal_to<osd_reqid_t>, std::allocator<std::pair<osd_reqid_t const, pg_log_dup_t*> > >::operator[] (
_k=..., this=0x7fff7cf916e8) at /usr/include/c++/11/bits/unordered_map.h:979
#6 PGLog::IndexedLog::index (this=0x7fff7cf915b0, to_index=<optimized out>) at ./src/osd/PGLog.h:462
#7 0x0000555555c4c759 in PGLog::IndexedLog::IndexedLog<pg_log_t> (this=0x7fff7cf915b0) at ./src/osd/PGLog.h:222
#8 PGLog::IndexedLog::split_out_child (this=this@entry=0x555566d59d00, child_pgid=..., split_bits=split_bits@entry=10, target=target@entry=0x7ffb7612e210) at ./src/osd/PGLog.cc:50
#9 0x0000555555e56393 in PGLog::split_into (this=this@entry=0x555566d59c58, child_pgid=..., split_bits=split_bits@entry=10, opg_log=opg_log@entry=0x7ffb7612e168) at ./src/osd/PGLog.h:854
#10 0x0000555555e04497 in PeeringState::split_into (this=this@entry=0x555566d58d78, child_pgid=..., child=child@entry=0x7ffb7612d288, split_bits=split_bits@entry=10) at ./src/osd/PeeringState.cc:3266
#11 0x0000555555c31a9f in PG::split_into (this=this@entry=0x555566d57e60, child_pgid=..., child=child@entry=0x7ffb7612c370, split_bits=split_bits@entry=10) at ./src/osd/PG.cc:543
#12 0x0000555555b7d775 in OSD::split_pgs (this=0x55555f1984c0, parent=0x555566d57e60, childpgids=..., out_pgs=<optimized out>, curmap=..., nextmap=..., rctx=...) at ./src/osd/OSD.cc:9131
#13 0x0000555555b7e471 in OSD::advance_pg (this=<optimized out>, osd_epoch=9167, pg=<optimized out>, handle=..., rctx=...) at ./src/osd/OSD.cc:8805
#14 0x0000555555b803f6 in OSD::dequeue_peering_evt (this=0x55555f1984c0, sdata=<optimized out>, pg=0x555566d57e60, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...) at ./src/osd/OSDMap.h:683
#15 0x0000555555ddbb47 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at ./src/osd/scheduler/OpSchedulerItem.cc:42
#16 0x0000555555b94c83 in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7fff7cf922f0) at ./src/osd/scheduler/OpSchedulerItem.h:148
#17 OSD::ShardedOpWQ::_process (this=0x55555f1993e8, thread_index=<optimized out>, hb=<optimized out>) at ./src/osd/OSD.cc:11142
#18 0x0000555556228144 in ShardedThreadPool::shardedthreadpool_worker (this=0x55555f198f18, thread_index=1) at ./src/common/WorkQueue.cc:311
#19 0x000055555622a3f0 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at ./src/common/WorkQueue.h:637
#20 0x00007ffff76bed80 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#21 0x00007ffff75d876f in clone () from /lib/x86_64-linux-gnu/libc.so.6

[1] https://tracker.ceph.com/issues/53729#note-31

#51 Updated by Dan van der Ster 9 months ago

Guillaume Fenollar wrote:

Dan van der Ster wrote:

Guillaume Fenollar wrote:

Dan van der Ster wrote:

Could you revert that and try running

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5

Ok, here my new test attached. I added 12G of swap so we could have as much as information as necessary. RAM usage then is growing steadily. It was kind of released minutes after a fresh start then it never stopped growing until OoM with 28GB of RSS used by ceph-osd process. In normal conditions, usage is between 3G to 7G.

You can review my cluster's info above if needed (https://tracker.ceph.com/issues/53729#note-29)

Thanks! Now we're getting somewhere.
It's interesting that the memory fluctuates so much while nothing is going on:
[...]

So we'll need more debug levels. Can you please run again with

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5 --debug_bluestore=10

Also, could you share the entire log from when the OSD starts, so we can see the pglog reading?
The log is going to be large, so you probably need to use dropbox or something like that.

I'm using Kubernetes and during several minutes after startup, container's log is being rotated every 10 seconds because it exceeds 200MB...yes, every 10 seconds. I rarely see that amount of logs being written by an app ;)
Also, during 5 minutes, RAM usage is pretty low (150MB/200MB), I guess it's because of log writes having a substantial impact on performances. But memory consumption then rises as expected and fortunately log grow pretty slowly after that.
So bad news is I don't have startup log for you, but good news is that attached logs should features interesting things, when RAM is growing at a steady pace with no apparent reason.

Hope it helps

In this new log the memory starts to grow just after pg 17.1e splits into 17.5e:

2022-03-11T23:55:22.992064614+11:00 stderr F 2022-03-11T12:55:22.985+0000 7fe832ec7700 10 osd.9 461124 split_pgs splitting pg[17.1e( v 460742'491108481 (460742'491108481,460742'491108481] local-lis/les=460743/460744 n=8448 ec=29241/213 lis/c=460743/460743 les/c/f=460744/460744/103371 sis=460988) [11] r=-1 lpr=460988 pi=[460743,460988)/2 crt=460742'491108481 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] into 17.5e
2022-03-11T23:55:22.992079536+11:00 stderr F 2022-03-11T12:55:22.985+0000 7fe832ec7700 10 osd.9 461124 _make_pg 17.5e
2022-03-11T23:55:22.992102532+11:00 stderr F 2022-03-11T12:55:22.985+0000 7fe832ec7700  5 osd.9 pg_epoch: 460988 pg[17.5e(unlocked)] enter Initial
2022-03-11T23:55:22.99214978+11:00 stderr F 2022-03-11T12:55:22.985+0000 7fe832ec7700 20 osd.9 pg_epoch: 460988 pg[17.5e(unlocked)] enter NotTrimming
2022-03-11T23:55:22.992177225+11:00 stderr F 2022-03-11T12:55:22.985+0000 7fe832ec7700 10 bluestore(/var/lib/ceph/osd/ceph-9) _osr_attach 17.5e_head reusing osr 0x55563cd312c0 from existing coll 0x55563a35f680
2022-03-11T23:55:22.992188881+11:00 stderr F 2022-03-11T12:55:22.985+0000 7fe832ec7700 10 osd.9 461124  pg_num is 128, m_seed 94, split_bits is 7
2022-03-11T23:55:23.931318958+11:00 stderr F 2022-03-11T12:55:23.925+0000 7fe84acf1700  5 prioritycache tune_memory target: 4294967296 mapped: 5833392128 unmapped: 8458125312 heap: 14291517440 old mem: 134217728 new mem: 134217728
...
2022-03-11T23:55:57.194674897+11:00 stderr F 2022-03-11T12:55:57.187+0000 7fe84acf1700  5 prioritycache tune_memory target: 4294967296 mapped: 13845504000 unmapped: 446013440 heap: 14291517440 old mem: 134217728 new mem: 134217728

The last osd/bluestore dout is:

7fe832ec7700 10 osd.9 461124  pg_num is 128, m_seed 94, split_bits is 7

which is here: https://github.com/ceph/ceph/blob/v15.2.16/src/osd/OSD.cc#L8957

The next line is parent->split_colls() from PrimaryLogPG.h which does:

    coll_t target = coll_t(child);
    create_pg_collection(t, child, split_bits);
    t.split_collection(

I can't tell if coll_t(child) should print anything. But create_pg_collection(t, child, split_bits) should eventually reach the end of BlueStore::_create_collection and dout(10) here:
https://github.com/ceph/ceph/blob/v15.2.16/src/os/bluestore/BlueStore.cc#L15605

Since we don't have that dout in your log, I gather the slowdown is somewhere in _create_collection or in the calling of that function.

Does another expert have some input here?

Guillame is it possible to catch this exact situation with debug_bluestore=20, i.e

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5 --debug_bluestore=20

#52 Updated by Guillaume Fenollar 9 months ago

Dan van der Ster wrote:

Guillame is it possible to catch this exact situation with debug_bluestore=20, i.e

ceph-osd --debug_ms=1 --debug_osd=20 --debug_prioritycache=5 --debug_bluestore=20

Wow, it's pretty verbose with this setup, it's writing 15/20GB of log per second... What I did is "tail -f |grep '_make_pg 17.'" to see the first split and copy the file log seconds to a temp location right after so it could be cut easily. I found ~118 occurences of _make_pg logline (I guess number of PGs being created). Finally, I grepped 10000 lines above and below each _make_pg line (grep -C 10000 _make_pg).
Log is 145MB, ~8MB bzipped : https://drive.google.com/file/d/18lz_NXgNVUTtV5gQMcZJYFY7LUPIJfQR/view?usp=sharing

Hope that helps.
Tell me if for instance you want more than 10000 lines but on only one _make_pg operation.

#53 Updated by Dan van der Ster 9 months ago

Updating here based on further investigations offline with Guillaume.

In his logs we saw bluestore iterating through many thousands of omap pg_log dup records, so I asked him to print the log and dup records per pg, with this oneliner:

for p in $PGS; do echo $p; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op log --pgid $p | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; echo; done

Here's the output -- it goes oom loading the log so we can't even get the count:

(...)

17.57
2710
544

17.58
2718
619

17.5b
2384
637

17.5e
tcmalloc: large alloc 1073750016 bytes == 0x55d8486b6000 @   0x7f07889d0760 0x7f07889f1c64 0x55d4fb1d96da 0x7f077e22d218 0x7f077e237d5b 0x7f077e227da4 0x7f077e22813b 0x7f077fe3f0eb 0x7f077fe3ff11 0x7f077fe40025 0x7f077fe3fe58 0x7f077fe41df6 0x55d4fb43e129 0x55d4fb469b3d 0x55d4fb1ba849 0x55d4fb1931b8 0x7f077d7e7ca3 0x55d4fb1b7d5e
tcmalloc: large alloc 2147491840 bytes == 0x55d8886b8000 @   0x7f07889d0760 0x7f07889f1c64 0x55d4fb1d96da 0x7f077e22d218 0x7f077e237d5b 0x7f077e227da4 0x7f077e22813b 0x7f077fe3f13b 0x7f077fe3ff11 0x7f077fe40025 0x7f077fe3fe58 0x7f077fe403dc 0x55d4fb469b46 0x55d4fb1ba849 0x55d4fb1931b8 0x7f077d7e7ca3 0x55d4fb1b7d5e
tcmalloc: large alloc 4294975488 bytes == 0x55d9086ba000 @   0x7f07889d0760 0x7f07889f1c64 0x55d4fb1d96da 0x7f077e22d218 0x7f077e237d5b 0x7f077e227da4 0x7f077e22813b 0x7f077fe3f13b 0x7f077fe3ff11 0x7f077fe40025 0x7f077fe3fe58 0x7f077fe41df6 0x55d4fb43e192 0x55d4fb469b3d 0x55d4fb1ba849 0x55d4fb1931b8 0x7f077d7e7ca3 0x55d4fb1b7d5e
tcmalloc: large alloc 8589942784 bytes == 0x55da08ebc000 @   0x7f07889d0760 0x7f07889f1c64 0x55d4fb1d96da 0x7f077e22d218 0x7f077e237d5b 0x7f077e227da4 0x7f077fe3ff76 0x7f077fe40025 0x7f077fe3fe58 0x7f077fe41df6 0x55d4fb43e192 0x55d4fb469b3d 0x55d4fb1ba849 0x55d4fb1931b8 0x7f077d7e7ca3 0x55d4fb1b7d5e
tcmalloc: large alloc 6058819584 bytes == 0x55d8485ec000 @   0x7f07889d0760 0x7f07889f1c64 0x55d4fb1e057d 0x55d4fb1e36db 0x7f077e22cb0f 0x7f077fe40b2b 0x55d4fb1ba85e 0x55d4fb1931b8 0x7f077d7e7ca3 0x55d4fb1b7d5e

And Out Of Memory less than a minute after, consuming all server's RAM :)

From other logs, we know that pg 17.5e has 2439 log entries, a reasonable number. So the dups are very likely the root cause here.

Some understanding and questions about dups based on reading the code in PGLog::trim :
  • dups are supposed to be kept for up to osd_pg_log_dups_tracked (3000) versions, but I am not sure they are bound by a number of entries.
  • dups won't be trimmed on a relatively idle PG. (only the dups for ops older than 3000 versions ago are trimmed).
  • I don't think dups are trimmed according to the pg_log head/tail. Does it make sense that we keep dup entries for versions < head ?.
  • splitting probably copies the dups to the new PG ? (I have no log evidence of this, just guessing about why split is the trigger for this issue on many clusters).
  • ceph-objectstore-tool --op trim-pg-log does not trim dups.

Regarding the question about trimming older than head: on my own test cluster I found PGs with 0 log entries but more than 6500 dup entries, e.g.

{
    "pg_log_t": {
        "head": "2611'245",
        "tail": "2611'245",
        "log": [],
        "dups": [
            {
                "reqid": "client.969511.0:6277",
                "version": "2391'1",
                "user_version": "1",
                "return_code": "0" 
            },
...
            {
                "reqid": "osd.1.0:2181",
                "version": "2611'245",
                "user_version": "0",
                "return_code": "0" 
            }
        ]
    },
    "pg_missing_t": {
        "missing": [],
        "may_include_deletes": true
    }
}

#54 Updated by Mykola Golub 9 months ago

Just to confirm Dan's findings, I am attaching the output of the command Dan asked me to run (number ob pg log and pg log dups entries). In our case we also observe large number (millions) of dup entries. The pg log numbers are small because before this I tried to trim the pg log with ceph-objectstore-tool, and it trimmed pg log entries but not dups.

for p in `cat /tmp/pgs`; do echo $p; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op log --pgid $p | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; echo; done

#55 Updated by Dan van der Ster 9 months ago

Mykola Golub wrote:

Just to confirm Dan's findings, I am attaching the output of the command Dan asked me to run (number ob pg log and pg log dups entries). In our case we also observe large number (millions) of dup entries. The pg log numbers are small because before this I tried to trim the pg log with ceph-objectstore-tool, and it trimmed pg log entries but not dups.

for p in `cat /tmp/pgs`; do echo $p; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op log --pgid $p | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; echo; done

Thanks Mykola & Guillaume!

We have discussed this in the Core Standup.

1. We will check that the dup trimming logic in PGLog::IndexedLog::trim is capping the number of dups correctly.
2. Currently pg split copies the dups, which can be a source of dups increasing without any trimming. It will be checked if pg_split can also trim the dups prior to copying. (This will help PGs which are not getting many writes, because trim is currently only done during writes).
3. ceph-objectstore-tool needs an op to trim pg_log dups, to handle the disaster scenarios. It will be checked why the trim implementation isn't reused between online/offline.

Mykola, to help understand the issue better, could you please attach the entire --op log json output for one of the PGs with millions of dups ?

#56 Updated by Mykola Golub 9 months ago

Dan van der Ster wrote:

Mykola, to help understand the issue better, could you please attach the entire --op log json output for one of the PGs with millions of dups ?

The file is too big to attach (24M). I copied it to teuthology frontend in my home directory: /home/trociny/2.85.log.json.gz (the upstream developers should have access to it). If you have problems with accessing it, please let me know, I will think about some other way to provide it.

#57 Updated by Josh Durgin 9 months ago

  • Backport set to octopus, pacific, quincy

Thanks Mykola and Dan, this shows exactly what the issue is! There are dup entries with a version higher than the log entries.

Relying on version number to trim the log - was a heuristic to avoid old c++ stdlib O(n) list.size(). We should just switch to counting dup entries trimmed instead of using version at all (I don't think we even need to store the version anymore, but that would be a bigger change since it alters the persistent format and introduces extra complexity above what is required to fix this).

We stop trimming the dups as soon as trimming encounters any version number higher than earliest_dup_version, aka the tail of the log - dups_tracked:

void PGLog::IndexedLog::trim(
...

  auto earliest_dup_version =
    log.rbegin()->version.version < cct->_conf->osd_pg_log_dups_tracked
    ? 0u
    : log.rbegin()->version.version - cct->_conf->osd_pg_log_dups_tracked + 1;

...

  while (!dups.empty()) {
    const auto& e = *dups.begin();
    if (e.version.version >= earliest_dup_version)
      break;
    lgeneric_subdout(cct, osd, 20) << "trim dup " << e << dendl;
    if (trimmed_dups)
      trimmed_dups->insert(e.get_key_name());
    unindex(e);
    dups.pop_front();
  }

This means that if there is any dup entry with a higher version than the tail of the log, we will not trim anything past it, but we will keep accumulating new dups as we trim pg_log_entries and add them to the back of the dup list. This explains how we ended up with so many dup entries in this case - Mykola's log showed many at the beginning of the list with much higher versions, and many at the end with lower versions.

We got into this situation to begin with due to the way pg splitting handles the pg log and dups. Take a look at pg_log_t::split_out_child:

  pg_log_t split_out_child(pg_t child_pgid, unsigned split_bits) {
    mempool::osd_pglog::list<pg_log_entry_t> oldlog, childlog;
    oldlog.swap(log);

    eversion_t old_tail;
    unsigned mask = ~((~0)<<split_bits);
    for (auto i = oldlog.begin();
     i != oldlog.end();
      ) {
      if ((i->soid.get_hash() & mask) == child_pgid.m_seed) {
    childlog.push_back(*i);
      } else {
    log.push_back(*i);
      }
      oldlog.erase(i++);
    }

    // osd_reqid is unique, so it doesn't matter if there are extra
    // dup entries in each pg. To avoid storing oid with the dup
    // entries, just copy the whole list.
    auto childdups(dups);

    return pg_log_t(
      head,
      tail,
      can_rollback_to,
      rollback_info_trimmed_to,
      std::move(childlog),
      std::move(childdups));
    }

We are separating the pg log entries into the child and original pg because we have the object id and can figure out which pg they belong to. However, for dup entries, we don't store the object id to save memory, instead copying the full list. This means we'll have dup entries in one pg with versions that are higher than the log entries. In particular, if there are many updates to a single object, all those log entries would belong to one child pg, which could result in a large version gap between the other pg and its dups. This usually isn't an issue because there's normally not much of a difference between versions, so more updates to both pgs would end up quickly catching up the lower version number to the versions in the dups, and they would be trimmed. However, with repeated splits on a pg with a heavily updated object, this version skew could persist and worsen as each child without the hot object would need more updates to catch up. I believe that explains why we haven't seen this as frequently before the autoscaler was on by default.

Fortunately, the fix is relatively small - by using the number of entries rather than the versions when trimming dups, any skew between the dups and log versions does not matter, and we prevent unbounded dup growth. This should apply to all versions that have dups (iirc jewel or luminous and later).

#58 Updated by Neha Ojha 9 months ago

  • Status changed from Need More Info to New
  • Assignee set to Nitzan Mordechai

#59 Updated by Neha Ojha 9 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 45529

#60 Updated by Radoslaw Zarzynski 8 months ago

  • Related to deleted (Bug #54296: OSDs using too much memory)

#61 Updated by Radoslaw Zarzynski 8 months ago

  • Duplicated by Bug #54296: OSDs using too much memory added

#62 Updated by Gonzalo Aguilar Delgado 8 months ago

I've built the repo from :NitzanMordhai/ceph.git branch origin/wip-nitzan-pglog-dups-not-trimmed

And gave it a try:

CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12
Doesn't improve things...

CEPH_ARGS="--osd_pg_log_trim_max=500000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op log --pgid 9.12 | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'
256
984033

Booting ceph osd gives me the same trouble:

2022-04-20T10:27:08.783+0000 7f3509de1240 1 osd.3 1529107 maybe_override_options_for_qos: Changing recovery/backfill/sleep settings for QoS
2022-04-20T10:27:08.795+0000 7f34ef989640 1 osd.3 pg_epoch: 1492060 pg[9.12( v 1492033'2724365 (1491996'2724109,1492033'2724365] local-lis/les=1389428/1389429 n=1953 ec=417/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1492060) [] r=-1 lpr=1492060 pi=[1389428,1492060)/1 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [2,9,13] -> [], acting [2,9,13] -> [], acting_primary 2 -> -1, up_primary 2 -> -1, role -1 -> -1, features acting 4540138303579357183 upacting 4540138303579357183
2022-04-20T10:27:08.799+0000 7f34ef989640 1 osd.3 pg_epoch: 1492067 pg[9.12( v 1492033'2724365 (1491996'2724109,1492033'2724365] local-lis/les=1389428/1389429 n=1953 ec=417/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1492067) [2] r=-1 lpr=1492067 pi=[1389428,1492067)/1 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [] -> [2], acting [] -> [2], acting_primary ? -> 2, up_primary ? -> 2, role -1 -> -1, features acting 4540138294989422591 upacting 4540138294989422591
2022-04-20T10:27:08.799+0000 7f34ef989640 1 osd.3 pg_epoch: 1492072 pg[9.12( v 1492033'2724365 (1491996'2724109,1492033'2724365] local-lis/les=1389428/1389429 n=1953 ec=417/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1492072) [2] r=-1 lpr=1492072 pi=[1389428,1492072)/1 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [2] -> [2], acting [2] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role -1 -> -1, features acting 4540138294989422591 upacting 4540138294989422591
^C2022-04-20T10:27:25.143+0000 7f3506403640 -1 received signal: Interrupt, si_code : 128, si_value (int): 0, si_value (ptr): 0, si_errno: 0, si_pid : 0, si_uid : 0, si_addr0, si_status0

And memory oom or just taking all server memory. I will leave it running... just in case...

I post a screenshot

#63 Updated by Gonzalo Aguilar Delgado 8 months ago

After a while it crashed...
-34> 2022-04-20T11:02:25.218+0000 7f72b3b83640 5 rocksdb: commit_cache_size High Pri Pool Ratio set to 0
-33> 2022-04-20T11:02:25.218+0000 7f72b3b83640 5 rocksdb: commit_cache_size High Pri Pool Ratio set to 0
-32> 2022-04-20T11:02:25.218+0000 7f72b3b83640 5 bluestore.MempoolThread(0x55870624ab38) _resize_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 67087776 kv_onode_alloc: 67108864 kv_onode_used: 6998192 meta_alloc: 67108864 meta_used: 2239838 data_alloc: 67108864 data_used: 2093056
-31> 2022-04-20T11:02:25.294+0000 7f72a5366640 5 osd.3 1529187 heartbeat osd_stat(store_statfs(0xe6ce1cb000/0x0/0xe8e0c00000, data 0x2008a13e6/0x20c233000, compress 0x0/0x0/0x0, omap 0x71871, meta 0x678e78f), peers [] op hist [])
-30> 2022-04-20T11:02:25.514+0000 7f72b2380640 10 monclient: tick
-29> 2022-04-20T11:02:25.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:01:55.520491+0000)
-28> 2022-04-20T11:02:25.718+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18345336832 unmapped: 610516992 heap: 18955853824 old mem: 134217728 new mem: 134217728
-27> 2022-04-20T11:02:26.514+0000 7f72b2380640 10 monclient: tick
-26> 2022-04-20T11:02:26.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:01:56.520630+0000)
-25> 2022-04-20T11:02:26.718+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18346090496 unmapped: 609763328 heap: 18955853824 old mem: 134217728 new mem: 134217728
-24> 2022-04-20T11:02:27.514+0000 7f72b2380640 10 monclient: tick
-23> 2022-04-20T11:02:27.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:01:57.520759+0000)
-22> 2022-04-20T11:02:27.722+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18346508288 unmapped: 609345536 heap: 18955853824 old mem: 134217728 new mem: 134217728
-21> 2022-04-20T11:02:28.514+0000 7f72b2380640 10 monclient: tick
-20> 2022-04-20T11:02:28.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:01:58.520869+0000)
-19> 2022-04-20T11:02:28.722+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18346754048 unmapped: 609099776 heap: 18955853824 old mem: 134217728 new mem: 134217728
-18> 2022-04-20T11:02:29.514+0000 7f72b2380640 10 monclient: tick
-17> 2022-04-20T11:02:29.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:01:59.521010+0000)
-16> 2022-04-20T11:02:29.722+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18347278336 unmapped: 608575488 heap: 18955853824 old mem: 134217728 new mem: 134217728
-15> 2022-04-20T11:02:29.994+0000 7f72a5366640 5 osd.3 1529187 heartbeat osd_stat(store_statfs(0xe6ce1cb000/0x0/0xe8e0c00000, data 0x2008a13e6/0x20c233000, compress 0x0/0x0/0x0, omap 0x71871, meta 0x678e78f), peers [] op hist [])
-14> 2022-04-20T11:02:30.226+0000 7f72b3b83640 5 rocksdb: commit_cache_size High Pri Pool Ratio set to 0
-13> 2022-04-20T11:02:30.226+0000 7f72b3b83640 5 rocksdb: commit_cache_size High Pri Pool Ratio set to 0
-12> 2022-04-20T11:02:30.226+0000 7f72b3b83640 5 bluestore.MempoolThread(0x55870624ab38) _resize_shards cache_size: 134217728 kv_alloc: 67108864 kv_used: 67087776 kv_onode_alloc: 67108864 kv_onode_used: 6998192 meta_alloc: 67108864 meta_used: 2239838 data_alloc: 67108864 data_used: 2093056
-11> 2022-04-20T11:02:30.514+0000 7f72b2380640 10 monclient: tick
-10> 2022-04-20T11:02:30.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:02:00.521139+0000)
-9> 2022-04-20T11:02:30.726+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18347581440 unmapped: 608272384 heap: 18955853824 old mem: 134217728 new mem: 134217728
-8> 2022-04-20T11:02:31.342+0000 7f72a6368640 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7f72a6368640' had timed out after 15.000000954s
-7> 2022-04-20T11:02:31.514+0000 7f72b2380640 10 monclient: tick
-6> 2022-04-20T11:02:31.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:02:01.521253+0000)
-5> 2022-04-20T11:02:31.390+0000 7f72a6368640 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7f72a6368640' had suicide timed out after 150.000000000s
-4> 2022-04-20T11:02:31.814+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18287927296 unmapped: 667926528 heap: 18955853824 old mem: 134217728 new mem: 134217728
-3> 2022-04-20T11:02:31.938+0000 7f72a6368640 -1 ** Caught signal (Aborted) *
in thread 7f72a6368640 thread_name:tp_osd_tp

ceph version 17.0.0-11084-g20502795a99 (20502795a990145efad918417cafed4c343aaa8b) quincy (dev)
1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f72c08f6520]
2: pthread_kill()
3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point&lt;ceph::coarse_mono_clock, std::chrono::duration&lt;unsigned long, std::ratio&lt;1l, 1000000000l&gt; > >)+0x485) [0x55870462a965]
4: (ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x61) [0x55870462abb1]
5: (BlueStore::queue_transactions(boost::intrusive_ptr&lt;ObjectStore::CollectionImpl&gt;&, std::vector&lt;ceph::os::Transaction, std::allocator&lt;ceph::os::Transaction&gt; >&, boost::intrusive_ptr&lt;TrackedOp&gt;, ThreadPool::TPHandle*)+0x4d6) [0x558704500556]
6: (ObjectStore::queue_transaction(boost::intrusive_ptr&lt;ObjectStore::CollectionImpl&gt;&, ceph::os::Transaction&&, boost::intrusive_ptr&lt;TrackedOp&gt;, ThreadPool::TPHandle*)+0xad) [0x558703fb193d]
7: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr&lt;OSDMap const&gt;, ThreadPool::TPHandle*)+0x1ae) [0x558703f58d7e]
8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr&lt;PGPeeringEvent&gt;, ThreadPool::TPHandle&)+0x2c7) [0x558703f81827]
9: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&lt;PG&gt;&, ThreadPool::TPHandle&)+0x5b) [0x55870422552b]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x993) [0x558703f72633]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x414) [0x558704651044]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5587046535a4]
13: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f72c0948b43]
14: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f72c09daa00]
NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.
2> 2022-04-20T11:02:32.514+0000 7f72b2380640 10 monclient: tick
-1> 2022-04-20T11:02:32.514+0000 7f72b2380640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-04-20T11:02:02.521412+0000)
0> 2022-04-20T11:02:32.822+0000 7f72b3b83640 5 prioritycache tune_memory target: 2147483648 mapped: 18293686272 unmapped: 662167552 heap: 18955853824 old mem: 134217728 new mem: 134217728
--
logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_t
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_lba_details
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
2/-2 (syslog threshold)
99/99 (stderr threshold)
--
pthread ID / name mapping for recent threads ---
7f72a5366640 / osd_srv_heartbt
7f72a6368640 / tp_osd_tp
7f72aeb79640 / ceph-osd
7f72af37a640 / cfin
7f72afb7b640 / bstore_kv_sync
7f72b2380640 / safe_timer
7f72b3382640 / ms_dispatch
7f72b3b83640 / bstore_mempool
7f72b4384640 / rocksdb:high0
7f72b4b85640 / rocksdb:low1
7f72b5386640 / rocksdb:low0
7f72bbdcf640 / io_context_pool
7f72bd5e3640 / io_context_pool
7f72bdde4640 / admin_socket
7f72be5e5640 / msgr-worker-2
7f72bede6640 / msgr-worker-1
7f72bf5e7640 / msgr-worker-0
7f72c07c0240 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/lib/ceph/crash/2022-04-20T11:02:31.853432Z_f42edf35-e681-4beb-95c5-50fe5cacd0de/log
--- end dump of recent events ---
Abortado (`core' generado)

#64 Updated by Nitzan Mordechai 8 months ago

Can you please add the output of trim-pg-log ?
CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12

#65 Updated by Mykola Golub 8 months ago

Just as information that might be useful for someone. Although ceph-objectstore-tool is a more reliable way to confirm an osd is affected by "pglog dups" issue, there is a simpler way to check your cluster if it might be affected:

Run `ceph tell osd.\* perf dump |grep 'osd_pglog\|^osd\.[0-9]'`

The output will be something like below:

osd.0: {
        "osd_pglog_bytes": 207381176,
        "osd_pglog_items": 421823,
osd.1: {
        "osd_pglog_bytes": 110917560,
        "osd_pglog_items": 224615,
...

The osd_pglog_items number is a sum of "normal" pg log entries, pg log dups, and some other items (which are expected to be not large). With the default osd_target_pg_log_entries_per_osd = 300000, if you observe about 300 thousand items for an osd, then these items are most likely "normal" pg log entries, and you are not affected. But if you observe much larger (order of magnitude or more) value then it is most likely due to dups. Then you may confirm it by stopping the osd and running `ceph-objectstore-tool -op log` for every pg from this osd, as described earlier in the ticket.

#66 Updated by Gonzalo Aguilar Delgado 8 months ago

Nitzan Mordechai wrote:

Can you please add the output of trim-pg-log ?
CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12

CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12
Log bounds are: (1491996'2724109,1492033'2724365]
Log not larger than osd_max_pg_log_entries 2000
Finished trimming pg log

#67 Updated by Gonzalo Aguilar Delgado 8 months ago

Mykola Golub wrote:

Just as information that might be useful for someone. Although ceph-objectstore-tool is a more reliable way to confirm an osd is affected by "pglog dups" issue, there is a simpler way to check your cluster if it might be affected:

Run `ceph tell osd.\* perf dump |grep 'osd_pglog\|^osd\.[0-9]'`

The output will be something like below:
[...]

The osd_pglog_items number is a sum of "normal" pg log entries, pg log dups, and some other items (which are expected to be not large). With the default osd_target_pg_log_entries_per_osd = 300000, if you observe about 300 thousand items for an osd, then these items are most likely "normal" pg log entries, and you are not affected. But if you observe much larger (order of magnitude or more) value then it is most likely due to dups. Then you may confirm it by stopping the osd and running `ceph-objectstore-tool -op log` for every pg from this osd, as described earlier in the ticket.

osd.0: {
"osd_pglog_bytes": 874400800,
"osd_pglog_items": 8284984,
osd.1: {
"osd_pglog_bytes": 1091064,
"osd_pglog_items": 8787,
osd.2: {
"osd_pglog_bytes": 1344324904,
"osd_pglog_items": 12857089,
Error ENXIO: problem getting command descriptions from osd.3
osd.4: {
"osd_pglog_bytes": 106298296,
"osd_pglog_items": 789747,
osd.5: {
"osd_pglog_bytes": 3120,
"osd_pglog_items": 30,
osd.6: {
"osd_pglog_bytes": 1071519072,
"osd_pglog_items": 10110172,
osd.7: {
"osd_pglog_bytes": 744536,
"osd_pglog_items": 4983,
osd.9: {
"osd_pglog_bytes": 1107992080,
"osd_pglog_items": 10572406,
osd.10: {
"osd_pglog_bytes": 641016064,
"osd_pglog_items": 6037488,
osd.11: {
"osd_pglog_bytes": 495602016,
"osd_pglog_items": 4608192,
osd.12: {
"osd_pglog_bytes": 383319560,
"osd_pglog_items": 3614277,
osd.13: {
"osd_pglog_bytes": 4248597080,
"osd_pglog_items": 40363483,

#68 Updated by Nitzan Mordechai 8 months ago

Gonzalo Aguilar Delgado wrote:

Nitzan Mordechai wrote:

Can you please add the output of trim-pg-log ?
CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12

CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12
Log bounds are: (1491996'2724109,1492033'2724365]
Log not larger than osd_max_pg_log_entries 2000
Finished trimming pg log

I pushed new code to handle that issue, If the osd_max_pg_log_entries <= number of entries in the pg_log then we are skipping the trimming process. with he new added function, we will also count the duplicate entries, if there is something to trim, we will scan the pg_log.
please try the new code.

#69 Updated by Gonzalo Aguilar Delgado 8 months ago

Nitzan Mordechai wrote:

Gonzalo Aguilar Delgado wrote:

Nitzan Mordechai wrote:

Can you please add the output of trim-pg-log ?
CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12

CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12
Log bounds are: (1491996'2724109,1492033'2724365]
Log not larger than osd_max_pg_log_entries 2000
Finished trimming pg log

I pushed new code to handle that issue, If the osd_max_pg_log_entries <= number of entries in the pg_log then we are skipping the trimming process. with he new added function, we will also count the duplicate entries, if there is something to trim, we will scan the pg_log.
please try the new code.

Hello,
It's giving me a core...

CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Abortado (`core' generado)

#70 Updated by Mykola Golub 8 months ago

Gonzalo Aguilar Delgado wrote:

CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Abortado (`core' generado)

This looks like a library mismatch. Is your LD_LIBRARY_PATH pointing to the right direction? What is the output of `LD_LIBRARY_PATH=lib/ ldd ./ceph-objectstore-tool`?

#71 Updated by Gonzalo Aguilar Delgado 8 months ago

Mykola Golub wrote:

Gonzalo Aguilar Delgado wrote:

CEPH_ARGS="--osd_pg_log_trim_max=10000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op trim-pg-log --pgid 9.12
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Abortado (`core' generado)

This looks like a library mismatch. Is your LD_LIBRARY_PATH pointing to the right direction? What is the output of `LD_LIBRARY_PATH=lib/ ldd ./ceph-objectstore-tool`?

Exactly! I rebuilt from scratch and now it's working perfectly.

CEPH_ARGS="--osd_pg_log_trim_max=500000 --osd_max_pg_log_entries=2000 " LD_LIBRARY_PATH=lib/ ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op log --pgid 9.12 | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'

100
2999

Will try to boot now...

#72 Updated by Gonzalo Aguilar Delgado 8 months ago

Yesssss!!! Great job team!

It's up & running. It purged dups, booted the ceph-osd and only 1/2Gb RAM full booted.

But the story doesn't end here. We must merge this in and get it released. ;-)

For it's even larger, since I purged all data that was in this situation from the cluster. Now I have to get it in. So it's script time for me. I will post it here when finished.

Thank you again. Will still keeping an eye on it. Mykola, specially thank you for doing the patch.

#73 Updated by Mykola Golub 8 months ago

Gonzalo Aguilar Delgado wrote:

Mykola, specially thank you for doing the patch.

I am not the author of the patch. Nitzan is the author.

#74 Updated by Gonzalo Aguilar Delgado 8 months ago

Mykola Golub wrote:

Gonzalo Aguilar Delgado wrote:

Mykola, specially thank you for doing the patch.

I am not the author of the patch. Nitzan is the author.

Then thank you a lot to Nitzan too. I'm still looking to some troubles while restoring data because the OSD I'm using to this task suddenly crashed. Will take a look to it today.

Another strange thing but this time building ceph is that I have to do:
> while true; do nice -n 99 ninja; sleep 1; done
cause it fails building but each time I run ninja it goes a little bit farther so I can continue building it. It's quite pesky. I will open issues with both when I have time.

Thank you again.

#75 Updated by Gonzalo Aguilar Delgado 8 months ago

I suppose this thread can be closed as soon as the fix is in master. But just for reference, in case has something to do. Now it crashes with:
-33> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.e6( v 1492033'3891036 lc 0'0 (1492033'3889036,1492033'3891036] local-lis/les=1389434/1389436 n=521 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389436/1389436/11759 sis=1655034) [0,13,6] r=-1 lpr=1655335 pi=[1389434,1655034)/14 crt=1492033'3891036 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-32> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.96] failed. State was: not registered w/ OSD
-31> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.96] failed. State was: not registered w/ OSD
-30> 2022-04-21T18:09:40.573+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown mbc={}] start_peering_interval up [4,9,13] -> [4,9,13], acting [4,9,13] -> [4,9,13], acting_primary 4 -> 4, up_primary 4 -> 4, role -1 -> -1, features acting 4540138294989422591 upacting 4540138294989422591
-29> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] exit Reset 3.226758 39 0.000147
-28> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] enter Started
-27> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] enter Start
-26> 2022-04-21T18:09:40.573+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
-25> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] exit Start 0.000015 0 0.000000
-24> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-23> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.52] failed. State was: not registered w/ OSD
-22> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.52] failed. State was: not registered w/ OSD
-21> 2022-04-21T18:09:40.577+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [4,10,13] -> [4,10,13], acting [4,10,13] -> [4,10,13], acting_primary 4 -> 4, up_primary 4 -> 4, role -1 -> -1, features acting 4540138294989422591 upacting 4540138294989422591
-20> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] exit Reset 3.168798 39 0.000132
-19> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Started
-18> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Start
-17> 2022-04-21T18:09:40.577+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
-16> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] exit Start 0.000011 0 0.000000
-15> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-14> 2022-04-21T18:09:40.577+0000 7f8b4b943640 -1 osd.3 1655335 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
-13> 2022-04-21T18:09:40.577+0000 7f8b4b943640 1 osd.3 1655335 set_numa_affinity not setting numa affinity
-12> 2022-04-21T18:09:40.597+0000 7f8b4b943640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
-11> 2022-04-21T18:09:40.937+0000 7f8b4793b640 5 prioritycache tune_memory target: 2147483648 mapped: 113278976 unmapped: 269148160 heap: 382427136 old mem: 1020054707 new mem: 1020054730
-10> 2022-04-21T18:09:41.585+0000 7f8b4e949640 1 osd.3 1655335 tick checking mon for new map
-9> 2022-04-21T18:09:41.585+0000 7f8b4e949640 10 monclient: _renew_subs
-8> 2022-04-21T18:09:41.585+0000 7f8b4e949640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
-7> 2022-04-21T18:09:41.633+0000 7f8b4713a640 3 osd.3 1655335 handle_osd_map epochs [1655336,1655339], i have 1655335, src has [1491297,1655339]
-6> 2022-04-21T18:09:41.725+0000 7f8b5339f640 10 monclient: handle_auth_request added challenge on 0x55b4db912400
-5> 2022-04-21T18:09:41.725+0000 7f8b52b9e640 10 monclient: handle_auth_request added challenge on 0x55b4db912000
-4> 2022-04-21T18:09:41.733+0000 7f8b5239d640 10 monclient: handle_auth_request added challenge on 0x55b4db912800
-3> 2022-04-21T18:09:41.733+0000 7f8b5339f640 10 monclient: handle_auth_request added challenge on 0x55b4eb0fb800
-2> 2022-04-21T18:09:41.733+0000 7f8b52b9e640 10 monclient: handle_auth_request added challenge on 0x55b4eb0fbc00
-1> 2022-04-21T18:09:41.737+0000 7f8b5339f640 -1 ** Caught signal (Segmentation fault) *
in thread 7f8b5339f640 thread_name:msgr-worker-0

ceph version 17.0.0-11722-gb40738fffcc (b40738fffcc55824da2bbbbae704c3bb268ceddd) quincy (dev)
1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f8b546ae520]
NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.
0> 2022-04-21T18:09:41.745+0000 7f8b4713a640  3 osd.3 1655339 handle_osd_map epochs [1655340,1655340], i have 1655339, src has [1491297,1655340]
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_t
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_lba_details
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
2/-2 (syslog threshold)
99/99 (stderr threshold)
--
pthread ID / name mapping for recent threads ---
7f8b3911e640 / osd_srv_heartbt
7f8b3991f640 / tp_osd_tp
7f8b3a120640 / tp_osd_tp
7f8b42931640 / ceph-osd
7f8b4713a640 / ms_dispatch
7f8b4793b640 / bstore_mempool
7f8b4813c640 / rocksdb:high0
7f8b4893d640 / rocksdb:low1
7f8b4913e640 / rocksdb:low0
7f8b4b943640 / fn_anonymous
7f8b4e949640 / safe_timer
7f8b4fb87640 / io_context_pool
7f8b5139b640 / io_context_pool
7f8b51b9c640 / admin_socket
7f8b5239d640 / msgr-worker-2
7f8b52b9e640 / msgr-worker-1
7f8b5339f640 / msgr-worker-0
7f8b54578240 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/lib/ceph/crash/2022-04-21T18:09:41.738842Z_6313ea0d-dd17-474d-be59-aad526574534/log
--- end dump of recent events ---
Violación de segmento (`core' generado)

#76 Updated by Gonzalo Aguilar Delgado 8 months ago

Gonzalo Aguilar Delgado wrote:

I suppose this thread can be closed as soon as the fix is in master. But just for reference, in case has something to do. Now it crashes with:
-33> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.e6( v 1492033'3891036 lc 0'0 (1492033'3889036,1492033'3891036] local-lis/les=1389434/1389436 n=521 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389436/1389436/11759 sis=1655034) [0,13,6] r=-1 lpr=1655335 pi=[1389434,1655034)/14 crt=1492033'3891036 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-32> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.96] failed. State was: not registered w/ OSD
-31> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.96] failed. State was: not registered w/ OSD
-30> 2022-04-21T18:09:40.573+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown mbc={}] start_peering_interval up [4,9,13] -> [4,9,13], acting [4,9,13] -> [4,9,13], acting_primary 4 -> 4, up_primary 4 -> 4, role -1 -> -1, features acting 4540138294989422591 upacting 4540138294989422591
-29> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] exit Reset 3.226758 39 0.000147
-28> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] enter Started
-27> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] enter Start
-26> 2022-04-21T18:09:40.573+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
-25> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] exit Start 0.000015 0 0.000000
-24> 2022-04-21T18:09:40.573+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.96( v 1492033'2995206 lc 0'0 (1492033'2993206,1492033'2995206] local-lis/les=1389434/1389435 n=491 ec=1492072/417 lis/c=1389434/1389434 les/c/f=1389435/1389435/11759 sis=1655049) [4,9,13] r=-1 lpr=1655335 pi=[1389434,1655049)/12 crt=1492033'2995206 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-23> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.52] failed. State was: not registered w/ OSD
-22> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 scrub-queue::remove_from_osd_queue removing pg[9.52] failed. State was: not registered w/ OSD
-21> 2022-04-21T18:09:40.577+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [4,10,13] -> [4,10,13], acting [4,10,13] -> [4,10,13], acting_primary 4 -> 4, up_primary 4 -> 4, role -1 -> -1, features acting 4540138294989422591 upacting 4540138294989422591
-20> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] exit Reset 3.168798 39 0.000132
-19> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Started
-18> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Start
-17> 2022-04-21T18:09:40.577+0000 7f8b3a120640 1 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
-16> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] exit Start 0.000011 0 0.000000
-15> 2022-04-21T18:09:40.577+0000 7f8b3a120640 5 osd.3 pg_epoch: 1655335 pg[9.52( v 1492033'2724365 (1492033'2722365,1492033'2724365] local-lis/les=1389428/1389429 n=489 ec=1492072/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1655132) [4,10,13] r=-1 lpr=1655335 pi=[1389428,1655132)/11 crt=1492033'2724365 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-14> 2022-04-21T18:09:40.577+0000 7f8b4b943640 -1 osd.3 1655335 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
-13> 2022-04-21T18:09:40.577+0000 7f8b4b943640 1 osd.3 1655335 set_numa_affinity not setting numa affinity
-12> 2022-04-21T18:09:40.597+0000 7f8b4b943640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
-11> 2022-04-21T18:09:40.937+0000 7f8b4793b640 5 prioritycache tune_memory target: 2147483648 mapped: 113278976 unmapped: 269148160 heap: 382427136 old mem: 1020054707 new mem: 1020054730
-10> 2022-04-21T18:09:41.585+0000 7f8b4e949640 1 osd.3 1655335 tick checking mon for new map
-9> 2022-04-21T18:09:41.585+0000 7f8b4e949640 10 monclient: _renew_subs
-8> 2022-04-21T18:09:41.585+0000 7f8b4e949640 10 monclient: _send_mon_message to mon.blue-compute at v2:172.16.0.119:3300/0
-7> 2022-04-21T18:09:41.633+0000 7f8b4713a640 3 osd.3 1655335 handle_osd_map epochs [1655336,1655339], i have 1655335, src has [1491297,1655339]
-6> 2022-04-21T18:09:41.725+0000 7f8b5339f640 10 monclient: handle_auth_request added challenge on 0x55b4db912400
-5> 2022-04-21T18:09:41.725+0000 7f8b52b9e640 10 monclient: handle_auth_request added challenge on 0x55b4db912000
-4> 2022-04-21T18:09:41.733+0000 7f8b5239d640 10 monclient: handle_auth_request added challenge on 0x55b4db912800
-3> 2022-04-21T18:09:41.733+0000 7f8b5339f640 10 monclient: handle_auth_request added challenge on 0x55b4eb0fb800
-2> 2022-04-21T18:09:41.733+0000 7f8b52b9e640 10 monclient: handle_auth_request added challenge on 0x55b4eb0fbc00
-1> 2022-04-21T18:09:41.737+0000 7f8b5339f640 -1 ** Caught signal (Segmentation fault) *
in thread 7f8b5339f640 thread_name:msgr-worker-0

ceph version 17.0.0-11722-gb40738fffcc (b40738fffcc55824da2bbbbae704c3bb268ceddd) quincy (dev)
1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f8b546ae520]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

0> 2022-04-21T18:09:41.745+0000 7f8b4713a640 3 osd.3 1655339 handle_osd_map epochs [1655340,1655340], i have 1655339, src has [1491297,1655340]
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_t
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_lba_details
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
2/-2 (syslog threshold)
99/99 (stderr threshold)
--
pthread ID / name mapping for recent threads ---
7f8b3911e640 / osd_srv_heartbt
7f8b3991f640 / tp_osd_tp
7f8b3a120640 / tp_osd_tp
7f8b42931640 / ceph-osd
7f8b4713a640 / ms_dispatch
7f8b4793b640 / bstore_mempool
7f8b4813c640 / rocksdb:high0
7f8b4893d640 / rocksdb:low1
7f8b4913e640 / rocksdb:low0
7f8b4b943640 / fn_anonymous
7f8b4e949640 / safe_timer
7f8b4fb87640 / io_context_pool
7f8b5139b640 / io_context_pool
7f8b51b9c640 / admin_socket
7f8b5239d640 / msgr-worker-2
7f8b52b9e640 / msgr-worker-1
7f8b5339f640 / msgr-worker-0
7f8b54578240 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/lib/ceph/crash/2022-04-21T18:09:41.738842Z_6313ea0d-dd17-474d-be59-aad526574534/log
--- end dump of recent events ---
Violación de segmento (`core' generado)

Confirmed. Nothing to do with this bug. Created a new one. Tnx
https://tracker.ceph.com/issues/55407

#77 Updated by Ruben Kerkhof 7 months ago

Can I kindly ask if there's an estimate when this will be fixed and backported? We have customers that have been in trouble for months now due to this issue, and I don't have a solid workaround for them.

#78 Updated by Neha Ojha 7 months ago

  • Status changed from Fix Under Review to Pending Backport

#79 Updated by Backport Bot 7 months ago

  • Copied to Backport #55631: pacific: ceph-osd takes all memory before oom on boot added

#80 Updated by Backport Bot 7 months ago

  • Copied to Backport #55632: quincy: ceph-osd takes all memory before oom on boot added

#81 Updated by Backport Bot 7 months ago

  • Copied to Backport #55633: octopus: ceph-osd takes all memory before oom on boot added

#82 Updated by Konstantin Shalygin 7 months ago

  • Category set to Performance/Resource Usage
  • Target version changed from v16.2.7 to v18.0.0

#83 Updated by Neha Ojha 6 months ago

  • Status changed from Pending Backport to Resolved

#84 Updated by Neha Ojha 6 months ago

  • Status changed from Resolved to In Progress

#85 Updated by Gonzalo Aguilar Delgado 4 months ago

Wow I`m quite surprised to see this is taking so much time to be resolved. >
Can someone do a small recap on what's the status of the issue and what's missing on the resolution, please?

I see there are no backports because some trouble found during QA... Can this be?

#86 Updated by Neha Ojha 4 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID changed from 45529 to 47046

#87 Updated by Neha Ojha 4 months ago

Gonzalo Aguilar Delgado wrote:

Wow I`m quite surprised to see this is taking so much time to be resolved. >
Can someone do a small recap on what's the status of the issue and what's missing on the resolution, please?

I see there are no backports because some trouble found during QA... Can this be?

I had summarized the status of this ticket in an email to the list https://www.spinics.net/lists/dev-ceph/msg04555.html.
The current status is that the online trimming version of the fix(along with the offline version) will be made available in the next Quincy point release. https://github.com/ceph/ceph/pull/47046 is the relevant PR.

#88 Updated by Konstantin Shalygin 3 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Tags set to backport_processed

#89 Updated by Konstantin Shalygin about 1 month ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (backport_processed)

Also available in: Atom PDF