Bug #53729


Hi, I cannot boot half of my OSDs; all of them die, killed by the OOM killer.

It seems they are eating all the memory. Everything goes fine until the "log_to_monitors" line appears in the OSD log; then the process takes all available memory plus swap.
I added a big swapfile to let it run:

It seems similar to this bug: https://tracker.ceph.com/issues/51609 but I don't see any splitting going on.

<pre>
MiB Mem :  15905.5 total,    203.3 free,  15511.0 used,    191.2 buff/cache
MiB Swap:  35630.0 total,  17065.5 free,  18564.5 used.    130.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 222720 ceph      20   0   34.5g  13.9g      0 S   2.9  89.7   2:31.37 ceph-osd

</pre>
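
In case it helps anyone reproduce the workaround, a swapfile like that can be created with something along these lines (the /swapfile path and the 36G size here are just placeholders, not necessarily what I used):

<pre><code class="bash">
# Example only: create and enable a large swapfile; path and size are placeholders.
fallocate -l 36G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
swapon --show   # verify it is active
</code></pre>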



Last lines of the log are:

<pre>
2021-12-25T22:09:11.078+0100 7fdc2b4a7640 1 osd.6 pg_epoch: 1492087 pg[9.e( v 1492033'2437243 lc 0'0 (1491849'2435049,1492033'2437243] local-lis/les=1389428/1389429 n=521 ec=417/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1492087) [] r=-1 lpr=1492087 pi=[1389428,1492087)/1 crt=1492033'2437243 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [13] -> [], acting [13] -> [], acting_primary 13 -> -1, up_primary 13 -> -1, role -1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2021-12-25T22:09:11.146+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:b31b205f:::osdmap.1492088:0# 0~11375/11375
2021-12-25T22:09:11.234+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:b99b205f:::osdmap.1492089:0# 0~11375/11375
2021-12-25T22:09:11.358+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:bf7b205f:::osdmap.1492090:0# 0~11375/11375
2021-12-25T22:09:11.442+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:b1fb205f:::osdmap.1492091:0# 0~11375/11375
2021-12-25T22:09:11.498+0100 7fdc2b4a7640 10 filestore(/var/lib/ceph/osd/ceph-6) read(3471): meta/#-1:bafb205f:::osdmap.1492092:0# 0~11375/11375
2021-12-25T22:09:11.606+0100 7fdc2b4a7640 1 osd.6 pg_epoch: 1492093 pg[9.e( v 1492033'2437243 lc 0'0 (1491849'2435049,1492033'2437243] local-lis/les=1389428/1389429 n=521 ec=417/417 lis/c=1389428/1389428 les/c/f=1389429/1389429/11759 sis=1492087) [] r=-1 lpr=1492087 pi=[1389428,1492087)/1 crt=1492033'2437243 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2021-12-25T22:09:11.634+0100 7fdc2b4a7640 5 filestore(/var/lib/ceph/osd/ceph-6) queue_transactions(2324): osr 0x5640b8bb1b50 osr(9.e_head)
2021-12-25T22:21:12.083+0100 7fdc69ffb640 10 filestore(/var/lib/ceph/osd/ceph-6) sync_entry(4271): commit took 1.194381356s, interval was 1001.232788086s

</pre>
Then it becomes very slow and it finally fails...

When it starts to consume memory I can see the log_to_monitors line in the console...

<pre>
2021-12-25T22:04:28.407+0100 7fdc843ed2c0 -1 Falling back to public interface
2021-12-25T22:04:30.943+0100 7fdc843ed2c0 -1 journal do_read_entry(2006315008): bad header magic
2021-12-25T22:04:30.943+0100 7fdc843ed2c0 -1 journal do_read_entry(2006315008): bad header magic
2021-12-25T22:05:26.490+0100 7fdc843ed2c0 -1 osd.6 1492093 log_to_monitors {default=true}

</pre>
Version here is:
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)

Mempools...
...

<pre><code class="json">
{
"mempool": {
"by_pool": {
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 0,
"bytes": 0
},
"bluestore_writing_deferred": {
"items": 0,
"bytes": 0
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 0,
"bytes": 0
},
"bluefs_file_reader": {
"items": 0,
"bytes": 0
},
"bluefs_file_writer": {
"items": 0,
"bytes": 0
},
"buffer_anon": {
"items": 6790059,
"bytes": 27660504904
},
"buffer_meta": {
"items": 8906,
"bytes": 783728
},
"osd": {
"items": 188,
"bytes": 2126656
},
"osd_mapbl": {
"items": 52,
"bytes": 631590
},
"osd_pglog": {
"items": 23889940,
"bytes": 2541361472
},
"osdmap": {
"items": 8479,
"bytes": 726184
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 30697624,
"bytes": 30206134534
}
}
}
</code></pre>
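
For what it's worth, sorting the dump above (taken with something like ceph daemon osd.6 dump_mempools) by bytes shows where the memory goes: buffer_anon alone is about 25.8 GiB and osd_pglog about 2.4 GiB (with ~23.9 million items) out of the ~28 GiB total. A rough jq sketch, assuming the usual mempool/by_pool wrapper around the entries shown above:

<pre><code class="bash">
# Rough sketch: show the top mempool consumers by size.
# Assumes the standard "mempool" -> "by_pool" layout of dump_mempools.
ceph daemon osd.6 dump_mempools \
  | jq '.mempool.by_pool | to_entries | sort_by(-.value.bytes) | .[0:5]
        | map({pool: .key, bytes: .value.bytes, items: .value.items})'
</code></pre>

So it is not the usual caches; almost everything sits in buffer_anon plus the pg log.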



The OSDs that are failing are the ones with FileStore + LevelDB. That's weird... maybe it's related to that backend... To double-check that, see the sketch below.
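
Something like this should list the backend each OSD reports (just a sketch; the jq filter is my own):

<pre><code class="bash">
# Sketch: print each OSD id and its object store backend (filestore / bluestore).
ceph osd metadata | jq -r '.[] | "\(.id) \(.osd_objectstore)"'
</code></pre>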
