Support #51609

OSD refuses to start (OOMK) due to pg split

Added by Tor Martin Ølberg almost 3 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

After an upgrade from 15.2.4 to 15.2.13, my small home lab cluster ran into issues with OSDs failing on all four hosts. This might be unrelated to the upgrade, but the trigger appears to have been an autoscaling event in which the RBD pool was scaled from 128 PGs to 512 PGs. Unfortunately I didn't notice that PGs were being split before I initiated a reboot of one of the hosts to load the latest Linux kernel.
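For reference, the pool's pg_num and the autoscaler state can be inspected, and further automatic splitting paused while debugging, with something along these lines (the pool name "rbd" here is an assumption):

ceph osd pool autoscale-status
ceph osd pool get rbd pg_num
ceph osd pool set rbd pg_autoscale_mode off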

Only some OSDs are affected, and during the OSD startup the following output can be observed:

2021-07-08T03:57:55.496+0200 7fc7303ff700 10 osd.17 146136 split_pgs splitting pg[5.25( v 146017'38948152 (146011'38947652,146017'38948152] local-lis/les=146012/146013 n=1168 ec=2338/46 lis/c=146012/145792 les/c/f=146013/145793/36878 sis=146019) [17,6] r=0 lpr=146019 pi=[145792,146019)/1 crt=146017'38948152 lcod 0'0 mlcod 0'0 unknown mbc={}] into 5.a5

Exporting/removing the PGs belonging to pool 5 seems to resolve the OOM-kill issue, but (naturally) yields data loss. Pool 5 is the pool that was being split.
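The export/remove was done with ceph-objectstore-tool against the stopped OSD, roughly along these lines (osd.17 and pg 5.25 are taken from the log line above; the data path, export file name, and systemd unit name are just examples and depend on the deployment):

systemctl stop ceph-osd@17
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --pgid 5.25 --op export --file /root/pg-5.25.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --pgid 5.25 --op remove --force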

There isn't a lot of activity in the log (20/20 logging), but everything seems to revolve around splitting PGs. The full OSD startup log is attached.
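The 20/20 logging was enabled via the usual debug_osd option, something like:

ceph config set osd.17 debug_osd 20/20
# or when starting the daemon in the foreground:
ceph-osd -f --id 17 --debug_osd 20/20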

At this point I've exported all the troublesome PGs and gotten all the OSDs back online. Trying to import the PGs again causes the OSD to get OOM-killed on startup once more.
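The re-import attempts were done the same way, roughly (again, file and unit names are examples):

systemctl stop ceph-osd@17
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --op import --file /root/pg-5.25.export
systemctl start ceph-osd@17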

Attempting to start one of the troubled OSDs with the troubled PG present results in all memory (80 GiB) being exhausted before the OOM killer steps in. Looking at a dump of the mempools, buffer_anon looks severely high. Memory leak?

{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 2894923,
                "bytes": 83560688
            },
            "bluestore_cache_data": {
                "items": 228,
                "bytes": 136914
            },
            "bluestore_cache_onode": {
                "items": 214,
                "bytes": 131824
            },
            "bluestore_cache_meta": {
                "items": 8226,
                "bytes": 48900
            },
            "bluestore_cache_other": {
                "items": 571,
                "bytes": 26252
            },
            "bluestore_Buffer": {
                "items": 10,
                "bytes": 960
            },
            "bluestore_Extent": {
                "items": 13,
                "bytes": 624
            },
            "bluestore_Blob": {
                "items": 13,
                "bytes": 1352
            },
            "bluestore_SharedBlob": {
                "items": 13,
                "bytes": 1456
            },
            "bluestore_inline_bl": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing_deferred": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing": {
                "items": 0,
                "bytes": 0
            },
            "bluefs": {
                "items": 421,
                "bytes": 14728
            },
            "bluefs_file_reader": {
                "items": 56,
                "bytes": 5512704
            },
            "bluefs_file_writer": {
                "items": 3,
                "bytes": 672
            },
            "buffer_anon": {
                "items": 16048567,
                "bytes": 65862454806
            },
            "buffer_meta": {
                "items": 1048,
                "bytes": 92224
            },
            "osd": {
                "items": 209,
                "bytes": 2703624
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 24310250,
                "bytes": 2547517232
            },
            "osdmap": {
                "items": 1578,
                "bytes": 92744
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 43266343,
            "bytes": 68502297704
        }
    }
}
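For reference, the dump above was taken from the admin socket while the OSD was still coming up, along the lines of:

ceph daemon osd.17 dump_mempools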

Any guidance on how to further troubleshoot this issue would be greatly appreciated.


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #53729: ceph-osd takes all memory before oom on boot (Resolved, Nitzan Mordechai)
