Project

General

Profile

Actions

Bug #48452

open

pg merge explodes osdmap mempool size

Added by Dan van der Ster over 3 years ago. Updated over 3 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have one cluster with several osds having >500MB osdmap mempools.

Here is one example from today:

            "osdmap": {
                "items": 20479405,
                "bytes": 510621880
            },

The osd is correctly trimming osdmaps; it keeps less than 1k maps:

# ceph daemon osd.528 status
{
    "cluster_fsid": "03dfe28e-ecb1-4d03-b7d7-aac8e172319e",
    "osd_fsid": "84af8d8e-3d63-4ab2-ab0c-1ddfb0c1656e",
    "whoami": 528,
    "state": "active",
    "oldest_map": 200260,
    "newest_map": 201001,
    "num_pgs": 18
}

I traced the mempool usage back in our monitoring and found it started growing on Nov 12. See the attached plot of osdmap mempool bytes and osdmap last committed on Nov 12.

I have sent the ceph log and ceph.audit.log from that day.
ceph-post-file: 5b4985c0-8126-4da9-9d16-b2c336d826aa
ceph-post-file: c86c8952-5748-4d51-97e7-4bea604670e8

You will see that around 08:37 am we started decreasing pg_num for the default.rgw.users.swift pool. The merging took until 10:38, when the osdmaps stopped churning and mempool stopped growing.
But for the next two weeks (until I noticed today) the mempool kept growing. (see attached chart from today)

Seems there is a leaked ref to the osdmap in pg merging.

(P.S. in the audit log you will see that moments before we started the pg merging, we set some nodelete, etc flags on some pools; this may also be the root cause but I was thinking that pg merging is more likely.)


Files

Screenshot from 2020-12-03 19-02-36.png (43.4 KB) Screenshot from 2020-12-03 19-02-36.png Nov 12 2020 Dan van der Ster, 12/03/2020 06:03 PM
Screenshot from 2020-12-03 19-09-15.png (76.8 KB) Screenshot from 2020-12-03 19-09-15.png Dec 3 2020 Dan van der Ster, 12/03/2020 06:09 PM
Actions #1

Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to High
Actions

Also available in: Atom PDF