Bug #57460

closed

Json formatted ceph pg dump hangs on large clusters

Added by Ponnuvel P over 1 year ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
low-hanging-fruit
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

PG dump command `ceph pg dump --format json-pretty` hangs on large clusters.

The ceph-mgr daemon hangs and eventually fails over to the other ceph-mgr.

This happens in a cluster with 4391 OSDs and 66675 PGs.

The non-JSON `ceph pg dump` output is about 35M in this cluster, so the JSON-formatted output would be considerably larger (perhaps close to 2G, depending on the number of OSDs).

This was observed on an Octopus (15.2.13) cluster, but I'd expect the same to happen in Pacific or Quincy on any sufficiently large cluster.

One option could be to disable network_ping_times from the Python (ceph-mgr) module, since it is the major component (size-wise) for each of the OSDs.

Actions #1

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Need More Info
  • Tags set to low-hanging-fruit

Have you tried ceph pg dump pgs_brief? It doesn't contain network_ping_times. Perhaps we could switch to it by default (though it would be a change in the operator's interface).

Actions #2

Updated by Ponnuvel P over 1 year ago

I'll check with the user whether pgs_brief works (it obviously doesn't contain network_ping_times, but I want to know whether the pg dump is still too big and/or the mgr fails over).

Instead of changing the interface, perhaps it could just be disabled for PGMap? That would be:

diff --git a/src/mon/PGMap.cc b/src/mon/PGMap.cc
index 190b93bb82..a95f1aed5c 100644
--- a/src/mon/PGMap.cc
+++ b/src/mon/PGMap.cc
@@ -3518,7 +3518,7 @@ int process_pg_map_command(
     if (f) {
       if (what.count("all")) {
        f->open_object_section("pg_map");
-       pg_map.dump(f);
+       pg_map.dump(f, false);
        f->close_section();
       } else if (what.count("summary") || what.count("sum")) {
        f->open_object_section("pg_map");

This isn't too different from what's done for the direct calls from src/mgr/ActivePyModules.cc, and it confines this to just PGMap.
(network_ping_times is still available via other means, such as dump_osd_network).
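
For illustration, here is a minimal, self-contained sketch of the pattern that one-line diff relies on. The types below are stand-ins, not the real ceph::Formatter / PGMap / osd_stat_t classes: a boolean dump argument, defaulting to true, gates the per-OSD network_ping_times subsection, so existing callers keep today's output while the pg-dump path can opt out.

// Illustrative sketch only (hypothetical types, not the real Ceph classes):
// shows the pattern the one-line diff above relies on -- a dump() whose
// boolean argument skips the expensive per-OSD network_ping_times subsection.
#include <iostream>
#include <string>
#include <vector>

struct JsonFormatter {                       // stand-in for ceph::Formatter
  void open_object_section(const std::string& name) { std::cout << name << " {\n"; }
  void close_section() { std::cout << "}\n"; }
  void dump_int(const std::string& key, long v) { std::cout << "  " << key << ": " << v << "\n"; }
};

struct OsdStat {                             // stand-in for osd_stat_t
  int id;
  std::vector<int> hb_peers;

  void dump(JsonFormatter* f, bool with_net) const {
    f->open_object_section("osd_stat");
    f->dump_int("osd", id);
    if (with_net) {                          // the part that grows quadratically
      f->open_object_section("network_ping_times");
      for (int peer : hb_peers)
        f->dump_int("peer", peer);
      f->close_section();
    }
    f->close_section();
  }
};

struct PgMapSketch {                         // stand-in for PGMap
  std::vector<OsdStat> osd_stats;

  // with_net defaults to true, so existing callers keep today's output;
  // process_pg_map_command would pass false, as in the diff above.
  void dump(JsonFormatter* f, bool with_net = true) const {
    for (const auto& s : osd_stats)
      s.dump(f, with_net);
  }
};

int main() {
  JsonFormatter f;
  PgMapSketch pg_map{{{0, {1, 2}}, {1, {0, 2}}}};
  pg_map.dump(&f, false);                    // equivalent of pg_map.dump(f, false)
  return 0;
}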

Actions #3

Updated by Ponnuvel P over 1 year ago

  • Pull request ID set to 48124
Actions #4

Updated by Ponnuvel P over 1 year ago

The user has confirmed that pgs_brief is sufficiently small and, critically, the command doesn't hang and the ceph-mgr doesn't fail over.

I have proposed a change that makes pgs_brief the default. As Radoslaw Zarzynski noted earlier, this is a change in the interface (not a change in the command itself), i.e. a user would now get different output from 'ceph pg dump' than before. The current output would still be available via 'ceph pg dump all'.

Actions #5

Updated by Ponnuvel P over 1 year ago

Some further analysis...

I've deployed Ceph clusters at various OSD counts (from 3 to 20) and looked at how
the output of ceph pg dump --format json-pretty grows.

The sections that increase in size as OSDs grow are:

1. heartbeat peers (hb_peers)
This is just a single additional integer for each additional OSD (for its id).
So for 10k OSDs, it'd only be a ~48k increase (seq 10000 | wc -c = 48894). So we can ignore this.

2. pool stats (pool_statfs)
There is one additional entry per pool in which the OSD participates, so this grows linearly.
Each section is approximately 2800 bytes, so for 10k OSDs it would be an additional ~28M.
While not insignificant, it's still not at alarming levels in terms of size.

3. osd_stats (of which network_ping_times is a sub tree)
The network_ping_times section grows as N^2; to be precise, N * (N-1)/2 where N is the number of heartbeat peers.
Each additional section is about 2100 bytes (it varies in the 1900 - 2500 range but is mostly around 2100).

For a 10k OSD cluster,
If we consider there are 400 heartbeat peers for each OSD, then
400 * (400 - 1)/2 * 2100 ~= 160M
If we consider there are 1200 heartbeat peers for each OSD, then
1200 * (1200 - 1)/2 * 2100 ~= 1440M
If we consider there are 3000 heartbeat peers for each OSD, then
3000 * (3000 - 1)/2 * 2100 ~= 9000M
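
As a quick sanity check on these figures, here is a standalone back-of-envelope calculation (the ~2100 bytes per section is the assumed average from the measurements above):

// Back-of-envelope check of the figures above: peer pairs grow as
// N * (N - 1) / 2, multiplied by an assumed ~2100 bytes per section.
#include <cstdio>

int main() {
  const double bytes_per_section = 2100.0;   // assumed average, per the measurements above
  const int peer_counts[] = {400, 1200, 3000};
  for (int n : peer_counts) {
    double pairs = static_cast<double>(n) * (n - 1) / 2.0;
    double mib = pairs * bytes_per_section / (1024.0 * 1024.0);
    std::printf("%d peers -> %.0f pairs -> ~%.0f MiB\n", n, pairs, mib);
  }
  return 0;
}
// Prints roughly 160 MiB, 1441 MiB and 9009 MiB for 400, 1200 and 3000 peers.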

Clearly network_ping_times is the dominant part in terms of size. So I think it's probably better to:
- disable it, or
- create a new option that includes network stats, while the default can be everything except network_ping_times

This would also avoid changing the default interface/output of ceph pg dump as it is today.
I'd think that ceph pg dump is far more widely used (and/or people are used to its output) than
ceph pg dump --format json-pretty in admin scripts.

For anyone using/parsing the json-pretty option, it's likely they are using a JSON parser, so it wouldn't break anything
unless they specifically read the network_ping_times section. I think disabling network_ping_times is less risky in terms
of regressions/breaking admin scripts.

Actions #6

Updated by Ponnuvel P over 1 year ago

Looking further at `ceph pg dump` to get the total PGs and total heartbeat peers in the user environment where this was noticed,
there are 839640 heartbeat peers in total across ALL the OSDs (simply counting the total HB_PEERS entries, as each OSD has a section for all of its peers).

If we consider each network_ping_times section's size as 2500 bytes, then
839640 * 2500 ~= 2001M

There are ~67k PGs, and each PG section is about 4400 bytes; that's another
4400 * 67000 ~= 281M

pool_statfs section would be
2800 * 4391 ~= 11M

There are a few other sections, but they don't grow as OSDs increase; they amount to several KBs in total.

So for this Ceph cluster:

- If it were to work OK, `ceph pg dump --format json-pretty` would be in the 2300M - 2400M range.

- If we disable just network_ping_times, then it'd still be in the 300M - 400M range.
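
The same arithmetic as a standalone sketch (the per-section byte sizes are the assumed averages quoted above, not exact measurements):

// Back-of-envelope calculation reproducing the estimates above for the
// affected cluster (per-section sizes are the assumed averages quoted above).
#include <cstdio>

int main() {
  const double mib = 1024.0 * 1024.0;
  double ping  = 839640.0 * 2500.0;   // network_ping_times: total HB peer entries
  double pgs   =  67000.0 * 4400.0;   // per-PG sections
  double pools =   4391.0 * 2800.0;   // pool_statfs per OSD
  std::printf("network_ping_times ~ %.0f MiB\n", ping / mib);
  std::printf("pg sections        ~ %.0f MiB\n", pgs / mib);
  std::printf("pool_statfs        ~ %.0f MiB\n", pools / mib);
  std::printf("total ~ %.0f MiB, without ping times ~ %.0f MiB\n",
              (ping + pgs + pools) / mib, (pgs + pools) / mib);
  return 0;
}
// Prints ~2002, ~281 and ~12 MiB; total ~2295 MiB, ~293 MiB without ping times.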

Actions #7

Updated by Radoslaw Zarzynski over 1 year ago

Hi Ponnuvel! Thanks for the analysis. The problem is genuine and the quadratic explosion is simply a no-no.
I just wonder how to fix it without changing the behavior for already deployed clusters (we don't want to surprise operators). One way might be to differentiate the behavior based on a feature flag. As we just branched the v18 Reef, you could piggyback on one of its flags (e.g. SERVER_REEF).
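
To sketch the feature-flag idea (purely illustrative; the flag value and function names below are made up, not the real Ceph feature machinery): the slimmer default would only kick in once the whole quorum advertises the new feature bit, so already deployed or mid-upgrade clusters keep the current output.

// Hypothetical, self-contained illustration (not real Ceph APIs) of the
// feature-flag idea: only switch to the slimmer default once the whole
// quorum reports the new (e.g. Reef) feature bit, so behaviour does not
// change mid-upgrade for already deployed clusters.
#include <cstdint>
#include <iostream>

constexpr uint64_t FEATURE_SERVER_REEF = 1ULL << 5;   // made-up bit for the sketch

bool have_feature(uint64_t quorum_features, uint64_t bit) {
  return (quorum_features & bit) == bit;
}

void dump_pg_map(uint64_t quorum_features) {
  if (have_feature(quorum_features, FEATURE_SERVER_REEF)) {
    std::cout << "dump without network_ping_times (new default)\n";
  } else {
    std::cout << "legacy full dump (pre-Reef behaviour preserved)\n";
  }
}

int main() {
  dump_pg_map(0);                     // mixed/old quorum: keep old output
  dump_pg_map(FEATURE_SERVER_REEF);   // fully upgraded quorum: new default
  return 0;
}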

Actions #8

Updated by Ponnuvel P over 1 year ago

Thanks, Radoslaw! I'll look into modifying the patch as you suggested, targeting Reef.

Actions #9

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Need More Info to Fix Under Review
Actions #10

Updated by Ponnuvel P about 1 year ago

  • Assignee set to Ponnuvel P
Actions #11

Updated by Ponnuvel P 5 months ago

This was discussed in the CDM (https://tracker.ceph.com/projects/ceph/wiki/CDM_06-DEC-2023) and the conclusion was:
- It is considered a scalability/performance bug
- The solution is to remove the `network_ping_times` subsection from the pg dump JSON output
- We will add a release note about this
- No deprecation period, since the change is introduced in a new release

So we'll drop the biggest component `network_ping_times` altogether and not change the current behaviour of `ceph pg dump`.

This would mean silently changing the behaviour of `ceph pg dump --format json-pretty`. But this is
considered less of an issue, as `network_ping_times` doesn't seem to be heavily utilised by test
scripts. It's possible it's used in the wild by users; to address that, we'll clearly communicate
this change in behaviour in the Squid release notes -- it's a breaking change.

I'll rework the patch to drop `network_ping_times` from the JSON output.
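
A rough, hypothetical sketch of that direction (not the actual change in PR 54922; all names below are made up): the pg-dump path simply stops emitting the network_ping_times subsection, while a dedicated dump, analogous to dump_osd_network, keeps the data reachable for anyone who still needs it.

// Hypothetical sketch of the agreed direction (not the real PR 54922 code):
// the pg-dump formatter path no longer emits network_ping_times at all,
// while a separate dump keeps exposing the same data on request.
#include <iostream>
#include <map>

struct PingStats {
  std::map<int, double> last_rtt_ms;   // peer osd id -> last round-trip time
};

// pg dump path: per-OSD stats only, no ping subsection any more.
void dump_osd_stat_for_pg_dump(int osd_id, std::ostream& out) {
  out << "{ \"osd\": " << osd_id << ", \"kb_used\": 0 }\n";
}

// dedicated path (analogous to `dump_osd_network`): ping data stays reachable.
void dump_osd_network(int osd_id, const PingStats& ping, std::ostream& out) {
  out << "{ \"osd\": " << osd_id << ", \"network_ping_times\": [";
  bool first = true;
  for (const auto& [peer, rtt] : ping.last_rtt_ms) {
    if (!first) out << ", ";
    first = false;
    out << "{ \"peer\": " << peer << ", \"rtt_ms\": " << rtt << " }";
  }
  out << "] }\n";
}

int main() {
  PingStats ping{{{1, 0.4}, {2, 0.7}}};
  dump_osd_stat_for_pg_dump(0, std::cout);   // slim pg dump output
  dump_osd_network(0, ping, std::cout);      // full ping data on request
  return 0;
}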

Actions #12

Updated by Ponnuvel P 5 months ago

  • Pull request ID changed from 48124 to 54922
Actions #13

Updated by Radoslaw Zarzynski 3 months ago

  • Status changed from Fix Under Review to Resolved