
Bug #40104

maybe_remove_pg_upmap can be super inefficient for large clusters

Added by xie xingguo 4 months ago. Updated 24 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSDMap
Target version:
-
Start date:
06/01/2019
Due date:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
luminous,mimic,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Report from Tom Byrne, Senior storage system administrator at the Rutherford Appleton Laboratory (RAL), part of STFC:

I wasn’t sure if I had managed to explain my problem well enough, so I would like to explain it below in writing and get your thoughts on it.

I am worried about the amount of time it takes to create a new OSD map on our large cluster (5000 OSDs, 25000 PGs), which uses upmap.

Before we used upmap, creating a new OSD map took 2-3 seconds. After the upmap balancer had balanced the cluster and added ~14000 upmap item entries, creating a new OSD map after any cluster change (stopping an OSD, reweighting an OSD) took about 15 seconds. This is significantly longer and is causing us issues: the monitors hang, and requests are blocked as the cluster continues trying to talk to the down OSDs.

Looking at the logs with debugging turned up while the leader monitor generates a new OSD map, I traced the extra creation time to the maybe_remove_pg_upmap function. It appears that for any change to the cluster, no matter how small, maybe_remove_pg_upmap checks every upmap entry in the OSD map for validity when creating the new map. It seems to do this in a single thread, so the time taken scales with the number of upmap entries.
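As a rough sketch of what I think is happening (illustrative Python, not the actual Ceph C++ code; all names here are made up), a full re-validation pass looks like this — the cost is proportional to the number of upmap entries regardless of how small the cluster change was:

```python
# Toy model of a full-scan upmap cleanup (illustrative only, not Ceph code).
# Each time a new map is built, every pg_upmap entry is re-validated,
# so the work is O(number of upmap entries) even for a one-OSD change.

def build_new_map(upmap_entries, is_valid):
    """Return the surviving entries and how many validity checks ran."""
    checks = 0
    kept = []
    for pg, mapping in upmap_entries.items():
        checks += 1                      # one validity check per entry
        if is_valid(pg, mapping):
            kept.append(pg)
    return kept, checks

# With ~14000 entries, every single map rebuild performs ~14000 checks:
entries = {f"1.{i:x}": [i % 100, (i + 1) % 100] for i in range(14000)}
kept, checks = build_new_map(entries, lambda pg, mapping: True)
```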

Does this sound like the behaviour you expect? I’m not massively familiar with the Ceph codebase so I may be wrong about this.

It seems to me that there are two possible ways to improve the situation:

- Reduce the number of pg_upmaps that have to be checked for removal, perhaps by only checking upmaps on OSDs whose state has changed.
- Parallelize the check across multiple threads, since it currently appears to run single-threaded.

Are either of these options sensible? Or do you think I have a different problem than the one I have described?
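A sketch of the first idea (again illustrative Python with hypothetical names, not a real implementation): keep an index from OSD id to the upmap entries that reference it, so a state change on a few OSDs only touches the entries that mention them rather than all ~14000.

```python
# Illustrative sketch: index pg_upmap entries by the OSDs they reference,
# so only entries touching a changed OSD need re-validation.
from collections import defaultdict

def index_upmaps_by_osd(upmap_entries):
    """Map each OSD id to the set of PGs whose upmap mentions it."""
    by_osd = defaultdict(set)
    for pg, mapping in upmap_entries.items():
        for osd in mapping:
            by_osd[osd].add(pg)
    return by_osd

def entries_to_recheck(by_osd, changed_osds):
    """Only upmaps referencing a changed OSD need re-validation."""
    pgs = set()
    for osd in changed_osds:
        pgs |= by_osd.get(osd, set())
    return pgs

entries = {"1.0": [3, 7], "1.1": [7, 9], "1.2": [4, 5]}
by_osd = index_upmaps_by_osd(entries)
recheck = entries_to_recheck(by_osd, {7})   # OSD 7 changed state
```

With this indexing, the per-change work scales with how many upmaps mention the changed OSDs, not with the total number of upmap entries in the map.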

Thank you,
Tom


Related issues

Copied to Ceph - Backport #40229: luminous: maybe_remove_pg_upmap can be super inefficient for large clusters Resolved
Copied to Ceph - Backport #40230: mimic: maybe_remove_pg_upmap can be super inefficient for large clusters Resolved
Copied to Ceph - Backport #40231: nautilus: maybe_remove_pg_upmap can be super inefficient for large clusters Resolved

History

#1 Updated by xie xingguo 4 months ago

  • Description updated (diff)

#2 Updated by xie xingguo 4 months ago

  • Pull request ID set to 28373

#3 Updated by xie xingguo 3 months ago

  • Status changed from Verified to Pending Backport

#4 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #40229: luminous: maybe_remove_pg_upmap can be super inefficient for large clusters added

#5 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #40230: mimic: maybe_remove_pg_upmap can be super inefficient for large clusters added

#6 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #40231: nautilus: maybe_remove_pg_upmap can be super inefficient for large clusters added

#7 Updated by Nathan Cutler 24 days ago

  • Status changed from Pending Backport to Resolved
