Bug #40104 (closed)

maybe_remove_pg_upmap can be super inefficient for large clusters

Added by xie xingguo almost 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: OSDMap
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport: luminous, mimic, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 28373
Crash signature (v1):
Crash signature (v2):

Description

Report from Tom Byrne, Senior storage system administrator at the Rutherford Appleton Laboratory (RAL), part of STFC:

I wasn’t sure if I had managed to explain my problem well enough, so I would like to explain it below in writing and get your thoughts on it.

I am worried about the amount of time it takes to create a new OSD map on our large cluster (5000 OSDs, 25000 PGs), which uses upmap.

Before we used upmap, our OSD map creation time was 2-3 seconds. After the upmap balancer had balanced the cluster and added ~14000 upmap item entries, creating a new OSD map after any cluster change (stopping an OSD, reweighting an OSD) took about 15 seconds. This is significantly longer and is causing us problems: the monitors hang, and requests become blocked while the cluster keeps trying to talk to the down OSDs.

Looking at the logs with debugging turned up while the leader monitor generates a new OSD map, I traced the extra OSD map creation time to the maybe_remove_pg_upmap function. It appears that for any change to the cluster, no matter how small, maybe_remove_pg_upmap checks every upmap entry in the OSD map for validity when creating the new map. It seems to do this in a single thread, so the time taken scales with the number of upmap entries.

Does this sound like the behaviour you expect? I’m not massively familiar with the Ceph codebase so I may be wrong about this.
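
To make the behaviour described above concrete, here is a minimal sketch of that kind of single-threaded, full-table validity scan. The types and the remap_still_valid helper are hypothetical simplifications for illustration, not the real OSDMap or CRUSH interfaces from the Ceph tree.

    // Illustrative sketch only: hypothetical, simplified types, not the real
    // OSDMap / CRUSH interfaces.
    #include <cstdint>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    using pg_t  = std::pair<uint64_t, uint32_t>;   // (pool id, pg seed)
    using osd_t = int32_t;

    struct UpmapItem { osd_t from, to; };          // remap "from" -> "to"

    // Stand-in for the per-entry validity check.  In a real cluster this is a
    // CRUSH computation per upmap entry, which is what makes the scan costly.
    static bool remap_still_valid(const std::vector<UpmapItem>& items,
                                  const std::set<osd_t>& up_osds)
    {
      for (const auto& i : items)
        if (!up_osds.count(i.to))                  // remap target no longer usable
          return false;
      return true;
    }

    // The behaviour described above: on every new map, walk every upmap entry
    // and re-validate it in a single thread, so the full cost is paid even
    // when only a single OSD changed state.
    static void maybe_remove_pg_upmaps_naive(
        std::map<pg_t, std::vector<UpmapItem>>& pg_upmap_items,
        const std::set<osd_t>& up_osds)
    {
      for (auto it = pg_upmap_items.begin(); it != pg_upmap_items.end(); ) {
        if (!remap_still_valid(it->second, up_osds))
          it = pg_upmap_items.erase(it);           // entry no longer applies
        else
          ++it;
      }
    }

With ~14000 entries, the loop runs its per-entry check ~14000 times for every new map, regardless of how small the triggering change was.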

It seems to me that there are two possible ways to improve the situation:

- Reduce the number of pg_upmaps that have to be checked for removal. Possibly only check for removal on OSDs that have changed state?

- Parallelise the checking of upmap entries, so the single-threaded scan is no longer the bottleneck.

Is either of these options sensible? Or do you think the problem is different from the one I have described?

Thank you,
Tom
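
As an aside on the first option suggested above (only re-checking entries affected by a state change), here is a sketch of what that filter could look like, reusing the hypothetical pg_t / UpmapItem / remap_still_valid definitions from the previous sketch. It illustrates the direction of the suggestion, not an actual patch against the Ceph tree.

    // Only re-validate upmap entries that reference an OSD whose state changed
    // in this map update, so the cost scales with the size of the change
    // rather than with the total number of upmap entries.
    static void maybe_remove_pg_upmaps_incremental(
        std::map<pg_t, std::vector<UpmapItem>>& pg_upmap_items,
        const std::set<osd_t>& up_osds,
        const std::set<osd_t>& changed_osds)       // OSDs touched by this epoch
    {
      for (auto it = pg_upmap_items.begin(); it != pg_upmap_items.end(); ) {
        bool touches_changed = false;
        for (const auto& i : it->second) {
          if (changed_osds.count(i.from) || changed_osds.count(i.to)) {
            touches_changed = true;
            break;
          }
        }
        if (touches_changed && !remap_still_valid(it->second, up_osds))
          it = pg_upmap_items.erase(it);           // entry no longer applies
        else
          ++it;
      }
    }

Whether filtering on the OSDs named in an entry is sufficient in practice depends on how far a given map change can shift the underlying CRUSH mappings; the sketch only shows the shape of the idea.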


Related issues (3): 0 open, 3 closed

- Copied to Ceph - Backport #40229: luminous: maybe_remove_pg_upmap can be super inefficient for large clusters (Resolved, xie xingguo)
- Copied to Ceph - Backport #40230: mimic: maybe_remove_pg_upmap can be super inefficient for large clusters (Resolved, xie xingguo)
- Copied to Ceph - Backport #40231: nautilus: maybe_remove_pg_upmap can be super inefficient for large clusters (Resolved, Nathan Cutler)
#1 - Updated by xie xingguo almost 5 years ago

  • Description updated (diff)

#2 - Updated by xie xingguo almost 5 years ago

  • Pull request ID set to 28373

#3 - Updated by xie xingguo almost 5 years ago

  • Status changed from 12 to Pending Backport

#4 - Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #40229: luminous: maybe_remove_pg_upmap can be super inefficient for large clusters added

#5 - Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #40230: mimic: maybe_remove_pg_upmap can be super inefficient for large clusters added

#6 - Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #40231: nautilus: maybe_remove_pg_upmap can be super inefficient for large clusters added

#7 - Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved