maybe_remove_pg_upmap can be super inefficient for large clusters
Report from Tom Byrne, Senior storage system administrator at the Rutherford Appleton Laboratory (RAL), part of STFC:
I wasn’t sure if I had managed to explain my problem well enough, so I would like to explain it below in writing and get your thoughts on it.
I am worried about the amount of time it takes to create a new OSD map on our large cluster (5000 OSDs, 25000 PGs), which uses upmap.
Before we used upmap, our OSD map creation time was 2-3 seconds. After the upmap balancer had balanced the cluster and added ~14000 upmap item entries, OSD map creation after any cluster change (stopping an OSD, reweighting an OSD) took about 15 seconds. This is significantly longer, and it is causing us issues: the monitors hang, and requests block while the cluster continues trying to talk to the down OSDs.
Looking at the logs with debugging turned up while the leader monitor generates a new OSD map, I traced the extra OSD map creation time to the maybe_remove_pg_upmap function. It appears that for any change to the cluster, no matter how small, maybe_remove_pg_upmap checks every upmap entry in the OSD map for validity when creating the new map. It seems to do this in a single thread, so the time taken scales with the number of upmap entries.
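For illustration, here is a minimal model of that behaviour (a hypothetical simplification, not the actual Ceph code): on every map change, a single loop scans every pg_upmap entry and drops those referencing an OSD that is no longer usable, so the cost grows linearly with the number of entries regardless of how small the change was.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical simplification of maybe_remove_pg_upmap: on every new
// OSD map, scan EVERY pg_upmap entry and drop those that reference a
// down OSD. Cost is O(#upmap entries) per map change, in one thread,
// which matches the scaling described above.
using PgId = int;
using OsdId = int;

std::size_t prune_upmaps(std::map<PgId, std::vector<OsdId>>& pg_upmap_items,
                         const std::set<OsdId>& down_osds) {
    std::size_t removed = 0;
    for (auto it = pg_upmap_items.begin(); it != pg_upmap_items.end();) {
        bool invalid = false;
        for (OsdId osd : it->second) {
            if (down_osds.count(osd)) {  // entry maps a PG onto a down OSD
                invalid = true;
                break;
            }
        }
        if (invalid) {
            it = pg_upmap_items.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

With ~14000 entries this loop runs in full for every new map, even if only one OSD changed state.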
Does this sound like the behaviour you expect? I’m not massively familiar with the Ceph codebase so I may be wrong about this.
It seems to me that there are two possible ways to improve the situation:
- Reduce the number of pg_upmap entries that have to be checked for removal. Possibly only check for removal on OSDs that have changed state?
- Parallelise the validity check across multiple threads, since it currently appears to run in a single thread.
Is either of these options sensible? Or do you think the problem is different from what I have described?
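The first suggestion could be sketched as follows (again hypothetical, not Ceph's actual data structures): maintain a reverse index from OSD id to the PGs whose upmap entries mention it, so a state change on one OSD only triggers checks on the handful of entries that could be affected, rather than all ~14000.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch of "only check entries on OSDs that changed
// state": a reverse index OSD -> PGs limits the validity check to
// entries that reference a changed OSD.
using PgId = int;
using OsdId = int;

struct UpmapIndex {
    std::map<PgId, std::vector<OsdId>> pg_upmap_items;
    std::unordered_map<OsdId, std::set<PgId>> osd_to_pgs;  // reverse index

    void add(PgId pg, std::vector<OsdId> osds) {
        for (OsdId o : osds) osd_to_pgs[o].insert(pg);
        pg_upmap_items[pg] = std::move(osds);
    }

    // Remove only the entries that touch OSDs whose state changed;
    // entries on untouched OSDs are never examined.
    std::size_t prune_for_changed(const std::set<OsdId>& changed) {
        std::set<PgId> candidates;
        for (OsdId o : changed) {
            auto it = osd_to_pgs.find(o);
            if (it != osd_to_pgs.end())
                candidates.insert(it->second.begin(), it->second.end());
        }
        for (PgId pg : candidates) {
            for (OsdId o : pg_upmap_items[pg]) osd_to_pgs[o].erase(pg);
            pg_upmap_items.erase(pg);
        }
        return candidates.size();
    }
};
```

The trade-off is extra memory and bookkeeping on entry insertion, in exchange for map-update cost proportional to the number of changed OSDs instead of the total number of upmap entries.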