Bug #37940
Closed
upmap balancer won't refill underfull osds if zero overfull found
Description
The following was seen on v12.2.10.
One pool has been upmap balanced for a while, so there are now zero overfull osds. But there is a new osd (557) which stays severely underfull forever.
Here is the relevant log in calc_pg_upmaps:
2019-01-16 14:17:49.103173 7f203d36d700 20 osd.553 pgs 85 target 85.5565 deviation -0.556458
2019-01-16 14:17:49.103176 7f203d36d700 20 osd.554 pgs 58 target 57.0376 deviation 0.962406
2019-01-16 14:17:49.103178 7f203d36d700 20 osd.555 pgs 57 target 57.0376 deviation -0.0375938
2019-01-16 14:17:49.103180 7f203d36d700 20 osd.556 pgs 58 target 57.0376 deviation 0.962406
2019-01-16 14:17:49.103183 7f203d36d700 20 osd.557 pgs 24 target 57.0376 deviation -33.0376
2019-01-16 14:17:49.103185 7f203d36d700 20 osd.558 pgs 58 target 57.0376 deviation 0.962406
2019-01-16 14:17:49.103188 7f203d36d700 20 osd.559 pgs 58 target 57.0376 deviation 0.962406
2019-01-16 14:17:49.103190 7f203d36d700 20 osd.560 pgs 58 target 57.0376 deviation 0.962406
2019-01-16 14:17:49.103192 7f203d36d700 20 osd.561 pgs 58 target 57.0376 deviation 0.962406
2019-01-16 14:17:49.103194 7f203d36d700 20 osd.562 pgs 58 target 57.0376 deviation 0.962406
2019-01-16 14:17:49.103197 7f203d36d700 20 osd.563 pgs 57 target 57.0376 deviation -0.0375938
2019-01-16 14:17:49.103199 7f203d36d700 20 osd.564 pgs 57 target 57.0376 deviation -0.0375938
2019-01-16 14:17:49.103201 7f203d36d700 20 osd.565 pgs 57 target 57.0376 deviation -0.0375938
2019-01-16 14:17:49.103203 7f203d36d700 20 osd.566 pgs 57 target 57.0376 deviation -0.0375938
2019-01-16 14:17:49.103206 7f203d36d700 20 osd.567 pgs 56 target 57.0376 deviation -1.03759
2019-01-16 14:17:49.103208 7f203d36d700 20 osd.568 pgs 57 target 57.0376 deviation -0.0375938
2019-01-16 14:17:49.103217 7f203d36d700 20 osd.569 pgs 57 target 57.0376 deviation -0.0375938
2019-01-16 14:17:49.103223 7f203d36d700 10 total_deviation 144.336 overfull underfull [557,466,469,471,478,483,485,542,567]
2019-01-16 14:17:49.104091 7f203d36d700 10 start deviation 144.336
2019-01-16 14:17:49.104096 7f203d36d700 10 end deviation 144.336
In this case, we have no overfull osds and one underfull osd, 557, with a large negative deviation. (The handful of other underfull osds have deviations just under minus 1.)
But because of this break, that single underfull osd is never re-filled:
@@ -4088,8 +4088,8 @@ int OSDMap::calc_pg_upmaps(
     if (overfull.empty() || underfull.empty())
       break;
One way to fix this would be to populate overfull more aggressively:
diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc
index 2bb8beb94e..51bc4e7bdf 100644
--- a/src/osd/OSDMap.cc
+++ b/src/osd/OSDMap.cc
@@ -4067,7 +4067,7 @@ int OSDMap::calc_pg_upmaps(
                     << dendl;
       osd_deviation[i.first] = deviation;
       deviation_osd.insert(make_pair(deviation, i.first));
-      if (deviation >= 1.0)
+      if (deviation >= 0.5) // magic number, maybe 0.1 is better, maybe a configurable
         overfull.insert(i.first);
       total_deviation += abs(deviation);
     }
This way, the balancing would continue as long as there are underfull osds.
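To make the effect of the threshold concrete, here is a minimal standalone sketch, not the Ceph code itself. It classifies the deviations from the log excerpt and applies the same empty-list guard. The names `Candidates`, `classify`, `can_optimize` and `log_deviations` are hypothetical, and the underfull cutoff of -0.999 is an assumption inferred from the log (osd.567 at -1.03759 is listed as underfull, osd.553 at -0.556458 is not).

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical holder for the two candidate sets; not a Ceph type.
struct Candidates {
  std::set<int> overfull;
  std::set<int> underfull;
};

// Classify OSDs by deviation (PG count minus target).
// overfull_threshold is the value the proposed patch changes (1.0 -> 0.5).
// underfull_threshold is an ASSUMED cutoff near -1, inferred from the log.
Candidates classify(const std::map<int, double>& osd_deviation,
                    double overfull_threshold,
                    double underfull_threshold = -0.999) {
  assert(overfull_threshold > 0 && underfull_threshold < 0);
  Candidates c;
  for (const auto& p : osd_deviation) {
    if (p.second >= overfull_threshold)
      c.overfull.insert(p.first);
    else if (p.second < underfull_threshold)
      c.underfull.insert(p.first);
  }
  return c;
}

// Same condition as the guard quoted in this report, inverted:
// remapping is only attempted when both sets are non-empty.
bool can_optimize(const Candidates& c) {
  return !c.overfull.empty() && !c.underfull.empty();
}

// Deviations copied from the log excerpt (osd.553 through osd.569 only).
std::map<int, double> log_deviations() {
  return {
    {553, -0.556458},  {554, 0.962406},   {555, -0.0375938},
    {556, 0.962406},   {557, -33.0376},   {558, 0.962406},
    {559, 0.962406},   {560, 0.962406},   {561, 0.962406},
    {562, 0.962406},   {563, -0.0375938}, {564, -0.0375938},
    {565, -0.0375938}, {566, -0.0375938}, {567, -1.03759},
    {568, -0.0375938}, {569, -0.0375938},
  };
}
```

With these values, `classify(log_deviations(), 1.0)` leaves `overfull` empty, so `can_optimize` returns false even though osd.557 is 33 PGs short. With a threshold of 0.5, the seven osds at deviation 0.962406 become overfull candidates and `can_optimize` returns true.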
I can imagine a similar scenario with few outlier overfull and zero underfull osds, but I haven't seen that in the wild yet.
Thoughts?