Seconding this feature request. My environment uses an extremely large size=2, min_size=2 pool where I want writes to block in this situation (for data integrity on write), but where I also need clients to still be able to read from the pool while it is degraded (trusting that the one remaining copy is "good enough"), and where going to size=3 would be prohibitively expensive. Back when I deployed the cluster in the Jewel days I definitely got the impression that this was how things worked, and I never actually noticed the problem until Nautilus (what I currently run), though that could just be me mis-remembering or missing the impact. Today, however, it causes a major impact: the blocked reads make my entire cluster effectively unavailable after losing a host to maintenance.
The code itself does explicitly state that min_size is for writes, not reads; see src/common/options.cc:
Option("osd_pool_default_min_size", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
    .set_default(0)
    .set_min_max(0, 255)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("the minimal number of copies allowed to write to a degraded pool for new replicated pools")
    .set_long_description("0 means no specific default; ceph will use size-size/2")
    .add_see_also("osd_pool_default_size")
    .add_service("mon"),
So at the very least this description is wrong. I would, however, prefer for the description to be made correct instead, i.e. for Ceph to allow reads from undersized+degraded+peered PGs.
There's already an option, osd_allow_recovery_below_min_size, which from a cursory check fulfills the same function but for recoveries; I don't imagine it would be very hard to implement a similar option for reads from degraded PGs, e.g. osd_pool_allow_dirty_reads, so that administrators can configure whether they want to enable this behaviour or not.
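For illustration only, here is a minimal sketch of what such an option might look like in src/common/options.cc, following the same builder pattern as the osd_pool_default_min_size entry quoted above; the name osd_pool_allow_dirty_reads, the default, and the descriptions are purely my suggestion and do not exist in Ceph today:

Option("osd_pool_allow_dirty_reads", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
    .set_default(false)  // off by default, so the current blocking behaviour is preserved
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("allow client reads from replicated PGs that are below min_size")
    .set_long_description("If true, PGs that are undersized+degraded+peered may still serve reads, while writes remain blocked until min_size is met.")
    .add_see_also("osd_pool_default_min_size")
    .add_see_also("osd_allow_recovery_below_min_size")
    .add_service("osd"),

How the OSD/PG state machine would actually honour such a flag is a separate question, of course; the sketch is only meant to show that the configuration surface would be small.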