Feature #53746
open
allow reading from replicas/shards less than min_size
Description
Hello all,
It would be a nice improvement to Ceph to allow reading when the number of available replicas/shards is less than min_size (but not 0).
At the moment the documentation accurately says "min_size - Sets the minimum number of replicas required for I/O." This means writes and reads.
However, AFAIK there should be no data corruption concerns with READING when replicas/shards are less than min_size.
Also, it would make the cluster much more available under some circumstances. In our case, a drive failure and a node failure occurring at the same time, with 3 replicas and min_size 2, left some files inaccessible for days. (We have large drives relative to the network speed.) Being able to access these files read-only from the remaining replica would have been splendid.
Thanks for your work!
C.
Updated by Joshua Boniface about 1 year ago
Seconding this feature request. My environment makes use of an extremely large size=2,min_size=2 pool where I want writes to block in such a case (for data integrity on write), but where I also need clients to still be able to read data from the pool when degraded (trusting the one copy is "good enough"), and where going to size=3 would be extremely prohibitive.
Back when I deployed the cluster in the Jewel days, I definitely got the impression that this was how things work, and I actually never noticed the problem until Nautilus (what I currently run), though that could just be me mis-remembering or missing the impact. But today it does cause a major impact as the blocked reads make my entire cluster effectively unavailable after losing a host to maintenance.
The code itself does explicitly state that min_size is for writes, not reads, i.e. in src/common/options.cc:
Option("osd_pool_default_min_size", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
    .set_default(0)
    .set_min_max(0, 255)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("the minimal number of copies allowed to write to a degraded pool for new replicated pools")
    .set_long_description("0 means no specific default; ceph will use size-size/2")
    .add_see_also("osd_pool_default_size")
    .add_service("mon"),
So at the very least this description does not match the actual behavior, since reads are blocked as well. However, I would myself prefer it to be correct, and for Ceph to allow reads from undersized+degraded+peered PGs.
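For reference, the "size-size/2" default quoted in the long description can be worked through with integer division. The helper below is purely illustrative (default_min_size is a made-up name, not a Ceph function) and assumes the rule is applied exactly as the description words it:

```cpp
#include <cstdint>

// Illustrative only, not Ceph code: mirrors the "size - size/2" rule from
// the option's long description, where size/2 is integer division.
constexpr std::uint32_t default_min_size(std::uint32_t size) {
    return size - size / 2;
}

// Worked values for common replica counts.
static_assert(default_min_size(2) == 1, "size=2 -> min_size=1");
static_assert(default_min_size(3) == 2, "size=3 -> min_size=2");
static_assert(default_min_size(4) == 2, "size=4 -> min_size=2");
```

Under this rule a size=3 pool would default to min_size=2 and a size=2 pool to min_size=1, so a size=2,min_size=2 pool is an explicit, stricter choice than the default.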
There's already the option osd_allow_recovery_below_min_size, which from a cursory check fulfills the same function but for recoveries; I don't imagine it would be very hard to implement a similar option, e.g. osd_pool_allow_dirty_reads, for reads from degraded PGs, so that administrators can configure whether they want to enable this or not.
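As a rough sketch of the behavior being requested, assuming a boolean switch along the lines of the osd_pool_allow_dirty_reads suggestion: writes keep blocking below min_size, while reads are served as long as at least one replica remains. Everything here (PoolPolicy, io_allowed, allow_reads_below_min_size) is hypothetical and is not how Ceph's peering code is actually structured:

```cpp
#include <cstdint>

enum class IoKind { Read, Write };

// Hypothetical policy for a replicated pool; none of these names exist in Ceph.
struct PoolPolicy {
    std::uint32_t min_size;           // minimum replicas required for normal I/O
    bool allow_reads_below_min_size;  // assumed opt-in switch, off by default
};

// Decide whether an I/O may proceed with `available` replicas.
// - No replicas: nothing can be served.
// - At or above min_size: reads and writes both proceed (current behavior).
// - Below min_size: only reads proceed, and only if the admin opted in.
// Erasure-coded pools would additionally need enough shards to reconstruct
// the data, which this sketch does not model.
constexpr bool io_allowed(const PoolPolicy& p, std::uint32_t available, IoKind kind) {
    return available != 0 &&
           (available >= p.min_size ||
            (kind == IoKind::Read && p.allow_reads_below_min_size));
}
```

With min_size=2 and one surviving replica, io_allowed would return true for reads only when the switch is enabled, and false for writes either way, which matches the "block writes, keep reads" behavior described above.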