Feature #53746

open

allow reading from replicas/shards less than min_size

Added by c sights over 2 years ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hello all,
It would be a nice improvement to Ceph to allow reads when the number of available replicas/shards is less than min_size (but not 0).

At the moment the documentation accurately says "min_size - Sets the minimum number of replicas required for I/O." This applies to both writes and reads.

However, AFAIK there should be no data corruption concerns with READING when replicas/shards are less than min_size.

Also, it would make the cluster much more available under some circumstances. In our case, a drive failure and a node failure occurred at the same time on a pool with 3 replicas and min_size 2, which left some files inaccessible for days. (We have large drives relative to the network speed.) Being able to access these files read-only from the remaining replica would have been splendid.

Thanks for your work!
C.

Actions #1

Updated by Joshua Boniface about 1 year ago

Seconding this feature request. My environment makes use of an extremely large size=2,min_size=2 pool where I want writes to block in such a case (for data integrity on write), but where I also need clients to still be able to read data from the pool when degraded (trusting that the one copy is "good enough"), and where going to size=3 would be prohibitively expensive. Back when I deployed the cluster in the Jewel days, I definitely got the impression that this was how things worked, and I never actually noticed the problem until Nautilus (which I currently run), though that could just be me misremembering or missing the impact. Today, however, it has a major impact: the blocked reads make my entire cluster effectively unavailable after losing a host to maintenance.

The code itself does explicitly state that min_size is for writes, not reads, i.e. in src/common/options.cc:

    Option("osd_pool_default_min_size", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
    .set_default(0)
    .set_min_max(0, 255)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("the minimal number of copies allowed to write to a degraded pool for new replicated pools")
    .set_long_description("0 means no specific default; ceph will use size-size/2")
    .add_see_also("osd_pool_default_size")
    .add_service("mon"),

So at the very least, this description is wrong, since in practice reads are blocked as well as writes. However, I would prefer that it be correct, and that Ceph allow reads from undersized+degraded+peered PGs.
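For reference, the "size-size/2" fallback from the long description above works out as in the sketch below. This is only an illustration of the arithmetic: effective_default_min_size is a hypothetical helper name, not a function in the Ceph tree, and I am assuming that a non-zero configured value is simply used as-is.

```cpp
#include <cstdint>

// Hypothetical helper, NOT a Ceph function: it mirrors the documented
// fallback "size - size/2" used when osd_pool_default_min_size is 0.
// Assumption: a non-zero configured value is taken as-is.
constexpr std::uint32_t effective_default_min_size(std::uint32_t size,
                                                   std::uint32_t configured_min_size) {
  return configured_min_size != 0 ? configured_min_size
                                  : size - size / 2;  // integer division
}

// Compile-time checks of the arithmetic.
static_assert(effective_default_min_size(3, 0) == 2, "size=3 gives min_size=2");
static_assert(effective_default_min_size(2, 0) == 1, "size=2 gives min_size=1");
static_assert(effective_default_min_size(2, 2) == 2, "explicit value is kept");
```

So a size=3 pool gets min_size=2 by default, and a size=2 pool would get min_size=1 unless min_size is set explicitly, as it is in my size=2,min_size=2 case.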

There's already the option osd_allow_recovery_below_min_size which from a cursory check fulfills the same function but for recoveries; I don't imagine it would be very hard to implement a similar option, e.g. osd_pool_allow_dirty_reads for reads from degraded PGs, so that administrators can configure whether they want to enable this or not.
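To make the requested behaviour concrete, here is a minimal sketch of the gating rule I have in mind. Everything in it is hypothetical: io_permitted, IoKind and the allow_reads_below_min_size flag are names made up for illustration, and this is not how the OSD code is actually structured. It also only covers replicated pools; an erasure-coded pool would additionally need at least k shards to be able to serve a read at all.

```cpp
#include <cstdint>

enum class IoKind { Read, Write };

// Hypothetical policy check, NOT Ceph code.
// - With zero copies available, nothing is served.
// - With at least min_size copies, reads and writes proceed as today.
// - Below min_size, only reads proceed, and only if the (hypothetical)
//   opt-in flag is enabled; writes stay blocked.
constexpr bool io_permitted(IoKind kind,
                            std::uint32_t available_copies,
                            std::uint32_t min_size,
                            bool allow_reads_below_min_size) {
  return available_copies != 0 &&
         (available_copies >= min_size ||
          (kind == IoKind::Read && allow_reads_below_min_size));
}

// size=2,min_size=2 pool with one host down (one copy left):
static_assert(io_permitted(IoKind::Read, 1, 2, true), "degraded read allowed when opted in");
static_assert(!io_permitted(IoKind::Write, 1, 2, true), "writes still block below min_size");
static_assert(!io_permitted(IoKind::Read, 1, 2, false), "current behaviour: reads block too");
```

With the flag off this matches what happens today (all I/O blocks below min_size), so existing clusters would see no change unless the administrator opts in.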
