Bug #61861

open

LRC Erasure Coding profile not working as expected

Added by Eugen Block 11 months ago. Updated 10 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I was advised to create a ticket for this since there was no further response on the ML thread [1]. We hope to get some attention from the developers on this, but since the code hasn't been maintained in years it's unclear whether the LRC plugin is supposed to work the way it currently does or whether this is actually a bug.
I'll write up the main points here; there are more details in the mentioned thread (although it has become quite long).

On a hardware cluster with 18 HDD nodes across 3 rooms (or DCs), I intend to use 15 nodes so that the cluster can still recover if one node fails. Since I need one additional locality chunk per DC, I need a profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:

ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4 crush-failure-domain=host crush-locality=room crush-device-class=hdd
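For reference, based on the layered examples in the low-level documentation [2], I'd expect the high-level profile above to expand into something roughly like the following low-level configuration. The exact mapping/layer strings and the profile name `lrc1-lowlevel` are my guess at what the plugin generates, not verified output:

```shell
# Hypothetical low-level equivalent of k=9, m=3, l=4 (NOT verified against
# what the plugin actually generates): 15 chunks, one group of 5 per DC,
# each group = 1 local parity + 1 global parity + 3 data chunks.
ceph osd erasure-code-profile set lrc1-lowlevel \
    plugin=lrc \
    mapping=__DDD__DDD__DDD \
    layers='[
        [ "_cDDD_cDDD_cDDD", "" ],
        [ "cDDDD__________", "" ],
        [ "_____cDDDD_____", "" ],
        [ "__________cDDDD", "" ],
    ]' \
    crush-steps='[
        [ "choose", "room", 3 ],
        [ "chooseleaf", "host", 5 ],
    ]'
```

The first layer is the global k=9/m=3 code over the 12 non-local chunks; each of the remaining three layers adds one local parity per room.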

I created a pool with only one PG to make the output more readable. This profile should allow the cluster to sustain the loss of three chunks, but the results are interesting. This is what I tested:

  1. I stopped all OSDs on one host and the PG was still active with one missing chunk, everything's good.
  2. Stopping a second host in the same DC resulted in the PG being marked as "down". That was unexpected: with m=3 I expected the PG to still be active but degraded. Before test #3 I started all OSDs again to get clean PGs.
  3. I stopped one host per DC, so in total 3 chunks were missing and the PG was still active.
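To make the expectation behind tests 1-3 concrete, here is a toy Python model of the layered repair I'd expect from the low-level description [2]. The chunk layout (one group of five chunks per DC: one local parity plus four chunks of the global k=9/m=3 code) is my assumption, not the plugin's verified internal layout:

```python
from itertools import combinations

# Toy model of LRC layered decoding for k=9, m=3, l=4 (15 chunks).
# Assumed layout: positions 0-4 = DC1, 5-9 = DC2, 10-14 = DC3;
# the first position of each group is its local parity chunk.
LOCAL_LAYERS = [
    (set(range(0, 5)), 1),    # DC1 group: 5 chunks, 1 local parity
    (set(range(5, 10)), 1),   # DC2 group
    (set(range(10, 15)), 1),  # DC3 group
]
# Global layer: the 12 non-local chunks form the k=9, m=3 code.
GLOBAL_LAYER = (set(range(15)) - {0, 5, 10}, 3)
LAYERS = LOCAL_LAYERS + [GLOBAL_LAYER]

def recoverable(lost):
    """True if iterating layer-by-layer repair recovers all lost chunks."""
    lost = set(lost)
    progress = True
    while lost and progress:
        progress = False
        for members, parities in LAYERS:
            missing = lost & members
            # An MDS layer can rebuild up to `parities` missing chunks.
            if missing and len(missing) <= parities:
                lost -= missing
                progress = True
    return not lost

# Test 1: one host down -> one chunk lost: always recoverable.
assert all(recoverable({c}) for c in range(15))
# Test 2: two hosts in the same DC down -> two chunks lost in one group.
# The global layer (m=3) should still cover this, yet the PG went down.
assert all(recoverable(set(p)) for p in combinations(range(0, 5), 2))
# Test 3: one host per DC down -> three chunks, one per group: recoverable.
assert recoverable({1, 6, 11})
```

In this model any two losses inside one DC are still repairable through the global layer, which is why the PG going down in test #2 looks like a bug rather than intended behavior.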

Apparently, this profile can sustain the loss of m chunks, but not of an entire DC. I get the impression that either this LRC implementation is designed to handle only the loss of single OSDs, which can be recovered more quickly with fewer surviving OSDs and less bandwidth, or this is a bug, judging by the low-level description [2].
According to that description, if a whole DC fails, the chunks from step 3 cannot be recovered, and step 2 may also fail, but step 1 eventually falls back to the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow never arrives at step 1, so the PG stays down although there are enough surviving chunks. I'm not sure if my observations and conclusion are correct; I'd love to have a comment from the developers on this topic.

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RLTF4NTN5KGRSI4LEO43XUGHHP2GTKKO/
[2] https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration

#1

Updated by Radoslaw Zarzynski 10 months ago

Bumping this up for the next bug scrub.

#2

Updated by Neha Ojha 10 months ago

  • Project changed from Ceph to RADOS
#3

Updated by Radoslaw Zarzynski 10 months ago

Pinged Yaarit.
