Bug #65008

open

EC pool - PGs down even if min size is satisfied

Added by Bartosz Rabiega about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Peering
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello, I've been evaluating an erasure-coded Ceph setup with the following requirements:

- k+m 7+5
- 3 racks
- 5 hosts per rack
- 24 osds per host
- min_size set to 8

The CRUSH rule is defined to pick 4 OSDs per rack (so we end up with 4 EC pieces per rack).

The setup works pretty well; however, I hit one peculiar issue related to heavy writes, peering, and recovery.

Here is the scenario. It is 100% reproducible in my setup, and I checked it on multiple Ceph versions (see the selected affected versions).

- To narrow down the issue, set the norebalance, norecover, noscrub, and nodeep-scrub flags.
- Run intensive 4k random-write IO to saturate the cluster.
- During IO, turn off ALL OSDs in rack A: all PGs are active.
- During IO, turn ALL OSDs in rack A back on: all PGs are active.
- Turn off ALL OSDs in rack A again: some PGs are down.
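The flag-setting step above can be sketched as a small helper that builds the corresponding `ceph osd set` / `ceph osd unset` commands. This is only an illustration: the helper name `flag_commands` is hypothetical, the commands are printed rather than executed, and how the OSDs in rack A were stopped and started is not stated in the report, so that part is left out.

```python
# Flags listed in the reproduction steps.
FLAGS = ["norebalance", "norecover", "noscrub", "nodeep-scrub"]


def flag_commands(action="set"):
    """Return one `ceph osd set|unset <flag>` argv list per flag.

    `flag_commands` is a hypothetical helper for illustration only;
    it builds the commands and does not run them.
    """
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in FLAGS]


if __name__ == "__main__":
    # Print the commands to run before starting the IO load.
    for cmd in flag_commands("set"):
        print(" ".join(cmd))
```

Calling `flag_commands("unset")` yields the matching cleanup commands once the test is finished.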

During the whole scenario, all OSDs in racks B and C are perfectly fine, which should give us 8 available pieces of EC data (4 per rack) at all times.
PGs should therefore be active all the time.
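The expectation above is simple arithmetic and can be checked with a short sketch. The constants come from the setup described in this report; the function names are made up for illustration, and the model assumes the CRUSH rule always places exactly 4 shards per rack.

```python
# Values taken from the setup described in this report.
K, M = 7, 5              # erasure-coding profile: 7 data + 5 coding shards
RACKS = 3
SHARDS_PER_RACK = 4      # the CRUSH rule picks 4 OSDs per rack
MIN_SIZE = 8


def surviving_shards(racks_down):
    """Shards still available when `racks_down` whole racks are offline."""
    return (RACKS - racks_down) * SHARDS_PER_RACK


def pg_should_be_active(racks_down):
    """A PG is expected to stay active while surviving shards >= min_size."""
    return surviving_shards(racks_down) >= MIN_SIZE


if __name__ == "__main__":
    # Sanity check: the rule places exactly k+m shards across the racks.
    assert K + M == RACKS * SHARDS_PER_RACK
    for down in range(RACKS + 1):
        print(down, "rack(s) down:", surviving_shards(down),
              "shards, active expected:", pg_should_be_active(down))
```

With one rack down, 8 shards remain, which meets min_size and also exceeds k = 7, so the data is still reconstructable.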

This bug has something to do with async recovery. The issue reported at https://tracker.ceph.com/issues/62338 looked similar, so I tested its fix, unfortunately without luck.
However, I did get lucky with the workaround, which disables async recovery.

All PGs are always active when async recovery is disabled.
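The report does not say exactly how async recovery was disabled. One plausible way, shown here purely as an assumption, is to raise the `osd_async_recovery_min_cost` option to a very large value so that async recovery is effectively never chosen. The helper name and the value `2**40` are arbitrary illustrations, and the command is only built, not executed.

```python
def disable_async_recovery_command(min_cost=2**40):
    """Build a `ceph config set` argv list that raises the async recovery cost.

    ASSUMPTION: setting osd_async_recovery_min_cost very high is the
    workaround meant here; the report itself does not name the option.
    The default of 2**40 is an arbitrary large number, not a recommended
    value.
    """
    return ["ceph", "config", "set", "osd",
            "osd_async_recovery_min_cost", str(min_cost)]


if __name__ == "__main__":
    print(" ".join(disable_async_recovery_command()))
```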

I've attached several files showing the cluster state while PGs are down and the OSDs in rack A are turned off.
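A PG query dump such as the attached one can be inspected with a short script like the sketch below. It assumes the usual top-level `state`, `up`, and `acting` keys of `ceph pg <pgid> query` output, and that missing EC shards appear as 2147483647 (CRUSH's "none" item). The sample data is hypothetical and is not copied from the attachment.

```python
import json

CRUSH_ITEM_NONE = 2147483647  # placeholder CRUSH uses for a missing shard


def live_osds(osds):
    """Count entries in an up/acting list that map to a real OSD."""
    return sum(1 for osd in osds if osd != CRUSH_ITEM_NONE)


def summarize(pg_query, min_size=8):
    """Summarize PG state and shard counts from a parsed pg query dict."""
    up = live_osds(pg_query.get("up", []))
    acting = live_osds(pg_query.get("acting", []))
    return {
        "state": pg_query.get("state"),
        "up_shards": up,
        "acting_shards": acting,
        "up_meets_min_size": up >= min_size,
    }


if __name__ == "__main__":
    # Hypothetical sample, NOT the content of pg-3.7cc-query.json.
    none = CRUSH_ITEM_NONE
    sample = json.loads(json.dumps({
        "state": "down",
        "up": [none, none, none, none, 130, 141, 152, 163, 250, 261, 272, 283],
        "acting": [none, none, none, none, 130, 141, 152, 163, 250, 261, 272, 283],
    }))
    print(summarize(sample))
```

To use it on a real dump, replace the sample with `json.load(open("pg-3.7cc-query.json"))`. A result that shows state `down` while `up_meets_min_size` is true corresponds to the symptom described here.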


Files

pg-3.7cc-query.json (9.95 KB), PG which is down, Bartosz Rabiega, 03/20/2024 02:33 PM
osdtree.txt (30.4 KB), Bartosz Rabiega, 03/20/2024 02:33 PM
crush-rule-dump.txt (2.13 KB), Bartosz Rabiega, 03/20/2024 02:33 PM
osdmap.bin (164 KB), Bartosz Rabiega, 03/20/2024 02:33 PM
crushmap.bin (16.6 KB), Bartosz Rabiega, 03/20/2024 02:33 PM
