Bug #62338

closed

osd: choose_async_recovery_ec may select an acting set < min_size

Added by Samuel Just 9 months ago. Updated 3 days ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: EC Pools
Target version:
% Done: 100%
Source: Community (user)
Tags: backport_processed
Backport: pacific, quincy, reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

choose_async_recovery_ec may remove OSDs from the acting set as long as PeeringState::recoverable evaluates to true. Prior to 90022b35 (merge of PR 17619), the condition was PeeringState::recoverable_and_ge_min_size, which behaved as the name indicates. 7cb818a85 weakened the condition in PeeringState::recoverable_and_ge_min_size to only check min_size if !cct->_conf.get_val<bool>("osd_allow_recovery_below_min_size") (the function was renamed to PeeringState::recoverable in a subsequent commit in that PR, e4c8bee88). PeeringState::recoverable_and_ge_min_size had (and its successor has) two callers: choose_acting and choose_async_recovery_ec. For choose_acting, this change is correct. However, for choose_async_recovery_ec, we don't want to reduce the acting set size below min_size, as doing so would prevent the PG from doing IO during recovery.

The main observable symptom is a PG that ends up in the peered state during recovery (peered+recovering, peered+recovery_wait), unable to do IO until recovery completes, even though there are enough nearly up-to-date OSDs available.
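
For illustration only, here is a minimal sketch of the guard the description says is missing. The names (Candidate, select_async_recovery, the parameters) are hypothetical and this is not the actual Ceph code; the real logic lives in PeeringState::choose_async_recovery_ec.

// Illustrative sketch, not the actual Ceph implementation.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Candidate {
  int osd;        // OSD id of an acting-set shard
  uint64_t cost;  // approx. log entries behind + missing objects
};

// Pick shards for async recovery, but never let the acting set shrink
// below min_size; without the min_size guard the PG can end up peered
// and unable to serve IO until recovery completes.
std::vector<int> select_async_recovery(std::vector<Candidate> candidates,
                                       size_t acting_size,
                                       size_t min_size,
                                       uint64_t min_cost) {
  std::sort(candidates.begin(), candidates.end(),
            [](const Candidate& a, const Candidate& b) {
              return a.cost > b.cost;  // costliest shards first
            });
  std::vector<int> async_targets;
  for (const auto& c : candidates) {
    if (c.cost < min_cost)
      break;  // remaining shards are cheap enough to recover in line
    size_t remaining = acting_size - async_targets.size();
    if (remaining <= min_size)
      break;  // removing another shard would drop the acting set below min_size
    async_targets.push_back(c.osd);
  }
  return async_targets;
}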


Related issues 3 (1 open, 2 closed)

Copied to RADOS - Backport #62817: quincy: osd: choose_async_recovery_ec may select an acting set < min_size (In Progress, Konstantin Shalygin)
Copied to RADOS - Backport #62818: pacific: osd: choose_async_recovery_ec may select an acting set < min_size (Resolved, Konstantin Shalygin)
Copied to RADOS - Backport #62819: reef: osd: choose_async_recovery_ec may select an acting set < min_size (Resolved, Konstantin Shalygin)
#1

Updated by Samuel Just 9 months ago

  • Description updated (diff)
#2

Updated by Prashant D 9 months ago

The workaround for this issue is to set osd_async_recovery_min_cost to a very large value:

# ceph config set osd osd_async_recovery_min_cost 1099511627776

Notes from Sam: the async recovery cost is the number of PG log entries the replica is behind plus the number of missing objects. osd_target_pg_log_entries_per_osd is 30000, so an OSD with a single PG could have 30000 entries. osd_async_recovery_min_cost is a 64-bit integer, so set it to 2^40 (1<<40), i.e. 1099511627776, a value the cost can never reach.
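
A hypothetical standalone snippet (not cluster code) just to sanity-check that arithmetic; the cost figures come from the note above.

// Hypothetical check of the suggested threshold.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t threshold = 1ULL << 40;           // 1099511627776
  const uint64_t rough_worst_cost = 30000 + 30000; // log entries behind + missing objects (order of magnitude)
  std::cout << "threshold = " << threshold
            << ", rough worst-case cost = " << rough_worst_cost << "\n";
  // The threshold is roughly 18 million times larger, so async recovery is effectively disabled.
  return 0;
}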
#3

Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 52823
#4

Updated by Neha Ojha 9 months ago

  • Project changed from Ceph to RADOS
#5

Updated by Radoslaw Zarzynski 8 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from octopus, pacific, quincy, reef to pacific,quincy,reef
#6

Updated by Backport Bot 8 months ago

  • Copied to Backport #62817: quincy: osd: choose_async_recovery_ec may select an acting set < min_size added
#7

Updated by Backport Bot 8 months ago

  • Copied to Backport #62818: pacific: osd: choose_async_recovery_ec may select an acting set < min_size added
#8

Updated by Backport Bot 8 months ago

  • Copied to Backport #62819: reef: osd: choose_async_recovery_ec may select an acting set < min_size added
#9

Updated by Backport Bot 8 months ago

  • Tags set to backport_processed
#10

Updated by Bartosz Rabiega about 2 months ago

Hello. Just FYI, this fixes a very nasty issue in my EC setup.
Here are some details.

The EC setup and CRUSH rules are defined to have:
3 racks
2 hosts per rack
12 disks per host

EC configuration is 7+5.
The CRUSH rule picks, within each rack, 2 hosts and 2 disks, so 4 chunks of data end up in each rack.

Now, here is the funny situation I end up in thanks to this bug.

1. Start some IO
2. Shut down OSDs from rack A (PGs are active+undersized)
3. Start OSDs from rack A (PGs are active+undersized)
4. Shut down OSDs from rack A again (some PGs are down)

So in theory, when a rack is down, 4 chunks are unavailable but 8 are still present, so all PGs should remain active.

Now, even weirder, continuing the case described above:

4a. Stop IO
5. Disable recovery/rebalance to make sure no chunks are recovered
6. Start OSDs from rack A (again PGs are active+undersized)
7. Shut down OSDs from rack A again (some PGs are down again, but far fewer than in step 4, e.g. 50 instead of 300)
8. Start OSDs from rack A (again PGs are active+undersized)
9. Shut down OSDs from rack A again (some PGs are down again, but far fewer than in step 7, e.g. 5 instead of 50)
10. Start OSDs from rack A (again PGs are active+undersized)
11. Shut down OSDs from rack A again (all PGs are active+undersized)

Further rack restarts no longer cause down PGs, unless there is some IO.

So my guess is that async recovery kicks in when rack A comes up for the first time and messes with the acting set; as a result, when rack A goes down again, some unfortunate PGs end up in the down state.

I retested everything a couple of times with `osd_async_recovery_min_cost 1099511627776` on reef - no more down PGs.

Thank you very very much for the fix.

#11

Updated by Bartosz Rabiega about 2 months ago

Hello again.

Apparently I got a tiny little bit too excited.

I tested the case described above with 16.2.15 and unfortunately the problem still exists.
However, if I disable async recovery (osd_async_recovery_min_cost 1099511627776), the cluster works as desired and all PG states are as expected (active+undersized, never down).

I'd appreciate any tips on how to narrow this bug down.

#12

Updated by Konstantin Shalygin 3 days ago

  • Category set to EC Pools
  • Status changed from Pending Backport to Resolved
  • Target version set to v19.1.0
  • % Done changed from 0 to 100
  • Source set to Community (user)