Bug #62338
osd: choose_async_recovery_ec may select an acting set < min_size
Status: Closed · % Done: 100%
Description
choose_async_recovery_ec may remove OSDs from the acting set as long as PeeringState::recoverable evaluates to true. Prior to 90022b35 (the merge of PR 17619), the condition was PeeringState::recoverable_and_ge_min_size, which behaved as its name indicates. 7cb818a85 weakened the condition in PeeringState::recoverable_and_ge_min_size to check min_size only if !cct->_conf.get_val<bool>("osd_allow_recovery_below_min_size") (the function was renamed to PeeringState::recoverable in a subsequent commit in that PR, e4c8bee88). PeeringState::recoverable_and_ge_min_size had (and has) two callers: choose_acting and choose_async_recovery_ec. For choose_acting, this change is correct. However, for choose_async_recovery_ec, we do not want to reduce the acting set below min_size, as that prevents the PG from doing IO during recovery.
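A minimal sketch of the intended guard (hypothetical names and a simplified cost model, not the actual Ceph code): OSDs whose async recovery cost exceeds the threshold may be dropped from the acting set, but never so many that the acting set shrinks below min_size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: cost[i] is the async recovery cost of acting[i]
// (log entries behind + missing objects). Candidates whose cost is at
// or above min_cost are removed for async recovery, but only while the
// acting set stays strictly larger than min_size.
std::vector<int> choose_async_recovery(std::vector<int> acting,
                                       const std::vector<std::size_t>& cost,
                                       std::size_t min_cost,
                                       std::size_t min_size) {
  // Iterate downward so erasing index i never disturbs cost/index
  // alignment for the indices still to be visited.
  for (std::size_t i = acting.size(); i-- > 0; ) {
    if (cost[i] >= min_cost && acting.size() > min_size) {
      acting.erase(acting.begin() + i);  // recover this OSD asynchronously
    }
  }
  return acting;  // remaining acting set, always >= min_size
}
```

With min_size = 3 and four OSDs of which three are expensive to recover, only one is dropped: the floor keeps the PG able to serve IO during recovery.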
The main observable symptom is a PG stuck in a peered state during recovery (peered+recovering, peered+recovery_wait), unable to serve IO until recovery completes, even though there are enough sufficiently up-to-date OSDs.
Updated by Prashant D 9 months ago
A workaround for this issue is to set osd_async_recovery_min_cost to a very large value:
# ceph config set osd osd_async_recovery_min_cost 1099511627776
Notes from Sam: The async recovery cost is the number of PG log entries the replica is behind plus the number of missing objects. osd_target_pg_log_entries_per_osd is 30000, so an OSD with a single PG could be 30000 entries behind. osd_async_recovery_min_cost is a 64-bit integer, so set it to 2^40 (1 << 40), i.e. 1099511627776, a value that can never be hit.
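The arithmetic behind the workaround value, as a quick check (constant name is illustrative, not a Ceph identifier): 2^40 is vastly larger than any reachable cost, which is bounded by roughly 30000 log entries plus the missing-object count.

```cpp
#include <cassert>
#include <cstdint>

// 2^40 == 1099511627776: far above any realistic async recovery cost
// (~30000 log entries behind + missing objects), so async recovery
// effectively never triggers.
constexpr std::uint64_t kAsyncRecoveryMinCost = UINT64_C(1) << 40;
static_assert(kAsyncRecoveryMinCost == UINT64_C(1099511627776),
              "2^40 as a 64-bit value");
```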
Updated by Radoslaw Zarzynski 9 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 52823
Updated by Radoslaw Zarzynski 8 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from octopus, pacific, quincy, reef to pacific, quincy, reef
Updated by Backport Bot 8 months ago
- Copied to Backport #62817: quincy: osd: choose_async_recovery_ec may select an acting set < min_size added
Updated by Backport Bot 8 months ago
- Copied to Backport #62818: pacific: osd: choose_async_recovery_ec may select an acting set < min_size added
Updated by Backport Bot 8 months ago
- Copied to Backport #62819: reef: osd: choose_async_recovery_ec may select an acting set < min_size added
Updated by Bartosz Rabiega 2 months ago
Hello. Just FYI, this fixes a very nasty issue in my EC setup.
Here are some details.
The EC setup and crush rules are defined to have:
3 racks
2 hosts per rack
12 disks per host
EC configuration 7+5
The CRUSH rule picks 1 rack, 2 hosts, and 2 disks, so 4 chunks of data land in each rack.
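Assuming the pool uses the common EC default min_size = k+1 (an assumption; the report does not state it), the arithmetic of this layout shows why the bug bites: losing one rack leaves exactly min_size chunks, so there is zero slack for async recovery to shrink the acting set.

```cpp
#include <cassert>

// Model of the reported layout: EC 7+5 across 3 racks,
// 4 chunks per rack (2 hosts x 2 disks).
constexpr int k = 7, m = 5;
constexpr int chunks_per_rack = 4;
constexpr int min_size = k + 1;  // common EC default, assumed here
constexpr int remaining_after_rack_down = (k + m) - chunks_per_rack;  // 8

// Exactly min_size chunks survive a rack outage: active, but no slack.
static_assert(remaining_after_rack_down == min_size, "zero slack");
// If async recovery then drops one more OSD from the acting set,
// the PG falls below min_size and can no longer serve IO.
static_assert(remaining_after_rack_down - 1 < min_size, "below min_size");
```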
Here is the odd situation I end up in thanks to this bug.
1. Start some IO
2. Shut down the OSDs in rack A (PGs become active+undersized)
3. Start the OSDs in rack A (PGs are active+undersized)
4. Shut down the OSDs in rack A again (some PGs go down)
So in theory, when a rack is down, 4 chunks are unavailable but 8 are still present: all PGs should remain active.
Even more weirdly, continuing the case described above:
4a. Stop the IO
5. Disable recovery/rebalance to make sure no chunks are recovered
6. Start the OSDs in rack A (PGs are active+undersized again)
7. Shut down the OSDs in rack A again (some PGs go down again, but far fewer than in step 4, e.g. 50 instead of 300)
8. Start the OSDs in rack A (PGs are active+undersized again)
9. Shut down the OSDs in rack A again (some PGs go down again, but fewer than in step 7, e.g. 5 instead of 50)
10. Start the OSDs in rack A (PGs are active+undersized again)
11. Shut down the OSDs in rack A again (all PGs remain active+undersized)
Further rack restarts no longer cause any PGs to go down, unless there is some IO.
So my guess is that async recovery kicks in when rack A comes up for the first time and messes with the acting set; as a result, when rack A goes down again, some unfortunate PGs end up in the down state.
I retested everything a couple of times with `osd_async_recovery_min_cost 1099511627776` on Reef: no more down PGs.
Thank you very very much for the fix.
Updated by Bartosz Rabiega 2 months ago
Hello again.
Apparently I got a little too excited.
I tested the case described above with 16.2.15, and unfortunately the problem still exists.
However, if I effectively disable async recovery (osd_async_recovery_min_cost 1099511627776), the cluster works as desired and all PG states are as expected (active+undersized, never down).
I'd appreciate any tips on how to narrow this bug down.
Updated by Konstantin Shalygin 7 days ago
- Category set to EC Pools
- Status changed from Pending Backport to Resolved
- Target version set to v19.1.0
- % Done changed from 0 to 100
- Source set to Community (user)