Bug #64802
rados: generalize stretch mode pg temp handling to be usable without stretch mode
Description
PeeringState::calc_replicated_acting_stretch encodes special behavior for stretch clusters which prohibits the primary from selecting a pg_temp which packs too many OSDs into the same crush bucket.
An example of the scenario we're worried about here would be the following sequence of events on a cluster with 6 replicas spread across 3 DCs with a min_size of 3 intended to prevent the pg from going active with a single dc:
1. pg is originally mapped to osds [a1, a2, b1, b2, c1, c2], where a1, a2 are in dc a; b1, b2 are in dc b; and c1, c2 are in dc c.
2. dcs a and b become unavailable (let's say power failed completely)
3. the pg acting set becomes [c1, c2]
4. the raw mapping (for whatever reason) shifts to [a1, a2, b1, b2, c1, c3] resulting in an acting set of [c1, c3]
5. c1 requests a temp mapping for [c1, c2] while backfilling [c3]
6. upon backfill completion, c1 requests a temp mapping of [c1, c3, c2]. Because the acting set now has 3 members, the pg goes active and accepts writes
7. dc c loses power, but dcs a and b come back online
The user expectation is that with dcs a and b back, the cluster should be fully available (though degraded). The actual outcome is that the above pg would be unable to complete peering because there would be an interval that might have gone active from which no osds can be contacted.
Stretch mode has a solution for this problem. PeeringState::calc_replicated_acting_stretch considers a few additional parameters to avoid the above situation:
- pg_pool_t::peering_crush_bucket_barrier
- pg_pool_t::peering_crush_bucket_target
pg_pool_t::peering_crush_bucket_target is the intended number of buckets of type pg_pool_t::peering_crush_bucket_barrier over which the acting set should be split. It uses these two values and pg_pool_t::size to derive bucket_max, the maximum allowable number of acting set members from any single bucket of type pg_pool_t::peering_crush_bucket_barrier (note to reader: please confirm everything I've indicated here independently :).
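Based on my reading (to be confirmed in the audit below), the derivation can be sketched as follows. This is an illustrative Python sketch, not the actual C++ from PeeringState; the assumption here is that bucket_max is ceil(size / target), which matches the later worked example (size=6, target=3 gives bucket_max=2):

```python
import math

def bucket_max(pool_size: int, peering_crush_bucket_target: int) -> int:
    # Assumption: ceiling division, consistent with size=6, target=3 -> 2.
    return math.ceil(pool_size / peering_crush_bucket_target)

def acting_ok(acting, osd_to_bucket, bmax):
    # An acting set is acceptable only if no single barrier-level bucket
    # (e.g. datacenter) contributes more than bmax members.
    counts = {}
    for osd in acting:
        b = osd_to_bucket[osd]
        counts[b] = counts.get(b, 0) + 1
    return all(n <= bmax for n in counts.values())

# Scenario from the description: 6 replicas spread across 3 DCs.
dc = {"a1": "a", "a2": "a", "b1": "b", "b2": "b",
      "c1": "c", "c2": "c", "c3": "c"}
bmax = bucket_max(6, 3)  # 2

print(acting_ok(["a1", "a2", "b1", "b2", "c1", "c2"], dc, bmax))  # True
print(acting_ok(["c1", "c3", "c2"], dc, bmax))  # False: 3 OSDs from dc c
```

Under this check, the problematic temp mapping [c1, c3, c2] from step 6 of the scenario would be rejected, because dc c contributes 3 members while bucket_max is 2.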
So far so good. The problem is that there is no way to set these values independently of the larger stretch_mode feature. OSDMonitor::prepare_new_pool and OSDMonitor::try_enable_stretch_mode together ensure that these values are set uniformly on all pools if the cluster as a whole has stretch mode enabled (OSDMap::stretch_mode_enabled).
I think the solution is likely as simple as allowing the user to specify pg_pool_t::peering_crush_bucket_barrier and pg_pool_t::peering_crush_bucket_target upon pool creation or afterwards without enabling stretch mode, but testing will need to be done (and added to teuthology!!) to ensure that there aren't hidden assumptions linking these to stretch mode.
Steps:
- audit code to ensure that the above assumptions are correct
- list areas that need to be changed
- create plan for teuthology test
- create test, confirm that it fails without changes
- create code changes
Updated by Vikhyat Umrao about 2 months ago
- Subject changed from rados: generalize stretch mode pg temp handling to be usable withotu stretch mode to rados: generalize stretch mode pg temp handling to be usable without stretch mode
Updated by Neha Ojha about 2 months ago
- Assignee set to Kamoltat (Junior) Sirivadhna
Updated by Kamoltat (Junior) Sirivadhna about 2 months ago
- Pull request ID set to 1924
Updated by Kamoltat (Junior) Sirivadhna about 2 months ago
- Pull request ID deleted (1924)
Updated by Kamoltat (Junior) Sirivadhna about 2 months ago
My current plan/script to set up a vstart cluster to test the above hypothesis:

MON=3 MDS=2 MGR=3 OSD=6 ../src/vstart.sh -n -x -l -d --without-dashboard

NUM_OSDS_UP=$(./bin/ceph osd df | grep "up" | wc -l)
if [ $NUM_OSDS_UP -lt 6 ]; then
    echo "test requires at least 6 OSDs up and running"
    exit 1
fi

for dc in dc1 dc2 dc3
do
    ./bin/ceph osd crush add-bucket $dc datacenter
    ./bin/ceph osd crush move $dc root=default
done

./bin/ceph osd crush add-bucket node-1 host
./bin/ceph osd crush add-bucket node-2 host
./bin/ceph osd crush add-bucket node-3 host
./bin/ceph osd crush add-bucket node-4 host
./bin/ceph osd crush add-bucket node-5 host
./bin/ceph osd crush add-bucket node-6 host

./bin/ceph osd crush move node-1 datacenter=dc1
./bin/ceph osd crush move node-2 datacenter=dc1
./bin/ceph osd crush move node-3 datacenter=dc2
./bin/ceph osd crush move node-4 datacenter=dc2
./bin/ceph osd crush move node-5 datacenter=dc3
./bin/ceph osd crush move node-6 datacenter=dc3

./bin/ceph osd crush move osd.0 host=node-1
./bin/ceph osd crush move osd.1 host=node-2
./bin/ceph osd crush move osd.2 host=node-3
./bin/ceph osd crush move osd.3 host=node-4
./bin/ceph osd crush move osd.4 host=node-5
./bin/ceph osd crush move osd.5 host=node-6

./bin/ceph mon set_location a datacenter=dc1 host=node-1
./bin/ceph mon set_location b datacenter=dc2 host=node-3
./bin/ceph mon set_location c datacenter=dc3 host=node-5

./bin/ceph osd getcrushmap > crushmap || return 1
./bin/crushtool --decompile crushmap > crushmap.txt || return 1
sed 's/^# end crush map$//' crushmap.txt > crushmap_modified.txt || return 1
cat >> crushmap_modified.txt << EOF
rule replicated_rule_custom {
        id 1
        type replicated
        step take dc1
        step chooseleaf firstn 2 type host
        step emit
        step take dc2
        step chooseleaf firstn 2 type host
        step emit
        step take dc3
        step chooseleaf firstn 2 type host
        step emit
}
# end crush map
EOF
./bin/crushtool --compile crushmap_modified.txt -o crushmap.bin || return 1
./bin/ceph osd setcrushmap -i crushmap.bin || return 1

poolname=test_pool
./bin/ceph osd pool create $poolname 32 32 replicated_rule_custom || return 1
./bin/ceph osd pool set $poolname size 6 || return 1
./bin/ceph osd pool set $poolname min_size 2 || return 1
Updated by Greg Farnum about 2 months ago
Don't forget that there is also pg_pool_t::peering_crush_bucket_count that directly requires a minimum number of higher-level CRUSH buckets (eg, data centers) in order to go active. We need this available as well.
Updated by Kamoltat (Junior) Sirivadhna about 2 months ago
peering_crush_bucket_[count|target|barrier]
Updated by Kamoltat (Junior) Sirivadhna about 2 months ago
I recently created a draft PR https://github.com/ceph/ceph/pull/56233/, adding the additional arguments peering_crush_bucket_count and peering_crush_bucket_barrier to the pool creation command.
The reason I added only 2 arguments is that, after auditing the code, I found that peering_crush_bucket_count and peering_crush_bucket_target share the same value, which is stretch_bucket_count.
Here are the two places:
OSDMonitor::prepare_new_pool()

  if (osdmap.stretch_mode_enabled) {
    pi->peering_crush_bucket_count = osdmap.stretch_bucket_count;
    pi->peering_crush_bucket_target = osdmap.stretch_bucket_count;
    pi->peering_crush_bucket_barrier = osdmap.stretch_mode_bucket;
    pi->peering_crush_mandatory_member = CRUSH_ITEM_NONE;
    if (osdmap.degraded_stretch_mode) {
      pi->peering_crush_bucket_count = osdmap.degraded_stretch_mode;
      pi->peering_crush_bucket_target = osdmap.degraded_stretch_mode;
      // pi->peering_crush_bucket_mandatory_member = CRUSH_ITEM_NONE;
      // TODO: drat, we don't record this ^ anywhere, though given that it
      // necessarily won't exist elsewhere it likely doesn't matter
      pi->min_size = pi->min_size / 2;
      pi->size = pi->size / 2; // only support 2 zones now
    }
  }
The only place where these two values (peering_crush_bucket_count and peering_crush_bucket_target) get changed is when we are in degraded_stretch_mode, and even then they both take the value of osdmap.degraded_stretch_mode.
OSDMonitor::try_enable_stretch_mode()

  if (commit) {
    for (auto pool : pools) {
      pool->crush_rule = new_rule;
      pool->peering_crush_bucket_count = bucket_count;
      pool->peering_crush_bucket_target = bucket_count;
      pool->peering_crush_bucket_barrier = dividing_id;
      pool->peering_crush_mandatory_member = CRUSH_ITEM_NONE;
      pool->size = g_conf().get_val<uint64_t>("mon_stretch_pool_size");
      pool->min_size = g_conf().get_val<uint64_t>("mon_stretch_pool_min_size");
    }
    pending_inc.change_stretch_mode = true;
    pending_inc.stretch_mode_enabled = true;
    pending_inc.new_stretch_bucket_count = bucket_count;
    pending_inc.new_degraded_stretch_mode = 0;
    pending_inc.new_stretch_mode_bucket = dividing_id;
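To make the coupling concrete, here is a minimal Python sketch mirroring the prepare_new_pool() logic above (illustrative only; dict keys stand in for the C++ fields, and CRUSH_ITEM_NONE is the constant from crush.h). It shows why these fields cannot today be set independently: they are always copied from cluster-wide stretch state.

```python
CRUSH_ITEM_NONE = 0x7fffffff  # sentinel from crush.h

def apply_stretch_defaults(pool, osdmap):
    # Mirror of the prepare_new_pool() snippet above: pool peering fields
    # are derived from the cluster-wide stretch state, never set per pool.
    if not osdmap["stretch_mode_enabled"]:
        return pool
    pool["peering_crush_bucket_count"] = osdmap["stretch_bucket_count"]
    pool["peering_crush_bucket_target"] = osdmap["stretch_bucket_count"]
    pool["peering_crush_bucket_barrier"] = osdmap["stretch_mode_bucket"]
    pool["peering_crush_mandatory_member"] = CRUSH_ITEM_NONE
    if osdmap["degraded_stretch_mode"]:
        # In degraded stretch mode both count and target take the
        # degraded value, and size/min_size are halved (2-zone assumption).
        pool["peering_crush_bucket_count"] = osdmap["degraded_stretch_mode"]
        pool["peering_crush_bucket_target"] = osdmap["degraded_stretch_mode"]
        pool["min_size"] //= 2
        pool["size"] //= 2
    return pool

osdmap = {"stretch_mode_enabled": True, "stretch_bucket_count": 2,
          "stretch_mode_bucket": 8, "degraded_stretch_mode": 1}
pool = apply_stretch_defaults({"size": 4, "min_size": 2}, osdmap)
print(pool["size"], pool["peering_crush_bucket_count"])  # 2 1
```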
I’m in the process of testing locally to see whether there are other areas I need to tweak in order to decouple the pg_temp handling capability from stretch mode. Will have more results on Monday.
Updated by Radoslaw Zarzynski about 2 months ago
- Priority changed from Normal to High
Updated by Kamoltat (Junior) Sirivadhna about 1 month ago
Okay so the latest change that I added will have two commands:
ceph osd pool stretch set <pool> <peering_crush_bucket_count> <peering_crush_bucket_target> <peering_crush_bucket_barrier> <crush_rule> <size> <min_size>
and
ceph osd pool stretch show <pool>
example output:
pool: cephfs.a.data
pool_id: 3
is_stretch_pool: 1
peering_crush_bucket_count: 3
peering_crush_bucket_target: 2
peering_crush_bucket_barrier: datacenter
crush_rule: replicated_rule_custom
size: 6
min_size: 3
The reason the set command requires all of these parameters is that, when they are set correctly all at once, they won’t break the logic already assumed in PeeringState::calc_replicated_acting_stretch and pg_pool_t::stretch_set_can_peer. Some of the validations I did early on when executing the ceph osd pool stretch set <pool> command are:
1. All arguments must be present.
2. All arguments must have the correct type (mostly handled in MonCommands.h).
3. A nonexistent pool should return an appropriate error.
4. count|target must be > 0.
5. count|target can’t exceed the actual number of buckets of type barrier found in the crush map.
6. A nonexistent barrier should return an appropriate error.
7. If the specified crush_rule doesn’t have the same type as the pool (e.g., replicated != ec_pool), it should return an appropriate error.
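The validation order above can be sketched roughly in Python (all names here are hypothetical; the real checks live in the OSDMonitor command handling, and MonCommands.h covers the type checks):

```python
def validate_stretch_set(pools, crush_bucket_counts, crush_rules,
                         pool, count, target, barrier, crush_rule):
    # Illustrative only: returns an error string, or None on success.
    if pool not in pools:
        return f"pool '{pool}' does not exist"
    if count <= 0 or target <= 0:
        return "count and target must be > 0"
    if barrier not in crush_bucket_counts:
        return f"no bucket of type '{barrier}' in the crush map"
    if count > crush_bucket_counts[barrier] or target > crush_bucket_counts[barrier]:
        return f"count/target exceed the number of '{barrier}' buckets"
    if crush_rules.get(crush_rule) != pools[pool]["type"]:
        return "crush rule type does not match pool type"
    return None

pools = {"test_pool": {"type": "replicated"}}
rules = {"replicated_rule_custom": "replicated"}
buckets = {"datacenter": 3}  # 3 DCs in the crush map

print(validate_stretch_set(pools, buckets, rules, "test_pool",
                           2, 3, "datacenter", "replicated_rule_custom"))  # None
```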
I tested and verified that all the cases are handled correctly.
Also, the reason I did not include <peering_crush_bucket_count> <peering_crush_bucket_target> <peering_crush_bucket_barrier> in ceph osd pool create is that the command doesn’t require all arguments to be present, and there is no argument for min_size, which is required, so it’s harder to enforce the integrity of the feature. Therefore, for now we rely on setting these values after pool creation.
Moreover, I also simulated the case Sam described in https://tracker.ceph.com/issues/64802
So I set up MON=9 MDS=1 MGR=1 OSD=9
rule replicated_rule_custom {
        id 1
        type replicated
        step take default
        step choose firstn 3 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1          0.78868  root default
-5          0.29575      datacenter dc1
-11         0.09859          host node-1
 0   hdd    0.09859              osd.0       up   1.00000  1.00000
-12         0.09859          host node-2
 1   hdd    0.09859              osd.1       up   1.00000  1.00000
-13         0.09859          host node-3
 2   hdd    0.09859              osd.2       up   1.00000  1.00000
-7          0.29576      datacenter dc2
-14         0.09859          host node-4
 3   hdd    0.09859              osd.3       up   1.00000  1.00000
-15         0.09859          host node-5
 4   hdd    0.09859              osd.4       up   1.00000  1.00000
-16         0.09859          host node-6
 5   hdd    0.09859              osd.5       up   1.00000  1.00000
-9          0.19717      datacenter dc3
-17         0.09859          host node-7
 6   hdd    0.09859              osd.6       up   1.00000  1.00000
-18         0.09859          host node-8
 7   hdd    0.09859              osd.7       up   1.00000  1.00000
-19         0.09859          host node-9
 8   hdd    0.09859              osd.8       up   1.00000  1.00000

Pools: cephfs.a.meta cephfs.a.data .mgr
With the command:
ceph osd pool stretch set <pool> <peering_crush_bucket_count> <peering_crush_bucket_target> <peering_crush_bucket_barrier> <crush_rule> <size> <min_size>
./bin/ceph osd pool stretch set cephfs.a.meta 2 3 datacenter replicated_rule_custom 6 3
./bin/ceph osd pool stretch set cephfs.a.data 2 3 datacenter replicated_rule_custom 6 3
./bin/ceph osd pool stretch set .mgr 2 3 datacenter replicated_rule_custom 6 3
I chose peering_crush_bucket_target=3 and peering_crush_bucket_count=2 because in this 3-site setup I want bucket_max to be 2, meaning that when we are choosing OSDs to go into the acting set of a PG, we cannot choose more than 2 OSDs from the same ancestor; in other words, we cannot choose more than 2 OSDs from the same datacenter. However, I still want the cluster to be fully available when 1 of the 3 DCs goes down, and that’s why I want bucket_count to be 2, since in pg_pool_t::stretch_set_can_peer the bucket_count determines the minimum number of ancestors needed in the want_acting set for the OSDs to peer with each other.
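The availability reasoning above can be sketched like this (my simplified reading of pg_pool_t::stretch_set_can_peer, illustrative only: peering is allowed when the want_acting set spans at least bucket_count distinct barrier-level ancestors):

```python
def can_peer(want_acting, osd_to_dc, bucket_count):
    # Simplified reading of pg_pool_t::stretch_set_can_peer: the acting
    # set must span at least bucket_count distinct barrier-level buckets
    # (datacenters in this setup).
    return len({osd_to_dc[o] for o in want_acting}) >= bucket_count

dc = {"a1": "dc1", "a2": "dc1", "b1": "dc2", "b2": "dc2",
      "c1": "dc3", "c2": "dc3"}

# With bucket_count=2, two surviving DCs can still peer...
print(can_peer(["a1", "a2", "b1", "b2"], dc, 2))  # True
# ...but a single surviving DC cannot
print(can_peer(["c1", "c2"], dc, 2))              # False
```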
I have tested the scenario and it went as I expected: with 2 DCs the cluster is still fully available (although some PGs are active+undersized). However, with only 1 DC surviving, the cluster becomes fully unavailable. This is understandable, since our min_size is 3 and we are not allowed to shove more than 2 OSDs from the surviving DC into the acting set.
Moreover, I also tested the scenario where we did not use the stretch set command, so the normal calc_replicated_acting is used instead. The results were what I expected: we were still able to serve IO to some PGs with only 1 surviving DC, since some PGs can shove 3 OSDs into the acting set, satisfying their min_size of 3. But once the 2 other DCs come back up and the surviving DC goes down, the cluster becomes almost fully unavailable, which does not satisfy the expectations of a stretch cluster.
I’m currently working on putting these commands into teuthology tests.
Updated by Kamoltat (Junior) Sirivadhna about 1 month ago
- Status changed from New to Fix Under Review
- Pull request ID set to 56233
Updated by Kamoltat (Junior) Sirivadhna about 1 month ago
- Backport set to squid
Updated by Kamoltat (Junior) Sirivadhna 29 days ago
Just created final revision: https://github.com/ceph/ceph/pull/56233
waiting for review
Updated by Kamoltat (Junior) Sirivadhna 26 days ago
- Related to Bug #65371: rados: PeeringState::calc_replicated_acting_stretch populate acting set before checking if < bucket_max added