Bug #64802


rados: generalize stretch mode pg temp handling to be usable without stretch mode

Added by Samuel Just about 2 months ago. Updated 29 days ago.

Status:
Fix Under Review
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

PeeringState::calc_replicated_acting_stretch encodes special behavior for stretch clusters which prohibits the primary from selecting a pg_temp which packs too many OSDs into the same crush bucket.

An example of the scenario we're worried about here would be the following sequence of events on a cluster with 6 replicas spread across 3 DCs with a min_size of 3 intended to prevent the pg from going active with a single dc:
1. pg is originally mapped to osds [a1, a2, b1, b2, c1, c2] where a1 and a2 are in dc a and b1, b2 are in dc b and c1, c2 are in dc c.
2. dcs a and b become unavailable (let's say power failed completely)
3. the pg acting set becomes [c1, c2]
4. the raw mapping (for whatever reason) shifts to [a1, a2, b1, b2, c1, c3] resulting in an acting set of [c1, c3]
5. c1 requests a temp mapping for [c1, c2] while backfilling [c3]
6. upon backfill completion, c1 requests a temp mapping of [c1, c3, c2]. Because the acting set now has 3 members, the pg goes active and accepts writes
7. dc c loses power, but dcs a and b come back online

The user expectation is that with dcs a and b back, the cluster should be fully available (though degraded). The actual outcome is that the above pg would be unable to complete peering because there would be an interval that might have gone active from which no osds can be contacted.

Stretch mode has a solution for this problem. PeeringState::calc_replicated_acting_stretch considers a few additional parameters to avoid the above situation:
- pg_pool_t::peering_crush_bucket_barrier
- pg_pool_t::peering_crush_bucket_target

pg_pool_t::peering_crush_bucket_target is the intended number of buckets of type pg_pool_t::peering_crush_bucket_barrier over which the acting set should be split. It uses these two and pg_pool_t::size to derive bucket_max, the maximum allowable number of acting set members from any single bucket of type pg_pool_t::peering_crush_bucket_barrier (note to reader: please confirm everything I've indicated here independently :).
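
As an illustration only (and with the same caveat to verify independently), a ceiling division of pg_pool_t::size by peering_crush_bucket_target would produce the bucket_max described above; this is a sketch of my reading, not a quote of the Ceph code:

# Hypothetical sketch of the bucket_max derivation described above; verify
# against PeeringState::calc_replicated_acting_stretch before relying on it.
size=6            # pg_pool_t::size
bucket_target=3   # pg_pool_t::peering_crush_bucket_target
bucket_max=$(( (size + bucket_target - 1) / bucket_target ))
echo "bucket_max=${bucket_max}"   # prints 2: at most 2 acting-set members per barrier bucket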

So far so good. The problem is that there is no way to set these values independently of the larger stretch_mode feature. OSDMonitor::prepare_new_pool and OSDMonitor::try_enable_stretch_mode together ensure that these values are set uniformly on all pools if the cluster as a whole has stretch mode enabled (OSDMap::stretch_mode_enabled).

I think the solution is likely as simple as allowing the user to specify pg_pool_t::peering_crush_bucket_barrier and pg_pool_t::peering_crush_bucket_target upon pool creation or afterwards without enabling stretch mode, but testing will need to be done (and added to teuthology!!) to ensure that there aren't hidden assumptions linking these to stretch mode.

Steps:
- audit code to ensure that the above assumptions are correct
- list areas that need to be changed
- create plan for teuthology test
- create test, confirm that it fails without changes
- create code changes


Related issues (1 open, 0 closed)

Related to RADOS - Bug #65371: rados: PeeringState::calc_replicated_acting_stretch populate acting set before checking if < bucket_max (Fix Under Review, Kamoltat (Junior) Sirivadhna)

Actions #1

Updated by Samuel Just about 2 months ago

  • Description updated (diff)
Actions #2

Updated by Vikhyat Umrao about 2 months ago

  • Subject changed from rados: generalize stretch mode pg temp handling to be usable withotu stretch mode to rados: generalize stretch mode pg temp handling to be usable without stretch mode
Actions #3

Updated by Neha Ojha about 2 months ago

  • Assignee set to Kamoltat (Junior) Sirivadhna
Actions #4

Updated by Kamoltat (Junior) Sirivadhna about 2 months ago

  • Pull request ID set to 1924
Actions #5

Updated by Kamoltat (Junior) Sirivadhna about 2 months ago

  • Pull request ID deleted (1924)
Actions #6

Updated by Kamoltat (Junior) Sirivadhna about 2 months ago

My current plan is to use the following script to set up a vstart cluster to test the above hypothesis:

MON=3 MDS=2 MGR=3 OSD=6 ../src/vstart.sh -n -x -l -d --without-dashboard

NUM_OSDS_UP=$(./bin/ceph osd df | grep "up" | wc -l)

if [ $NUM_OSDS_UP -lt 6 ]; then
    echo "test requires at least 6 OSDs up and running" 
    exit 1
fi

for dc in dc1 dc2 dc3
    do
      ./bin/ceph osd crush add-bucket $dc datacenter
      ./bin/ceph osd crush move $dc root=default
    done

./bin/ceph osd crush add-bucket node-1 host
./bin/ceph osd crush add-bucket node-2 host
./bin/ceph osd crush add-bucket node-3 host
./bin/ceph osd crush add-bucket node-4 host
./bin/ceph osd crush add-bucket node-5 host
./bin/ceph osd crush add-bucket node-6 host

./bin/ceph osd crush move node-1 datacenter=dc1
./bin/ceph osd crush move node-2 datacenter=dc1
./bin/ceph osd crush move node-3 datacenter=dc2
./bin/ceph osd crush move node-4 datacenter=dc2
./bin/ceph osd crush move node-5 datacenter=dc3
./bin/ceph osd crush move node-6 datacenter=dc3

./bin/ceph osd crush move osd.0 host=node-1
./bin/ceph osd crush move osd.1 host=node-2
./bin/ceph osd crush move osd.2 host=node-3
./bin/ceph osd crush move osd.3 host=node-4
./bin/ceph osd crush move osd.4 host=node-5
./bin/ceph osd crush move osd.5 host=node-6

./bin/ceph mon set_location a datacenter=dc1 host=node-1
./bin/ceph mon set_location b datacenter=dc2 host=node-3
./bin/ceph mon set_location c datacenter=dc3 host=node-5

./bin/ceph osd getcrushmap > crushmap || exit 1
./bin/crushtool --decompile crushmap > crushmap.txt || exit 1
sed 's/^# end crush map$//' crushmap.txt > crushmap_modified.txt || exit 1
cat >> crushmap_modified.txt << EOF
rule replicated_rule_custom {
        id 1
        type replicated
        step take dc1
        step chooseleaf firstn 2 type host
        step emit
        step take dc2
        step chooseleaf firstn 2 type host
        step emit
        step take dc3
        step chooseleaf firstn 2 type host
        step emit
}
# end crush map
EOF

./bin/crushtool --compile crushmap_modified.txt -o crushmap.bin || exit 1
./bin/ceph osd setcrushmap -i crushmap.bin  || exit 1

poolname=test_pool
./bin/ceph osd pool create $poolname 32 32 replicated_rule_custom  || exit 1
./bin/ceph osd pool set $poolname size 6 || exit 1
./bin/ceph osd pool set $poolname min_size 2 || exit 1
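
After the pool is created, a few standard commands (nothing specific to the changes proposed here) can be used to sanity-check that the custom rule really spreads each PG across the three datacenters:

# Optional sanity checks after pool creation (standard ceph commands):
./bin/ceph osd pool get $poolname crush_rule      # should report replicated_rule_custom
./bin/ceph osd pool get $poolname size            # should report 6
./bin/ceph pg dump pgs_brief | head               # inspect up/acting sets per PG
./bin/ceph osd tree                               # confirm the dc1/dc2/dc3 hierarchy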
Actions #7

Updated by Greg Farnum about 2 months ago

Don't forget that there is also pg_pool_t::peering_crush_bucket_count, which directly requires a minimum number of higher-level CRUSH buckets (e.g., data centers) in order to go active. We need this available as well.

Actions #8

Updated by Kamoltat (Junior) Sirivadhna about 2 months ago

peering_crush_bucket_[count|target|barrier]

Actions #10

Updated by Kamoltat (Junior) Sirivadhna about 2 months ago

I recently created a draft PR https://github.com/ceph/ceph/pull/56233/, adding the additional arguments peering_bucket_count and peering_crush_bucket_barrier to the pool creation command.
The reason I added only two arguments is that, after auditing the code, I found that peering_crush_bucket_count and peering_crush_bucket_target share the same value, which is stretch_bucket_count.

Here are the two places:

OSDMonitor::prepare_new_pool()
  if (osdmap.stretch_mode_enabled) {
    pi->peering_crush_bucket_count = osdmap.stretch_bucket_count;
    pi->peering_crush_bucket_target = osdmap.stretch_bucket_count;
    pi->peering_crush_bucket_barrier = osdmap.stretch_mode_bucket;
    pi->peering_crush_mandatory_member = CRUSH_ITEM_NONE;
    if (osdmap.degraded_stretch_mode) {
      pi->peering_crush_bucket_count = osdmap.degraded_stretch_mode;
      pi->peering_crush_bucket_target = osdmap.degraded_stretch_mode;
      // pi->peering_crush_bucket_mandatory_member = CRUSH_ITEM_NONE;
      // TODO: drat, we don't record this ^ anywhere, though given that it
      // necessarily won't exist elsewhere it likely doesn't matter
      pi->min_size = pi->min_size / 2;
      pi->size = pi->size / 2; // only support 2 zones now
    }
  }

The only place where these two values (peering_crush_bucket_count and peering_crush_bucket_target) get changed is when we are in degraded stretch mode, and even then they both share the value of osdmap.degraded_stretch_mode.
OSDMonitor::try_enable_stretch_mode()
  if (commit) {
    for (auto pool : pools) {
      pool->crush_rule = new_rule;
      pool->peering_crush_bucket_count = bucket_count;
      pool->peering_crush_bucket_target = bucket_count;
      pool->peering_crush_bucket_barrier = dividing_id;
      pool->peering_crush_mandatory_member = CRUSH_ITEM_NONE;
      pool->size = g_conf().get_val<uint64_t>("mon_stretch_pool_size");
      pool->min_size = g_conf().get_val<uint64_t>("mon_stretch_pool_min_size");
    }
    pending_inc.change_stretch_mode = true;
    pending_inc.stretch_mode_enabled = true;
    pending_inc.new_stretch_bucket_count = bucket_count;
    pending_inc.new_degraded_stretch_mode = 0;
    pending_inc.new_stretch_mode_bucket = dividing_id;

I’m in the process of testing locally to see if there are other areas I need to tweak in order to decouple the pg_temp handling capability from stretch mode. I will have more results on Monday.

Actions #11

Updated by Radoslaw Zarzynski about 2 months ago

  • Priority changed from Normal to High
Actions #12

Updated by Kamoltat (Junior) Sirivadhna about 1 month ago

Okay, so the latest change I added introduces two commands:

ceph osd pool stretch set <pool> <peering_crush_bucket_count> <peering_crush_bucket_target> <peering_crush_bucket_barrier> <crush_rule> <size> <min_size>
and
ceph osd pool stretch show <pool>

example output:

pool: cephfs.a.data
pool_id: 3
is_stretch_pool: 1
peering_crush_bucket_count: 3
peering_crush_bucket_target: 2
peering_crush_bucket_barrier: datacenter
crush_rule: replicated_rule_custom
size: 6
min_size: 3

The reason the set command takes all of these parameters at once is that, when they are set correctly together, they won’t break the logic already assumed in PeeringState::calc_replicated_acting_stretch and pg_pool_t::stretch_set_can_peer. Some of the validations I perform early on when executing the ceph osd pool stretch set <pool> command are:

1. All arguments must be present.
2. All arguments must have the correct type (mostly handled in MonCommands.h).
3. A nonexistent pool should return an appropriate error.
4. count|target must be > 0.
5. count|target cannot be greater than the actual number of buckets of the barrier type found in the CRUSH map.
6. A nonexistent barrier should return an appropriate error.
7. If the specified crush_rule doesn’t have the same type as the pool (e.g., replicated != ec_pool), it should return an appropriate error.

I tested and verified that all the cases are handled correctly.
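
For illustration, invocations along the following lines should be rejected by those validations (these are hypothetical examples reusing the test_pool and CRUSH layout from the earlier vstart script; the exact error messages depend on the implementation in the PR):

# Hypothetical invocations exercising the validations above:
./bin/ceph osd pool stretch set no_such_pool 2 3 datacenter replicated_rule_custom 6 3   # nonexistent pool
./bin/ceph osd pool stretch set test_pool 0 3 datacenter replicated_rule_custom 6 3      # count/target must be > 0
./bin/ceph osd pool stretch set test_pool 4 4 datacenter replicated_rule_custom 6 3      # count/target exceed the 3 datacenters in the map
./bin/ceph osd pool stretch set test_pool 2 3 rack replicated_rule_custom 6 3            # no 'rack' buckets exist in this CRUSH map (nonexistent barrier)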
Also, the reason I did not include <peering_crush_bucket_count> <peering_crush_bucket_target> <peering_crush_bucket_barrier> in ceph osd pool create is that the command doesn’t require all arguments to be present, and there is no argument for min_size, which is required, so it’s harder to enforce the integrity of the feature. Therefore, for now we rely on setting these values after pool creation.

Moreover, I also simulated the case Sam described in https://tracker.ceph.com/issues/64802.
So I set up a cluster with MON=9 MDS=1 MGR=1 OSD=9:

rule replicated_rule_custom {
       id 1
       type replicated
       step take default
       step choose firstn 3 type datacenter
       step chooseleaf firstn 2 type host
       step emit
}

ID   CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.78868  root default
-5         0.29575      datacenter dc1
-11         0.09859          host node-1
 0    hdd  0.09859              osd.0        up   1.00000  1.00000
-12         0.09859          host node-2
 1    hdd  0.09859              osd.1        up   1.00000  1.00000
-13         0.09859          host node-3
 2    hdd  0.09859              osd.2        up   1.00000  1.00000
-7         0.29576      datacenter dc2
-14         0.09859          host node-4
 3    hdd  0.09859              osd.3        up   1.00000  1.00000
-15         0.09859          host node-5
 4    hdd  0.09859              osd.4        up   1.00000  1.00000
-16         0.09859          host node-6
 5    hdd  0.09859              osd.5        up   1.00000  1.00000
-9         0.19717      datacenter dc3
-17         0.09859          host node-7
 6    hdd  0.09859              osd.6        up   1.00000  1.00000
-18         0.09859          host node-8
 7    hdd  0.09859              osd.7        up   1.00000  1.00000
-19         0.09859          host node-9
 8    hdd  0.09859              osd.8        up   1.00000  1.00000

Pools:
cephfs.a.meta
cephfs.a.data
.mgr

With the command:
ceph osd pool stretch set <pool> <peering_crush_bucket_count> <peering_crush_bucket_target> <peering_crush_bucket_barrier> <crush_rule> <size> <min_size>

./bin/ceph osd pool stretch set cephfs.a.meta 2 3 datacenter replicated_rule_custom 6 3
./bin/ceph osd pool stretch set cephfs.a.data 2 3 datacenter replicated_rule_custom 6 3
./bin/ceph osd pool stretch set .mgr 2 3 datacenter replicated_rule_custom 6 3

I chose to use peering_crush_bucket_target=3 and peering_crush_bucket_count=2 because in this 3-site setup I want bucket_max to be 2, meaning that when we are choosing OSDs to go into the acting set of a PG, we cannot choose more than 2 OSDs from the same ancestor; in other words, we cannot choose more than 2 OSDs from the same datacenter. However, I still want the cluster to be fully available when 1 of the 3 DCs goes down, and that’s why I want bucket_count to be 2, since in pg_pool_t::stretch_set_can_peer the bucket_count determines the minimum number of ancestors you need in your want_acting_set for the OSDs to be able to peer with each other.
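
To make the bucket_count condition above concrete, here is a rough sketch of the rule as I understand it (an illustration of the intent of pg_pool_t::stretch_set_can_peer, not the actual implementation): peering is allowed only if the acting set spans at least bucket_count distinct buckets of the barrier type.

# Rough illustration only (not the real pg_pool_t::stretch_set_can_peer code):
# count the distinct datacenters represented in the acting set and compare
# against peering_crush_bucket_count.
bucket_count=2
acting_dcs=(dc1 dc1 dc2 dc2)   # example: datacenter of each acting-set member
distinct=$(printf '%s\n' "${acting_dcs[@]}" | sort -u | wc -l)
if [ "$distinct" -ge "$bucket_count" ]; then
  echo "can peer: acting set spans $distinct datacenters"
else
  echo "cannot peer: acting set spans only $distinct datacenter(s)"
fi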
I have tested the scenario and it went as I expected: with 2 DCs up, the cluster is still fully available (although some PGs are active+undersized). However, with only 1 DC surviving, the cluster becomes fully unavailable. This is understandable since our min_size is 3 and we are not allowed to shove more than 2 OSDs from the surviving DC into the acting set.
Moreover, I also tested the scenario where we did not use the stretch set command, so the normal calc_replicated_acting is used instead, and the results were what I had expected: we were still able to serve IO to some PGs when there is only 1 surviving DC, since some PGs can shove 3 OSDs from that DC into the acting set and satisfy their min_size of 3. But once the 2 other DCs come back up and the surviving DC goes down, the cluster becomes almost fully unavailable, which does not satisfy the expectations of a stretch cluster.
I’m currently working on putting these commands into teuthology tests.
Actions #13

Updated by Kamoltat (Junior) Sirivadhna about 1 month ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56233
Actions #14

Updated by Kamoltat (Junior) Sirivadhna about 1 month ago

  • Backport set to squid
Actions #15

Updated by Kamoltat (Junior) Sirivadhna 29 days ago

Just created final revision: https://github.com/ceph/ceph/pull/56233

waiting for review

Actions #16

Updated by Kamoltat (Junior) Sirivadhna 26 days ago

  • Related to Bug #65371: rados: PeeringState::calc_replicated_acting_stretch populate acting set before checking if < bucket_max added