Bug #51729
Upmap verification fails for multi-level crush rule
Description
We have a 6+3 EC crush rule that looks like this:
rule cephfs_ec63 {
id 2
type erasure
min_size 3
max_size 9
step set_chooseleaf_tries 5
step set_choose_tries 100
step take root-disk
step choose indep 3 type pod
step choose indep 3 type rack
step chooseleaf indep 1 type osd
step emit
}
When I try to add any upmap for PGs that use this crush rule, I get a verification error from ceph-mon:
Jul 19 12:37:21 cephmon00 ceph-mon[16894]: 2021-07-19 12:37:21.856 7fffe1e96700 -1 verify_upmap number of buckets 9 exceeds desired 3
I've verified that the upmap is correct. An example PG:
9.7ef4 13774 0 0 13740 0 336168034922 0 0 3091 3091 active+remapped+backfill_wait 2021-07-19 12:30:36.136523 1884465'416385 1884831:2962474 [64,2354,1364,3718,252,1265,2505,1093,2759] 64 [64,2354,1364,3718,252,1265,2505,1122,2759] 64 1883485'416346 2021-07-17 22:49:20.660394 1883485'416346 2021-07-17 22:49:20.660394 0
with upmap:
ceph osd pg-upmap-items 9.7ef4 1093 1122
which is correct since osd.1093 and osd.1122 live on the same host (i.e. it doesn't violate crush rules to do this upmap).
The cluster is on 14.2.20.
History
#1 Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
Can you share a copy of your binary osdmap?
#3 Updated by Chris Durham about 1 year ago
I am now seeing this issue on pacific, 16.2.10 on rocky8 linux.
If I have a >2 level rule on an ec pool (6+2), such as:
rule mypoolname {
id -5
type erasure
step take myroot
step choose indep 4 type rack
step choose indep 2 type chassis
step chooseleaf indep 1 type host
step emit
}
myroot members are just a list of the rack buckets.
osdmaptool --upmap-cleanup gives:
check_pg_upmaps verify upmap of pool.pgid returning -22
verify_upmap number of buckets 8 exceeds desired 2
for each of my existing upmaps. (I am in the process of changing my crushmap to this rule.)
If, however, I use:
rule mypoolname {
id -5
type erasure
step take myroot
step choose indep 4 type rack
step chooseleaf indep 2 type chassis
step emit
}
osdmaptool --upmap-cleanup gives:
verify_upmap multiple osds N,M come from the same failure domain -382
check_pg_upmap verify upmap of pg poolid.pgid returning -22.
This happens for about 1/8 of my upmaps, and it wants to remove these and add about 100 more.
I actually expect this, given that I am changing my rule from something else. But this second rule is only 2 levels, whereas the first one is 3 levels, which seems to cause the problem.
The first rule should NOT give "expected 2, got 8": I want 8 OSDs on a 6+2 EC pool. Thus, I think this problem still exists for a multi-level (>2 level) crush rule on ec pools.
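The bucket arithmetic behind the "expected 2, got 8" rejection can be sketched like this (a hypothetical Python model of how choose steps compose, not Ceph code):

```python
def buckets_per_step(choose_steps):
    """Cumulative number of buckets selected after each choose step.

    Each step picks `numrep` children from every bucket chosen by the
    previous step, so the counts multiply down the levels.
    """
    totals = []
    cumulative = 1
    for bucket_type, numrep in choose_steps:
        cumulative *= numrep
        totals.append((bucket_type, cumulative))
    return totals

# The 3-level 6+2 rule: 4 racks, 2 chassis per rack, 1 host per chassis.
levels = buckets_per_step([("rack", 4), ("chassis", 2), ("host", 1)])
# levels == [("rack", 4), ("chassis", 8), ("host", 8)]
```

A valid 6+2 mapping legitimately spans 8 chassis, but the per-step check compares that 8 against only the chassis step's own numrep (2), producing "number of buckets 8 exceeds desired 2".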
This is a result of the following discussion, where Dan van der Ster tells me to report this here, as this appears to still be a bug.
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AZHAIGY3BIM4SGBUBKX5ZGYTXQWAJ7OO/
Please advise.
#4 Updated by Andras Pataki about 1 year ago
Just to clarify - the error "verify_upmap number of buckets X exceeds desired Y" comes from the C++ code in ceph-mon that tries to validate a newly inserted upmap. It is in CrushWrapper::verify_upmap(...):
case CRUSH_RULE_CHOOSE_INDEP:
  ...
  if ((int)parents_of_type.size() > numrep) {
    lderr(cct) << __func__ << " number of buckets " << parents_of_type.size()
               << " exceeds desired " << numrep << dendl;
    return -EINVAL;
  }
  ...
which seems to really only consider a single level in the crush rule.
So the ceph-mon process rejects perfectly correct upmaps (all upmaps in our EC pool in fact), whether they come from a manual command, the balancer or Dan's upmap-remapped script. In our cluster as a consequence we can't use any kind of upmaps for balancing.
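A minimal model of why the single-level check rejects valid upmaps (hypothetical Python; the names check_level_naive and check_level_cumulative are illustrative, not Ceph API):

```python
def check_level_naive(parents_seen, step_numrep):
    # Mirrors the quoted C++ check: the bucket count at one level is
    # compared against only that step's numrep, ignoring earlier choose steps.
    return parents_seen <= step_numrep

def check_level_cumulative(parents_seen, enclosing_numreps):
    # What multi-level rules need: compare against the product of the
    # numreps of all choose steps down to this level.
    product = 1
    for n in enclosing_numreps:
        product *= n
    return parents_seen <= product

# The 6+3 rule from the original report: 3 pods x 3 racks = 9 rack buckets.
assert not check_level_naive(9, 3)        # valid upmap rejected: "9 exceeds desired 3"
assert check_level_cumulative(9, [3, 3])  # accounting for the pod step accepts it
```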
#5 Updated by Chris Durham about 1 year ago
Andras,
Thanks for the extra info. This needs to be addressed. Anyone?
#6 Updated by Radoslaw Zarzynski about 1 year ago
A note from bug scrub: this is going to be assigned tomorrow.
#7 Updated by Radoslaw Zarzynski about 1 year ago
- Related to Bug #57796: after rebalance of pool via pgupmap balancer, continuous issues in monitor log added
#8 Updated by Neha Ojha about 1 year ago
- Assignee set to Laura Flores
Chris, can you please provide your osdmap binary?
#9 Updated by Chris Durham about 1 year ago
- File osdmap.bad.bin added
- File osdmap.good.bin added
I put together the following contrived example to
illustrate the problem. Again, this is pacific 16.2.9 on rocky8 linux.
3 racks (as part of the default_test root bucket)
1 chassis in each rack
1 host in each chassis (total of 3 hosts)
8 osds per host.
First host has osds 0-7
Second host has osds 8-15
Third host has osds 16-23
I created a 2+1 ecpool, with the following crush rule:
rule ectest {
id 2
type erasure
min_size 3
max_size 3
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default_test
step choose indep 3 type rack
step choose indep 1 type chassis
step chooseleaf indep 1 type host
step emit
}
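What this rule selects at each level can be sketched as follows (a hypothetical model of the multiplication down the levels, not Ceph code):

```python
# Each choose step picks `numrep` children from every bucket chosen by the
# previous step, so the totals multiply down the levels.
steps = [("rack", 3), ("chassis", 1), ("host", 1)]
levels = {}
total = 1
for bucket_type, numrep in steps:
    total *= numrep
    levels[bucket_type] = total
# levels == {"rack": 3, "chassis": 3, "host": 3}
```

So the rule legitimately selects 3 chassis and 3 hosts in total, but the verify_upmap check compares those counts of 3 against the individual steps' numrep of 1.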
The pool has 32 pgs. One pg is:
2.1f: [20,9,4]
This is a valid mapping.
I then manually run:
- ceph osd pg-upmap-items 2.1f 20 19
osd 19 is on the same host as osd 20.
The command exits with 0, which seems to indicate that the upmap succeeds.
However, the mon log shows this almost immediately:
2022-10-25T22:11:18.289+0000 7f12e3762700 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-10-25T22:11:18.289+0000 7f12e3762700 0 check_pg_upmaps verify_upmap of pg 2.1f returning -22
And the upmap is NOT created, as shown by
- ceph osd dump --format json-pretty
But I do want 3 buckets.
If I use the following rule:
rule ectest {
id 2
type erasure
min_size 3
max_size 3
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default_test
step choose indep 3 type rack
step chooseleaf indep 1 type chassis
step emit
}
Then, the pg in question is still the same:
2.1f: [20,9,4]
Then, manually doing an upmap:
- ceph osd pg-upmap-items 2.1f 20 19
But this time, the upmap 'takes', and the pg is mapped:
2.1f: [19,9,4]
Further, with the 'good' rule now in place, I changed the rule for the ec pool
back to the first 'bad' rule above. The mon log now shows:
2022-10-25T22:29:10.104+0000 7f12e3762700 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-10-25T22:29:10.104+0000 7f12e3762700 0 check_pg_upmaps verify_upmap of pg 2.1f returning -22
I am not sure when, but this upmap was removed shortly after, perhaps at the same time.
While my first rule is somewhat contrived, since you can use chooseleaf and reduce the number of steps, I think it does show the problem when there are 3+ levels in a rule for an ec pool.
Thus, it appears that with a 3+ level rule for an ec pool, verify_upmap has a problem that will either:
1. prevent upmaps from being inserted, with verify_upmap errors in the monitor log, or
2. throw a verify_upmap error and remove a previously valid upmap after a rule change to a 3+ level rule.
The removal in #2 may be a valid thing to do, so I am not sure about that part.
But this all confirms my and Andras Pataki's situations above.
Attached:
osdmap.bad.bin: osdmap with the first 'bad' rule that then prevents an upmap from being added.
osdmap.good.bin: osdmap with the second 'good' rule with the valid upmap inserted.
Thanks for looking into this,
#10 Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from Need More Info to New
#11 Updated by Radoslaw Zarzynski about 1 year ago
Thanks for the info! Laura, would you mind taking another look?
#12 Updated by Laura Flores about 1 year ago
Thanks Chris! @Radek I have been taking some time to analyze this scenario, and will post updates soon.
#13 Updated by Laura Flores about 1 year ago
I believe I've reproduced the issue using the osdmaps that Chris provided.
First, I used the osdmaptool to run the upmap balancer on `osdmap.bad.bin`. I wanted it to create the same mapping Chris described: 2.1f 20 19.
I had to run the command multiple times to achieve this mapping, so I wrote a small script to do so. I set --upmap-deviation to 1 so the balancer would actually try mappings, since this osdmap was already considered "balanced" with the default deviation of 5:
#!/bin/bash
while :
do
./bin/osdmaptool ~/osdmap.bad.bin --upmap out.txt --upmap-pool ectest --upmap-deviation 1
if cat out.txt | grep "2.1f"
then
break
fi
done
With this test script, the osdmaptool eventually suggested the "2.1f 20 19" mapping:
[lflores@folio01 build]$ ~/bad_map.sh
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
limiting to pools ectest ([2])
pools ectest
prepared 10/10 changes
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
limiting to pools ectest ([2])
pools ectest
prepared 10/10 changes
ceph osd pg-upmap-items 2.1f 20 19
And here are the exact mappings it suggested:
ceph osd pg-upmap-items 2.2 18 22
ceph osd pg-upmap-items 2.4 21 16
ceph osd pg-upmap-items 2.7 8 13
ceph osd pg-upmap-items 2.8 0 3
ceph osd pg-upmap-items 2.a 11 9
ceph osd pg-upmap-items 2.19 11 14 21 16
ceph osd pg-upmap-items 2.1e 5 4 21 17
ceph osd pg-upmap-items 2.1f 20 19
As we can see, the osdmaptool decides that "2.1f 20 19" is a valid mapping. But at this point, we have not run the mapping through "verify_upmap".
Now, I'll save the results so we can use the `--upmap-cleanup` option, which will trigger "verify_upmap". This is the function where we see the error message "verify_upmap number of buckets 9 exceeds desired 3".
First, I copy "osdmap.bad.bin" into a new file, "osdmap.bad-save.bin".
cp ~/osdmap.bad.bin ~/osdmap.bad-save.bin
Next, I modify the script a bit so we're actually saving the results this time:
#!/bin/bash
while :
do
./bin/osdmaptool ~/osdmap.bad-save.bin --upmap out.txt --upmap-pool ectest --upmap-deviation 1 --save
if cat out.txt | grep "2.1f"
then
break
fi
done
We hit the desired mapping:
[lflores@folio01 build]$ ~/bad_map_save.sh
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
limiting to pools ectest ([2])
pools ectest
prepared 10/10 changes
./bin/osdmaptool: writing epoch 397 to /home/lflores/osdmap.bad-save.bin
ceph osd pg-upmap-items 2.1f 20 19
If I print out `osdmap.bad-save.bin`, it will have saved the mapping. I've removed some details here in case they might be sensitive (although the map is attached to the tracker):
[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.bad-save.bin --print
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'
.....osdmap details....
pg_upmap_items 2.4 [11,14]
pg_upmap_items 2.7 [8,14]
pg_upmap_items 2.8 [0,3]
pg_upmap_items 2.c [21,16]
pg_upmap_items 2.d [5,4]
pg_upmap_items 2.e [8,13]
pg_upmap_items 2.17 [21,17]
pg_upmap_items 2.19 [11,9,21,16]
pg_upmap_items 2.1f [20,19] -------------------> here the mapping is saved
pg_temp 1.0 [8,3,7]
.... more osdmap details ...
Now, I'll run `--upmap-cleanup` to trigger the verify_upmap error message:
[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.bad-save.bin --upmap-cleanup out.txt
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.4 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.7 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.8 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.c returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.d returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.e returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.17 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.19 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100 0 check_pg_upmaps verify_upmap of pg 2.1f returning -22
And here, it gave us commands to remove mappings that it has deemed "invalid":
ceph osd rm-pg-upmap-items 2.4
ceph osd rm-pg-upmap-items 2.7
ceph osd rm-pg-upmap-items 2.8
ceph osd rm-pg-upmap-items 2.c
ceph osd rm-pg-upmap-items 2.d
ceph osd rm-pg-upmap-items 2.e
ceph osd rm-pg-upmap-items 2.17
ceph osd rm-pg-upmap-items 2.19
ceph osd rm-pg-upmap-items 2.1f
Now, I'm going to try this with "osdmap.good.bin".
If we dump "osdmap.good.bin", we can see that the desired mapping is already there, so there's no need for me to create a "saved" version:
[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.good.bin --dump
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.good.bin'
epoch 398
.... osdmap details ....
pg_upmap_items 2.1f [20,19] ---> here is the mapping
pg_temp 1.0 [8,3,7]
.... more osdmap details ....
Now, I'll try running "--upmap-cleanup" on this map to see if the same "verify_upmap" error message appears. It does not:
[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.good.bin --upmap-cleanup out.txt
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.good.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
And the "out.txt" file is completely empty since "verify_upmap" did not see any problems with the mapping. If it had, it would have suggested we remove the mapping with "ceph osd rm-pg-upmap-items 2.1f", as it did with the bad map.
I know some of this was redundant, but reproducing the issue brings me one step closer to understanding whether there's something wrong in verify_upmap.
#14 Updated by Chris Durham about 1 year ago
Thanks again for looking at this.
I haven't looked further, but I suspect the issue will come down to the variable 'numrep' in CrushWrapper::verify_upmap(...) that was posted above (repeated below for reference).
'numrep' is somehow not being set properly somewhere else for a 3+ level ec rule.

case CRUSH_RULE_CHOOSE_INDEP:
  ...
  if ((int)parents_of_type.size() > numrep) {
    lderr(cct) << __func__ << " number of buckets " << parents_of_type.size()
               << " exceeds desired " << numrep << dendl;
    return -EINVAL;
  }
  ...
#15 Updated by Radoslaw Zarzynski about 1 year ago
Thanks for formulating the hypothesis!
Just updating to keep this ticket in the front of the tracker.
#16 Updated by Laura Flores 12 months ago
Update on this tracker: I am discussing this scenario with Josh Salomon, who is very knowledgeable about balancing. He and I recently did some work on the verify_upmap() function, so I want to go over some things with him to reason about where an issue might be in that function.
#17 Updated by Radoslaw Zarzynski 12 months ago
- Status changed from New to In Progress
#18 Updated by Chris Durham 10 months ago
Is there any news on this? Thanks.
#19 Updated by Laura Flores 10 months ago
Hi Chris, yes, I will post another update soon with my findings.
#20 Updated by Chris Durham 8 months ago
Laura Flores wrote:
Hi Chris, yes, I will post another update soon with my findings.
Pinging for updates....
#21 Updated by Laura Flores 8 months ago
Hi Chris, this issue was actually discussed at Cephalocon. Looking at the verify_upmap code, it seems that we may need to add another case to handle 3+ level crush rules.
I am working on a solution. When I have something, I will update the tracker with the PR so you and others involved can take a look.
#22 Updated by Laura Flores 7 months ago
I've landed on a potential fix for this problem. After evaluating the examples everyone provided and checking the verify_upmap logic, it looks like we need to multiply each step's "numrep" by the accumulated numrep of the enclosing choose steps.
With this example crush rule, the current logic causes valid upmaps to fail with "verify_upmap number of buckets 3 exceeds desired 1". But the rule implies that we choose 3 racks, then 1 chassis from each of those 3 racks:
rule ectest {
id 2
type erasure
min_size 3
max_size 3
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default_test
step choose indep 3 type rack
step choose indep 1 type chassis
step chooseleaf indep 1 type host
step emit
}
What we really want is to choose 3 racks, then 1 chassis from each rack, meaning 3 chassis in total.
This is roughly what the fix will look like:
diff --git a/src/crush/CrushWrapper.cc b/src/crush/CrushWrapper.cc
index 530c2413933..dc75f00ba16 100644
--- a/src/crush/CrushWrapper.cc
+++ b/src/crush/CrushWrapper.cc
@@ -908,6 +908,7 @@ int CrushWrapper::verify_upmap(CephContext *cct,
}
int root_bucket = 0;
int cursor = 0;
+ int curr_choose_numrep = 1;
std::map<int, int> type_stack;
for (unsigned step = 0; step < rule->len; ++step) {
auto curstep = &rule->steps[step];
@@ -953,7 +954,8 @@ int CrushWrapper::verify_upmap(CephContext *cct,
case CRUSH_RULE_CHOOSE_FIRSTN:
case CRUSH_RULE_CHOOSE_INDEP:
{
- int numrep = curstep->arg1;
+ int numrep = curr_choose_numrep * curstep->arg1;
+ curr_choose_numrep = numrep;
int type = curstep->arg2;
if (numrep <= 0)
numrep += pool_size;
Also to note: this may be a small change, but it will need to be well tested. I am designing a unit test around it.
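The effect of the accumulation on the ectest rule can be modeled like this (a hypothetical sketch of the patched logic, not the actual C++):

```python
def desired_per_step(choose_numreps):
    """Model of the patched logic: each choose step's effective numrep is
    the running product of its own numrep and those of the enclosing steps."""
    desired = []
    curr_choose_numrep = 1
    for numrep in choose_numreps:
        curr_choose_numrep *= numrep
        desired.append(curr_choose_numrep)
    return desired

# ectest rule: choose indep 3 rack, choose indep 1 chassis, (leaf) 1 host
old_desired = [3, 1, 1]                    # what the unpatched code compared against
new_desired = desired_per_step([3, 1, 1])  # [3, 3, 3], matching the buckets actually selected
```

With the accumulated values, an upmap spanning 3 chassis no longer trips the "exceeds desired" check at the chassis level.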
#23 Updated by Laura Flores 7 months ago
Hi Chris,
If possible, can you try this change with your crush rule?
https://github.com/ceph/ceph/compare/main...ljflores:ceph:wip-tracker-51729?expand=1
It would help to get some early feedback on the fix as I refine the unit test.