Bug #51729

Upmap verification fails for multi-level crush rule

Added by Andras Pataki over 1 year ago. Updated 12 days ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
CRUSH
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a 6+3 EC crush rule that looks like this:

rule cephfs_ec63 {
        id 2
        type erasure
        min_size 3
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take root-disk
        step choose indep 3 type pod
        step choose indep 3 type rack
        step chooseleaf indep 1 type osd
        step emit
}


When I try to add any upmap for PGs that use this crush rule, I get a verification error from ceph-mon:
Jul 19 12:37:21 cephmon00 ceph-mon[16894]: 2021-07-19 12:37:21.856 7fffe1e96700 -1 verify_upmap number of buckets 9 exceeds desired 3

I've verified that the upmap is correct. An example PG:

9.7ef4 13774 0 0 13740 0 336168034922 0 0 3091 3091 active+remapped+backfill_wait 2021-07-19 12:30:36.136523 1884465'416385 1884831:2962474 [64,2354,1364,3718,252,1265,2505,1093,2759] 64 [64,2354,1364,3718,252,1265,2505,1122,2759] 64 1883485'416346 2021-07-17 22:49:20.660394 1883485'416346 2021-07-17 22:49:20.660394 0

with upmap:

ceph osd pg-upmap-items 9.7ef4 1093 1122

which is correct since osd.1093 and osd.1122 live on the same host (i.e. it doesn't violate crush rules to do this upmap).

The cluster is on 14.2.20.

osdmap.bin.gz - binary OSD map (281 KB) Andras Pataki, 07/23/2021 10:09 PM

osdmap.bad.bin - bad osdmap preventing upmap insertion (14.1 KB) Chris Durham, 10/25/2022 11:09 PM

osdmap.good.bin - good osdmap with upmap inserted (14.1 KB) Chris Durham, 10/25/2022 11:09 PM


Related issues

Related to RADOS - Bug #57796: after rebalance of pool via pgupmap balancer, continuous issues in monitor log Need More Info

History

#1 Updated by Neha Ojha over 1 year ago

  • Status changed from New to Need More Info

Can you share a copy of your binary osdmap?

#2 Updated by Andras Pataki over 1 year ago

Binary osd map attached.

#3 Updated by Chris Durham about 1 month ago

I am now seeing this issue on pacific, 16.2.10 on rocky8 linux.

If I have a >2 level rule on an ec pool (6+2), such as:

rule mypoolname {
        id -5
        type erasure
        step take myroot
        step choose indep 4 type rack
        step choose indep 2 type chassis
        step chooseleaf indep 1 type host
        step emit
}

myroot members are just a list of the rack buckets.

For each of my existing upmaps (I am in the process of changing my crushmap to this rule), osdmaptool --upmap-cleanup gives:

check_pg_upmaps verify upmap of pool.pgid returning -22
verify_upmap number of buckets 8 exceeds desired 2

If, however, I use:

rule mypoolname {
        id -5
        type erasure
        step take myroot
        step choose indep 4 type rack
        step chooseleaf indep 2 type chassis
        step emit
}

then for about 1/8 of my upmaps, osdmaptool --upmap-cleanup gives:

verify_upmap multiple osds N,M come from the same failure domain -382
check_pg_upmap verify upmap of pg poolid.pgid returning -22

and it wants to remove these and add about 100 more. I actually expect this, given that I am changing my rule from something else. But this rule
is only 2 levels, whereas the first one is 3 levels, which seems to cause the problem.

But the first rule should NOT give "expected 2, got 8"; I want 8 OSDs on a 6+2 EC pool. Thus, I think this problem still exists for multi-level (>2 level)
crush rules on EC pools.

This is a result of the following discussion, where Dan van der Ster tells me to report this here, as this appears to still be a bug.

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AZHAIGY3BIM4SGBUBKX5ZGYTXQWAJ7OO/

Please advise.

#4 Updated by Andras Pataki about 1 month ago

Just to clarify - the error "verify_upmap number of buckets X exceeds desired Y" comes from the C++ code in ceph-mon that tries to validate a newly inserted upmap. It is in CrushWrapper::verify_upmap(...):

    case CRUSH_RULE_CHOOSE_INDEP:
...
        if ((int)parents_of_type.size() > numrep) {
          lderr(cct) << __func__ << " number of buckets " 
                     << parents_of_type.size() << " exceeds desired " << numrep
                     << dendl;
          return -EINVAL;
        }
...

which seems to consider only a single level of the crush rule.
The ceph-mon process therefore rejects perfectly valid upmaps (in fact, every upmap in our EC pool), whether they come from a manual command, the balancer, or Dan's upmap-remapped script. As a consequence, we cannot use upmaps for balancing at all in our cluster.

#5 Updated by Chris Durham about 1 month ago

Andras,

Thanks for the extra info. This needs to be addressed. Anyone?

#6 Updated by Radoslaw Zarzynski about 1 month ago

A note from bug scrub: this is going to be assigned tomorrow.

#7 Updated by Radoslaw Zarzynski about 1 month ago

  • Related to Bug #57796: after rebalance of pool via pgupmap balancer, continuous issues in monitor log added

#8 Updated by Neha Ojha about 1 month ago

  • Assignee set to Laura Flores

Chris, can you please provide your osdmap binary?

#9 Updated by Chris Durham about 1 month ago

I put together the following contrived example to
illustrate the problem. Again, this is pacific 16.2.9 on rocky8 linux.

3 racks (as part of the default_test root bucket)
1 chassis in each rack
1 host in each chassis (total of 3 hosts)
8 osds per host.

First host has osds 0-7
Second host has osds 8-15
Third host has osds 16-23

I created a 2+1 ecpool, with the following crush rule:

rule ectest {
        id 2
        type erasure
        min_size 3
        max_size 3
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default_test
        step choose indep 3 type rack
        step choose indep 1 type chassis
        step chooseleaf indep 1 type host
        step emit
}

The pool has 32 PGs. One PG is:

2.1f: [20,9,4]

This is a valid mapping.

I then manually run:

ceph osd pg-upmap-items 2.1f 20 19

osd 19 is on the same host as osd 20.

The command exits with 0, which seems to indicate that the upmap succeeds.
However, the mon log shows this almost immediately:

2022-10-25T22:11:18.289+0000 7f12e3762700 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-10-25T22:11:18.289+0000 7f12e3762700 0 check_pg_upmaps verify_upmap of pg 2.1f returning -22

And the upmap is NOT created, as shown by:

ceph osd dump --format json-pretty

But I do want 3 buckets.

If I use the following rule:

rule ectest {
        id 2
        type erasure
        min_size 3
        max_size 3
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default_test
        step choose indep 3 type rack
        step chooseleaf indep 1 type chassis
        step emit
}

Then, the pg in question is still the same:

2.1f: [20,9,4]

Then, manually doing an upmap:

ceph osd pg-upmap-items 2.1f 20 19

But this time, the upmap 'takes', and the pg is mapped:

2.1f: [19,9,4]

Further, with the 'good' rule now in place, I changed the rule for the ec pool
back to the first 'bad' rule above. The mon log now shows:

2022-10-25T22:29:10.104+0000 7f12e3762700 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-10-25T22:29:10.104+0000 7f12e3762700 0 check_pg_upmaps verify_upmap of pg 2.1f returning -22

I am not sure when, but this upmap was removed shortly after, perhaps at the same time.

While my first rule is somewhat contrived (you could use chooseleaf and reduce the number of steps),
I think it does show the problem when there are 3+ levels in a rule for an EC pool.

Thus, it appears that with a 3+ level rule for an ec pool, verify_upmap has a problem that will either:

1. Prevent upmaps from being inserted, with verify_upmap errors in the monitor log.
2. Throw a verify_upmap error and remove a previously valid upmap after a rule change to a 3+ level rule.

The removal in #2 may be valid behavior, so I am not sure about that part.
But this all confirms my and Andras Pataki's situations above.

Attached:

osdmap.bad.bin: osdmap with the first 'bad' rule that then prevents an upmap from being added.
osdmap.good.bin: osdmap with the second 'good' rule with the valid upmap inserted.

Thanks for looking into this,

#10 Updated by Radoslaw Zarzynski 21 days ago

  • Status changed from Need More Info to New

#11 Updated by Radoslaw Zarzynski 21 days ago

Thanks for the info! Laura, would you mind taking another look?

#12 Updated by Laura Flores 21 days ago

Thanks Chris! @Radek I have been taking some time to analyze this scenario, and will post updates soon.

#13 Updated by Laura Flores 20 days ago

I believe I've reproduced the issue using the osdmaps that Chris provided.

First, I used the osdmaptool to run the upmap balancer on `osdmap.bad.bin`. I wanted it to create the same mapping Chris described: 2.1f 20 19.

I had to run the command multiple times to achieve this mapping, so I wrote a small script to do so. I set --upmap-deviation to 1 so the balancer would actually try mappings, since this osdmap was already considered "balanced" with the default deviation of 5:

#!/bin/bash

while :
do
    ./bin/osdmaptool ~/osdmap.bad.bin --upmap out.txt --upmap-pool ectest --upmap-deviation 1
    if cat out.txt | grep "2.1f" 
    then
        break
    fi
done

With this test script, the osdmaptool eventually suggested the "2.1f 20 19" mapping:

[lflores@folio01 build]$ ~/bad_map.sh
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
 limiting to pools ectest ([2])
pools ectest 
prepared 10/10 changes
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
 limiting to pools ectest ([2])
pools ectest 
prepared 10/10 changes
ceph osd pg-upmap-items 2.1f 20 19

And here are the exact mappings it suggested:

ceph osd pg-upmap-items 2.2 18 22
ceph osd pg-upmap-items 2.4 21 16
ceph osd pg-upmap-items 2.7 8 13
ceph osd pg-upmap-items 2.8 0 3
ceph osd pg-upmap-items 2.a 11 9
ceph osd pg-upmap-items 2.19 11 14 21 16
ceph osd pg-upmap-items 2.1e 5 4 21 17
ceph osd pg-upmap-items 2.1f 20 19

As we can see, the osdmaptool decides that "2.1f 20 19" is a valid mapping. But at this point, we have not run the mapping through "verify_upmap".

Now, I'll save the results so we can use the `--upmap-cleanup` option, which will trigger "verify_upmap". This is the function where we see the error message "verify_upmap number of buckets 9 exceeds desired 3".

First, I copy "osdmap.bad.bin" into a new file, "osdmap.bad-save.bin".

cp ~/osdmap.bad.bin ~/osdmap.bad-save.bin

Next, I modify the script a bit so we're actually saving the results this time:

#!/bin/bash

while :
do
    ./bin/osdmaptool ~/osdmap.bad-save.bin --upmap out.txt --upmap-pool ectest --upmap-deviation 1 --save
    if cat out.txt | grep "2.1f" 
    then
        break
    fi
done

We hit the desired mapping:

[lflores@folio01 build]$ ~/bad_map_save.sh
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
 limiting to pools ectest ([2])
pools ectest 
prepared 10/10 changes
./bin/osdmaptool: writing epoch 397 to /home/lflores/osdmap.bad-save.bin
ceph osd pg-upmap-items 2.1f 20 19

If I print out `osdmap.bad-save.bin`, it will have saved the mapping. I've removed some details here in case they might be sensitive (although the map is attached to the tracker):

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.bad-save.bin --print
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'

.....osdmap details....

pg_upmap_items 2.4 [11,14]
pg_upmap_items 2.7 [8,14]
pg_upmap_items 2.8 [0,3]
pg_upmap_items 2.c [21,16]
pg_upmap_items 2.d [5,4]
pg_upmap_items 2.e [8,13]
pg_upmap_items 2.17 [21,17]
pg_upmap_items 2.19 [11,9,21,16]
pg_upmap_items 2.1f [20,19] -------------------> here the mapping is saved
pg_temp 1.0 [8,3,7]

.... more osdmap details ...

Now, I'll run `--upmap-cleanup` to trigger the verify_upmap error message:

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.bad-save.bin --upmap-cleanup out.txt
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.4 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.7 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.8 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.c returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.d returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.e returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.17 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.19 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.1f returning -22

And here, it gave us commands to remove mappings that it has deemed "invalid":

ceph osd rm-pg-upmap-items 2.4
ceph osd rm-pg-upmap-items 2.7
ceph osd rm-pg-upmap-items 2.8
ceph osd rm-pg-upmap-items 2.c
ceph osd rm-pg-upmap-items 2.d
ceph osd rm-pg-upmap-items 2.e
ceph osd rm-pg-upmap-items 2.17
ceph osd rm-pg-upmap-items 2.19
ceph osd rm-pg-upmap-items 2.1f


Now, I'm going to try this with "osdmap.good.bin".

If we dump "osdmap.good.bin", we can see that the desired mapping is already there, so there's no need for me to create a "saved" version:

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.good.bin --dump
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.good.bin'
epoch 398

.... osdmap details ....

pg_upmap_items 2.1f [20,19] ---> here is the mapping
pg_temp 1.0 [8,3,7]

.... more osdmap details ....

Now, I'll try running "--upmap-cleanup" on this map to see if the same "verify_upmap" error message appears. It does not:

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.good.bin --upmap-cleanup out.txt
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.good.bin'
writing upmap command output to: out.txt
checking for upmap cleanups

And the "out.txt" file is completely empty, since verify_upmap did not see any problems with the mapping. If it had, it would have suggested removing the mapping with "ceph osd rm-pg-upmap-items 2.1f", as it did with the bad map.

I know some of this was redundant, but reproducing the issue brings me one step closer to understanding whether there's something wrong in verify_upmap.

#14 Updated by Chris Durham 19 days ago

Thanks again for looking at this.

I haven't looked further, but I suspect the issue will come down to the variable 'numrep' in CrushWrapper::verify_upmap(...) that was posted above (copied below for reference).
'numrep' is somehow not being set properly somewhere else for a 3+ level ec rule.

   case CRUSH_RULE_CHOOSE_INDEP:
...
        if ((int)parents_of_type.size() > numrep) {
          lderr(cct) << __func__ << " number of buckets " 
                     << parents_of_type.size() << " exceeds desired " << numrep
                     << dendl;
          return -EINVAL;
        }
...

#15 Updated by Radoslaw Zarzynski 12 days ago

Thanks for formulating the hypothesis!

Just updating to keep this ticket in the front of the tracker.
