Bug #51729


Upmap verification fails for multi-level crush rule

Added by Andras Pataki almost 3 years ago. Updated 3 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
CRUSH
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a 6+3 EC crush rule that looks like this:

rule cephfs_ec63 {
        id 2
        type erasure
        min_size 3
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take root-disk
        step choose indep 3 type pod
        step choose indep 3 type rack
        step chooseleaf indep 1 type osd
        step emit
}


When I try to add any upmap for PGs that use this crush rule, I get a verification error from ceph-mon:
Jul 19 12:37:21 cephmon00 ceph-mon[16894]: 2021-07-19 12:37:21.856 7fffe1e96700 -1 verify_upmap number of buckets 9 exceeds desired 3

I've verified that the upmap is correct. An example PG:

9.7ef4 13774 0 0 13740 0 336168034922 0 0 3091 3091 active+remapped+backfill_wait 2021-07-19 12:30:36.136523 1884465'416385 1884831:2962474 [64,2354,1364,3718,252,1265,2505,1093,2759] 64 [64,2354,1364,3718,252,1265,2505,1122,2759] 64 1883485'416346 2021-07-17 22:49:20.660394 1883485'416346 2021-07-17 22:49:20.660394 0

with upmap:

ceph osd pg-upmap-items 9.7ef4 1093 1122

which is correct since osd.1093 and osd.1122 live on the same host (i.e. it doesn't violate crush rules to do this upmap).
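
For reference, one can confirm that two OSDs share a host by comparing their crush locations, e.g. (invocation illustrative, using the OSD ids above):

ceph osd find 1093 | grep host
ceph osd find 1122 | grep host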

The cluster is on 14.2.20.


Files

osdmap.bin.gz (281 KB) osdmap.bin.gz binary OSD map Andras Pataki, 07/23/2021 10:09 PM
osdmap.bad.bin (14.1 KB) osdmap.bad.bin bad osdmap preventing upmap insertion Chris Durham, 10/25/2022 11:09 PM
osdmap.good.bin (14.1 KB) osdmap.good.bin good osdmap with upmap inserted Chris Durham, 10/25/2022 11:09 PM

Related issues (1 open, 0 closed)

Related to RADOS - Bug #57796: after rebalance of pool via pgupmap balancer, continuous issues in monitor log (Need More Info)

Actions #1

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Can you share a copy of your binary osdmap?

Actions #2

Updated by Andras Pataki over 2 years ago

Binary osd map attached.

Actions #3

Updated by Chris Durham over 1 year ago

I am now seeing this issue on Pacific 16.2.10 on Rocky 8 Linux.

If I have a >2 level rule on an ec pool (6+2), such as:

rule mypoolname {
id -5
type erasure
step take myroot
step choose indep 4 type rack
step choose indep 2 type chassis
step chooseleaf indep 1 type host
step emit
}

myroot members are just a list of the rack buckets.

osdmaptool --upmap-cleanup gives:

check_pg_upmaps verify upmap of pool.pgid returning -22
verify_upmap number of buckets 8 exceeds desired 2

This appears for each of my existing upmaps (I am changing my crush map to this rule).

If, however, I use:

rule mypoolname {
id -5
type erasure
step take myroot
step choose indep 4 type rack
step chooseleaf indep 2 type chassis
step emit
}

osdmaptool --upmap-cleanup gives:

verify_upmap multiple osds N,M come from the same failure domain -382
check_pg_upmap verify upmap of pg poolid.pgid returning -22.

This happens for about 1/8 of my upmaps, and it wants to remove these and add about 100 more.
I actually expect this, given that I am changing my rule from something else. But this rule
is only 2 levels, whereas the first one is 3 levels, which seems to cause the problem.

But the first rule should NOT complain that it expected 2 and got 8: with 4 racks × 2 chassis × 1 host the rule places exactly 8 OSDs, which is what I want on a 6+2 EC pool. Thus, I think this problem still exists for multi-level (>2 level) crush rules on EC pools.

This came out of the following discussion, where Dan van der Ster advised me to report it here, as it still appears to be a bug.

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AZHAIGY3BIM4SGBUBKX5ZGYTXQWAJ7OO/

Please advise.

Actions #4

Updated by Andras Pataki over 1 year ago

Just to clarify - the error "verify_upmap number of buckets X exceeds desired Y" comes from the C++ code in ceph-mon that tries to validate a newly inserted upmap. It is in CrushWrapper::verify_upmap(...):

    case CRUSH_RULE_CHOOSE_INDEP:
...
        if ((int)parents_of_type.size() > numrep) {
          lderr(cct) << __func__ << " number of buckets " 
                     << parents_of_type.size() << " exceeds desired " << numrep
                     << dendl;
          return -EINVAL;
        }
...

which seems to consider only a single level of the crush rule: the 9 OSDs of a 6+3 PG land in 3 pods × 3 racks = 9 distinct racks, but the check apparently compares that count against the rack step's own numrep of 3, hence "number of buckets 9 exceeds desired 3".
So the ceph-mon process rejects perfectly correct upmaps (in fact all upmaps in our EC pool), whether they come from a manual command, the balancer, or Dan's upmap-remapped script. As a consequence we cannot use any kind of upmaps for balancing in our cluster.
Actions #5

Updated by Chris Durham over 1 year ago

Andras,

Thanks for the extra info. This needs to be addressed. Anyone?

Actions #6

Updated by Radoslaw Zarzynski over 1 year ago

A note from bug scrub: this is going to be assigned tomorrow.

Actions #7

Updated by Radoslaw Zarzynski over 1 year ago

  • Related to Bug #57796: after rebalance of pool via pgupmap balancer, continuous issues in monitor log added
Actions #8

Updated by Neha Ojha over 1 year ago

  • Assignee set to Laura Flores

Chris, can you please provide your osdmap binary?

Actions #9

Updated by Chris Durham over 1 year ago

I put together the following contrived example to illustrate the problem. Again, this is Pacific 16.2.9 on Rocky 8 Linux.

3 racks (as part of the default_test root bucket)
1 chassis in each rack
1 host in each chassis (total of 3 hosts)
8 osds per host.

First host has osds 0-7
Second host has osds 8-15
Third host has osds 16-23
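
For reference, a hierarchy like this can be viewed with, e.g.:

ceph osd tree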

I created a 2+1 EC pool with the following crush rule:

rule ectest {
id 2
type erasure
min_size 3
max_size 3
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default_test
step choose indep 3 type rack
step choose indep 1 type chassis
step chooseleaf indep 1 type host
step emit
}

The pool has 32 PGs. One PG is:

2.1f: [20,9,4]

This is a valid mapping.
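
As an aside, the mappings a rule like this produces can also be checked offline with crushtool's test mode; a rough sketch, filenames illustrative:

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 2 --num-rep 3 --show-mappings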

I then manually run:

ceph osd pg-upmap-items 2.1f 20 19

osd 19 is on the same host as osd 20.

The command exits with 0, which seems to indicate that the upmap succeeds.
However, the mon log shows this almost immediately:

2022-10-25T22:11:18.289+0000 7f12e3762700 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-10-25T22:11:18.289+0000 7f12e3762700 0 check_pg_upmaps verify_upmap of pg 2.1f returning -22

And the upmap is NOT created, as shown by:

ceph osd dump --format json-pretty

But I do want 3 buckets.

If I use the following rule:

rule ectest {
id 2
type erasure
min_size 3
max_size 3
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default_test
step choose indep 3 type rack
step chooseleaf indep 1 type chassis
step emit
}

Then, the pg in question is still the same:

2.1f: [20,9,4]

Then, manually doing an upmap:

ceph osd pg-upmap-items 2.1f 20 19

But this time, the upmap 'takes', and the pg is mapped:

2.1f: [19,9,4]

Further, with the 'good' rule now in place, I changed the rule for the ec pool
back to the first 'bad' rule above. The mon log now shows:

2022-10-25T22:29:10.104+0000 7f12e3762700 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-10-25T22:29:10.104+0000 7f12e3762700 0 check_pg_upmaps verify_upmap of pg 2.1f returning -22

I am not sure when, but this upmap was removed shortly after, perhaps at the same time.

While my first rule is somewhat contrived (you could use chooseleaf instead and reduce the number of steps), I think it does show the problem that arises when there are 3+ levels in a rule for an EC pool.

Thus, it appears that with a 3+ level rule for an EC pool, verify_upmap has a problem that will either:

1. prevent new upmaps from being inserted, logging verify_upmap errors in the monitor log, or
2. throw a verify_upmap error and remove a previously valid upmap after the rule is changed to a 3+ level rule.

The removal may be a valid thing to do in case 2, so I am not sure about that part.
But this all confirms my and Andras Pataki's situations above.

Attached:

osdmap.bad.bin: osdmap with the first 'bad' rule that then prevents an upmap from being added.
osdmap.good.bin: osdmap with the second 'good' rule with the valid upmap inserted.
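
For anyone following along, the attached maps can be inspected offline with osdmaptool; a minimal sketch using the attached filenames:

osdmaptool osdmap.bad.bin --print | grep pg_upmap
osdmaptool osdmap.good.bin --print | grep pg_upmap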

Thanks for looking into this,

Actions #10

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Need More Info to New
Actions #11

Updated by Radoslaw Zarzynski over 1 year ago

Thanks for the info! Laura, would you mind retaking a look?

Actions #12

Updated by Laura Flores over 1 year ago

Thanks Chris! @Radoslaw Smigielski I have been taking some time to analyze this scenario, and will post updates soon.

Actions #13

Updated by Laura Flores over 1 year ago

I believe I've reproduced the issue using the osdmaps that Chris provided.

First, I used the osdmaptool to run the upmap balancer on `osdmap.bad.bin`. I wanted it to create the same mapping Chris described: 2.1f 20 19.

I had to run the command multiple times to achieve this mapping, so I wrote a small script to do so. I set --upmap-deviation to 1 so the balancer would actually try mappings, since this osdmap was already considered "balanced" with the default deviation of 5:

#!/bin/bash

while :
do
    ./bin/osdmaptool ~/osdmap.bad.bin --upmap out.txt --upmap-pool ectest --upmap-deviation 1
    if cat out.txt | grep "2.1f" 
    then
        break
    fi
done

With this test script, the osdmaptool eventually suggested the "2.1f 20 19" mapping:

[lflores@folio01 build]$ ~/bad_map.sh
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
 limiting to pools ectest ([2])
pools ectest 
prepared 10/10 changes
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
 limiting to pools ectest ([2])
pools ectest 
prepared 10/10 changes
ceph osd pg-upmap-items 2.1f 20 19

And here are the exact mappings it suggested:

ceph osd pg-upmap-items 2.2 18 22
ceph osd pg-upmap-items 2.4 21 16
ceph osd pg-upmap-items 2.7 8 13
ceph osd pg-upmap-items 2.8 0 3
ceph osd pg-upmap-items 2.a 11 9
ceph osd pg-upmap-items 2.19 11 14 21 16
ceph osd pg-upmap-items 2.1e 5 4 21 17
ceph osd pg-upmap-items 2.1f 20 19

As we can see, the osdmaptool decides that "2.1f 20 19" is a valid mapping. But at this point, we have not run the mapping through "verify_upmap".

Now, I'll save the results so we can use the `--upmap-cleanup` option, which will trigger "verify_upmap". This is the function where we see the error message "verify_upmap number of buckets 9 exceeds desired 3".

First, I copy "osdmap.bad.bin" into a new file, "osdmap.bad-save.bin".

cp ~/osdmap.bad.bin ~/osdmap.bad-save.bin

Next, I modify the script a bit so we're actually saving the results this time:

#!/bin/bash

while :
do
    ./bin/osdmaptool ~/osdmap.bad-save.bin --upmap out.txt --upmap-pool ectest --upmap-deviation 1 --save
    if cat out.txt | grep "2.1f" 
    then
        break
    fi
done

We hit the desired mapping:

[lflores@folio01 build]$ ~/bad_map_save.sh
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
 limiting to pools ectest ([2])
pools ectest 
prepared 10/10 changes
./bin/osdmaptool: writing epoch 397 to /home/lflores/osdmap.bad-save.bin
ceph osd pg-upmap-items 2.1f 20 19

If I print out `osdmap.bad-save.bin`, it will have saved the mapping. I've removed some details here in case they might be sensitive (although the map is attached to the tracker):

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.bad-save.bin --print
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'

.....osdmap details....

pg_upmap_items 2.4 [11,14]
pg_upmap_items 2.7 [8,14]
pg_upmap_items 2.8 [0,3]
pg_upmap_items 2.c [21,16]
pg_upmap_items 2.d [5,4]
pg_upmap_items 2.e [8,13]
pg_upmap_items 2.17 [21,17]
pg_upmap_items 2.19 [11,9,21,16]
pg_upmap_items 2.1f [20,19] -------------------> here the mapping is saved
pg_temp 1.0 [8,3,7]

.... more osdmap details ...

Now, I'll run `--upmap-cleanup` to trigger the verify_upmap error message:

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.bad-save.bin --upmap-cleanup out.txt
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.bad-save.bin'
writing upmap command output to: out.txt
checking for upmap cleanups
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.4 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.7 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.8 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.c returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.d returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.e returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.17 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.19 returning -22
2022-11-08T19:16:33.797+0000 7fb80681a100 -1 verify_upmap number of buckets 3 exceeds desired 1
2022-11-08T19:16:33.797+0000 7fb80681a100  0 check_pg_upmaps verify_upmap of pg 2.1f returning -22

And here, it gave us commands to remove mappings that it has deemed "invalid":

ceph osd rm-pg-upmap-items 2.4
ceph osd rm-pg-upmap-items 2.7
ceph osd rm-pg-upmap-items 2.8
ceph osd rm-pg-upmap-items 2.c
ceph osd rm-pg-upmap-items 2.d
ceph osd rm-pg-upmap-items 2.e
ceph osd rm-pg-upmap-items 2.17
ceph osd rm-pg-upmap-items 2.19
ceph osd rm-pg-upmap-items 2.1f


Now, I'm going to try this with "osdmap.good.bin".

If we dump "osdmap.good.bin", we can see that the desired mapping is already there, so there's no need for me to create a "saved" version:

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.good.bin --dump
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.good.bin'
epoch 398

.... osdmap details ....

pg_upmap_items 2.1f [20,19] ---> here is the mapping
pg_temp 1.0 [8,3,7]

.... more osdmap details ....

Now, I'll try running "--upmap-cleanup" on this map to see if the same "verify_upmap" error message appears. It does not:

[lflores@folio01 build]$ ./bin/osdmaptool ~/osdmap.good.bin --upmap-cleanup out.txt
./bin/osdmaptool: osdmap file '/home/lflores/osdmap.good.bin'
writing upmap command output to: out.txt
checking for upmap cleanups

And the "out.txt" file is completely empty since "verify_upmaps" did not see any problems with the mapping. If it did, it would have suggested we remove the mapping with "ceph osd rm-pg-upmap-items 2.1f" as it did with the bad map.

I know some of this was redundant, but reproducing the issue means one step closer to me understanding if there's something wrong in verify_upmaps.

Actions #14

Updated by Chris Durham over 1 year ago

Thanks again for looking at this.

I haven't looked further, but I suspect the issue will come down to the variable 'numrep' in CrushWrapper::verify_upmap(...) that was posted above (included below for reference).
'numrep' is somehow not being set properly somewhere else for a 3+ level EC rule.

   case CRUSH_RULE_CHOOSE_INDEP:
...
        if ((int)parents_of_type.size() > numrep) {
          lderr(cct) << __func__ << " number of buckets " 
                     << parents_of_type.size() << " exceeds desired " << numrep
                     << dendl;
          return -EINVAL;
        }
...
Actions #15

Updated by Radoslaw Zarzynski over 1 year ago

Thanks for formulating the hypothesis!

Just updating to keep this ticket in the front of the tracker.

Actions #16

Updated by Laura Flores over 1 year ago

Update on this tracker: I am discussing this scenario with Josh Salomon, who is very knowledgeable about balancing. He and I recently did some work on the verify_upmap() function, so I want to go over some things with him to reason about where an issue might be in that function.

Actions #17

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to In Progress
Actions #18

Updated by Chris Durham about 1 year ago

Is there any news on this? Thanks.

Actions #19

Updated by Laura Flores about 1 year ago

Hi Chris, yes, I will post another update soon with my findings.

Actions #20

Updated by Chris Durham 12 months ago

Laura Flores wrote:

Hi Chris, yes, I will post another update soon with my findings.

Pinging for updates....

Actions #21

Updated by Laura Flores 12 months ago

Hi Chris, this issue was actually discussed at Cephalocon. Looking at the verify_upmap code, it seems that we may need to add another case to handle 3+ level crush rules.

I am working on a solution. When I have something, I will update the tracker with the PR so you and others involved can take a look.

Actions #22

Updated by Laura Flores 12 months ago

I've landed on a potential fix for this problem. After evaluating the examples everyone provided and checking the verify_upmap logic, it looks like we need to multiply the "numrep" variable by the numrep of the parent choose step.

If we look at this example crush rule, the current logic causes valid upmaps to fail with "verify_upmap number of buckets 3 exceeds desired 1": the check treats the chassis step's numrep of 1 as if we wanted only 1 chassis in total across the 3 racks.

rule ectest {
id 2
type erasure
min_size 3
max_size 3
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default_test
step choose indep 3 type rack
step choose indep 1 type chassis
step chooseleaf indep 1 type host
step emit
}

What we really want is to choose 3 racks, then 1 chassis from each rack, i.e. 3 × 1 = 3 chassis in total, so the expected bucket count at the chassis step should be 3, not 1.

This is roughly what the fix will look like:

diff --git a/src/crush/CrushWrapper.cc b/src/crush/CrushWrapper.cc
index 530c2413933..dc75f00ba16 100644
--- a/src/crush/CrushWrapper.cc
+++ b/src/crush/CrushWrapper.cc
@@ -908,6 +908,7 @@ int CrushWrapper::verify_upmap(CephContext *cct,
   }
   int root_bucket = 0;
   int cursor = 0;
+  int curr_choose_numrep = 1;
   std::map<int, int> type_stack;
   for (unsigned step = 0; step < rule->len; ++step) {
     auto curstep = &rule->steps[step];
@@ -953,7 +954,8 @@ int CrushWrapper::verify_upmap(CephContext *cct,
     case CRUSH_RULE_CHOOSE_FIRSTN:
     case CRUSH_RULE_CHOOSE_INDEP:
       {
-        int numrep = curstep->arg1;
+        int numrep = curr_choose_numrep * curstep->arg1;
+        curr_choose_numrep = numrep;
         int type = curstep->arg2;
         if (numrep <= 0)
           numrep += pool_size;

Also to note: this may be a small change, but it will need to be well tested. I am in the process of designing a unit test around this.
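
If the fix behaves as intended, re-running the cleanup from Actions #13 against the saved bad map should no longer flag these PGs; a sketch reusing the same filenames:

./bin/osdmaptool ~/osdmap.bad-save.bin --upmap-cleanup out.txt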

Actions #23

Updated by Laura Flores 11 months ago

Hi Chris,

If possible, can you try this change with your crush rule?
https://github.com/ceph/ceph/compare/main...ljflores:ceph:wip-tracker-51729?expand=1

It would help to get some early feedback on the fix as I refine the unit test.

Actions #24

Updated by Andras Pataki 3 months ago

Hi Laura,

I'm returning to this issue since not being able to use upmaps in our cluster has been causing some headaches. I've experimented with the patch above and needed to modify it, mainly because the map type_stack seems to expect the per-level (not cumulative) numreps (they get multiplied up later in that function). My current patch is:

diff -up ./src/crush/CrushWrapper.cc.ORIG ./src/crush/CrushWrapper.cc
--- ./src/crush/CrushWrapper.cc.ORIG    2022-08-09 13:07:01.000000000 -0400
+++ ./src/crush/CrushWrapper.cc    2024-01-17 11:03:54.139061612 -0500
@@ -933,6 +933,7 @@ int CrushWrapper::verify_upmap(CephConte
   }
   int root_bucket = 0;
   int cursor = 0;
+  int total_numrep = 1;
   std::map<int, int> type_stack;
   for (unsigned step = 0; step < rule->len; ++step) {
     auto curstep = &rule->steps[step];
@@ -950,6 +951,7 @@ int CrushWrapper::verify_upmap(CephConte
         int type = curstep->arg2;
         if (numrep <= 0)
           numrep += pool_size;
+        total_numrep *= numrep;
         type_stack.emplace(type, numrep);
         if (type == 0) // osd
           break;
@@ -982,6 +984,7 @@ int CrushWrapper::verify_upmap(CephConte
         int type = curstep->arg2;
         if (numrep <= 0)
           numrep += pool_size;
+        total_numrep *= numrep;
         type_stack.emplace(type, numrep);
         if (type == 0) // osd
           break;
@@ -996,9 +999,10 @@ int CrushWrapper::verify_upmap(CephConte
                           << dendl;
           }
         }
-        if ((int)parents_of_type.size() > numrep) {
+        if ((int)parents_of_type.size() > total_numrep) {
           lderr(cct) << __func__ << " number of buckets " 
                      << parents_of_type.size() << " exceeds desired " << numrep
+             << " in step " << step
                      << dendl;
           return -EINVAL;
         }


This works for my crush rule case:
rule disk-ec63-rule {
        id 2
        type erasure
        min_size 3
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take root-disk
        step choose indep 3 type pod
        step choose indep 3 type rack
        step chooseleaf indep 1 type osd
        step emit
}

Can you have a look and give me any comments?

Thanks,

Andras

Actions #25

Updated by Laura Flores 3 months ago

Hi Andras,

Thanks for testing the patch with your test scenario. I will take a look at your additions to the patch.
