
Bug #57796 » upmap.txt

Chris Durham, 10/10/2022 06:30 PM

 

I am using pacific 16.2.10 on Rocky 8.6 Linux.

After setting max_deviations to 1 on the ceph balancer, I achieved a near perfect balance of PGs and space on my OSDs. This is great.

However, I started getting the following errors in my ceph-mon logs, every three minutes, for each of the OSDs that had PGs mapped by the balancer:

2022-10-07T17:10:39.619+0000 7f7c2786d700 1 verify_upmap unable to get parent of osd.497, skipping for now

After banging my head against the wall for a bit trying to figure this out, I think I have discovered the issue:

Currently, I have my EC pool, spread across 480 OSDs, configured with the following rule:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step choose indep 2 type pod
    step chooseleaf indep 1 type host
    step emit
}
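
I believe the same rule can also be dumped as JSON, straight from the cluster, with:

ceph osd crush rule dump mypoolname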

Basically: pick 4 racks, then 2 pods in each rack, and then one host in each pod, for a total of
8 chunks. (The pool is a 6+2.) The 4 racks are chosen from the myroot root entry, which is as follows:


root myroot {
    id -400
    item rack1 weight N
    item rack2 weight N
    item rack3 weight N
    item rack4 weight N
}

This has worked fine since inception, over a year ago.

The verify_upmap errors above started after I set max_deviations to 1 in the balancer and let it
move things around, creating pg_upmap entries.
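
For reference, this is roughly how I had configured the balancer (the exact option name is from memory, so treat it as approximate):

ceph balancer mode upmap
ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph balancer on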

I then discovered, while trying to figure this out, that the CRUSH types are:

type 0 osd
type 1 host
type 2 chassis
type 3 rack
...
type 6 pod
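
I pulled these out of the crush map itself; I believe they can also be listed with something like the following, assuming jq is available:

ceph osd crush dump | jq '.types'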

So pod is HIGHER in the hierarchy than rack, but I have it lower in my rule.

What I want to do is remove the pods completely to work around this. Something like:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type host
    step emit
}

This will pick 4 racks and then 2 hosts in each rack. Will this cause any problems? I can add the pod stuff back later as 'chassis' instead. I can live without the 'pod' separation if needed.

To test this, I tried the following:

1. Grab the osdmap:
ceph osd getmap -o /tmp/om
2. Pull out the crushmap:
osdmaptool /tmp/om --export-crush /tmp/crush.bin
3. Convert it to text:
crushtool -d /tmp/crush.bin -o /tmp/crush.txt

I then edited the rule for this pool as above, removing the pod step and going directly
to 4 racks and then 2 hosts in each rack. I then compiled the crush map
and imported it into the extracted osdmap:

crushtool -c /tmp/crush.txt -o /tmp/crush.bin
osdmaptool /tmp/om --import-crush /tmp/crush.bin
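
I believe the edited crush map can also be sanity-checked on its own with crushtool's test mode, using whatever rule id the decompiled map shows for mypoolname (rule_id below is a placeholder):

crushtool -i /tmp/crush.bin --test --rule <rule_id> --num-rep 8 --show-mappings
crushtool -i /tmp/crush.bin --test --rule <rule_id> --num-rep 8 --show-bad-mappings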

I then ran upmap-cleanup on the new osdmap:

osdmaptool /tmp/om --upmap-cleanup
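
If I am reading the osdmaptool docs right, the generated cleanup commands can also be written out to a file for review, something like:

osdmaptool /tmp/om --upmap-cleanup /tmp/upmap-cleanup.txt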

I did NOT get any of the verify_upmap messages. When I extracted the osdmap WITHOUT
any changes to it and then ran the upmap-cleanup, I got the same verify_upmap errors I am now
seeing in the ceph-mon logs.

So, should I just change the crushmap to remove the wrong rack->pod->host hierarchy, making it rack->host?
Will I have other issues? I am surprised that crush allowed me to create this out-of-order rule to begin with.
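
If simply fixing the hierarchy is the right answer, I assume I would push the edited crush map to the live cluster with something like:

ceph osd setcrushmap -i /tmp/crush.bin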

Thanks for any suggestions.



