Bug #48065
Status: closed
"ceph osd crush set|reweight-subtree" commands do not set weight on device class subtree
Description
We noticed that if one sets an osd crush weight using the command
ceph osd crush set $id $weight host=$host
it updates the osd weight in the $host bucket, but does not update it in the device class bucket (${host}~hdd or ${host}~ssd), and as a result the old weight is still used until one runs `ceph osd crush reweight-all` or makes some other change that causes a crushmap recalculation.
The same behavior applies to the `ceph osd crush reweight-subtree <name> <weight>` command.
At the moment I am not sure if this is a bug; I would just like to report it for discussion. The current behavior might be acceptable if there were a way to set the desired weight on a device class subtree directly, but I don't know of one. When I try the above commands with host=${host}~ssd they complain about the invalid character "~" in the bucket name.
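For context, CRUSH keeps a per-device-class "shadow" bucket (e.g. ${host}~ssd) alongside each base bucket, and class-filtered rules consult the shadow copy. The following is a minimal illustrative Python model of the reported behavior (simplified assumptions, not Ceph code): updating an item's weight only in the base bucket leaves the shadow bucket stale.

```python
# Illustrative model of a CRUSH base bucket vs. its device-class
# "shadow" bucket. Names and structure are simplified assumptions,
# not actual Ceph internals.

class Bucket:
    def __init__(self, name):
        self.name = name
        self.items = {}  # osd id -> crush weight

    def set_item_weight(self, osd_id, weight):
        self.items[osd_id] = weight

    def weight(self):
        # A bucket's weight is the sum of its item weights.
        return sum(self.items.values())

# Base bucket and its per-class shadow bucket list the same osds.
host = Bucket("adonis")
shadow = Bucket("adonis~ssd")
for osd in (0, 1, 2):
    host.set_item_weight(osd, 0.09859)
    shadow.set_item_weight(osd, 0.09859)

# "ceph osd crush set 0 0.666 host=adonis" updates only the base bucket:
host.set_item_weight(0, 0.666)

print(host.items[0])    # 0.666
print(shadow.items[0])  # 0.09859 -- stale, as described above
```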
Updated by Neha Ojha over 3 years ago
- Status changed from New to Need More Info
- Priority changed from Normal to High
This does sound like a bug. Can you please share the osdmap?
Updated by Mykola Golub over 3 years ago
Actually, the problem with the weight not being updated on the class subtree is easily reproducible on a vstart cluster (see the details below). But it turns out that on my vstart cluster the problem looks rather cosmetic: although the weight is not updated on the class subtree (the adonis~ssd bucket in my case), the new weight is actually used to distribute pgs (according to `ceph osd df`).
This is not exactly what we observed for our customer (running nautilus). In their case they were redeploying osds (two hosts at once) with "osd crush initial weight = 0" in the config. They then used the "ceph osd crush set|reweight-subtree" commands to set a non-zero weight, but observed that the osds were still not used. Only after they made some modifications to the crush map (redeployed other osds, or just created/deleted a fake bucket in the crush map) did the osds start to be used.

Since we noticed that the "ceph osd crush set|reweight-subtree" commands did not change the weight on the class subtree, we decided that this was why the osds were not used in their case, and we recommended that they use the "ceph osd crush reweight" command, which properly updates the weight on all subtrees. Unfortunately they do not need to redeploy in the near future, so we cannot verify this at the moment. And as I have failed to reproduce the case locally so far, I am not entirely sure the problem was due to the weight not being updated on the class subtree. I am going to dig into this further and will ask the customer if it is ok to share details about their cluster (osdmap) here.
Steps to reproduce on a vstart cluster:
    adonis:~/ceph/ceph/build% ../src/vstart.sh -n ...
    adonis:~/ceph/ceph/build% ceph osd tree
    ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
    -1       0.29576 root default
    -3       0.29576     host adonis
     0   ssd 0.09859         osd.0       up  1.00000 1.00000
     1   ssd 0.09859         osd.1       up  1.00000 1.00000
     2   ssd 0.09859         osd.2       up  1.00000 1.00000
    adonis:~/ceph/ceph/build% ceph osd crush set 0 0.666 host=adonis
    set item id 0 name 'osd.0' weight 0.666 at location {host=adonis} to crush map
    adonis:~/ceph/ceph/build% ceph osd tree
    ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
    -1       0.86316 root default
    -3       0.86316     host adonis
     0   ssd 0.66599         osd.0       up  1.00000 1.00000
     1   ssd 0.09859         osd.1       up  1.00000 1.00000
     2   ssd 0.09859         osd.2       up  1.00000 1.00000
    adonis:~/ceph/ceph/build% ceph osd crush dump
    {
        ...
        "buckets": [
            ...
            {
                "id": -3,
                "name": "adonis",
                "type_id": 1,
                "type_name": "host",
                "weight": 56568,
                "alg": "straw2",
                "hash": "rjenkins1",
                "items": [
                    { "id": 0, "weight": 43646, "pos": 0 },
                    { "id": 1, "weight": 6461, "pos": 1 },
                    { "id": 2, "weight": 6461, "pos": 2 }
                ]
            },
            {
                "id": -4,
                "name": "adonis~ssd",
                "type_id": 1,
                "type_name": "host",
                "weight": 19383,
                "alg": "straw2",
                "hash": "rjenkins1",
                "items": [
                    { "id": 0, "weight": 6461, "pos": 0 },
                    { "id": 1, "weight": 6461, "pos": 1 },
                    { "id": 2, "weight": 6461, "pos": 2 }
                ]
            }
        ],
Note that the weight for the item id=0 is updated in the "adonis" bucket but not in the "adonis~ssd" bucket.
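As a sanity check on the dump above: the integer item weights are CRUSH's 16.16 fixed-point encoding (the floating-point weight multiplied by 0x10000 and truncated), and bucket weights are the sums of their items. A quick verification against the values in the dump:

```python
# CRUSH stores bucket item weights as 16.16 fixed-point integers:
# floating-point weight * 0x10000, truncated.

def to_crush_weight(w):
    return int(w * 0x10000)

print(to_crush_weight(0.666))    # 43646 -> osd.0 in the "adonis" bucket
print(to_crush_weight(0.09859))  # 6461  -> osd.1 / osd.2

# Bucket weights are the sums of their item weights:
print(43646 + 6461 + 6461)       # 56568 -> "adonis" bucket weight
print(6461 * 3)                  # 19383 -> stale "adonis~ssd" weight
```

This confirms that the adonis~ssd bucket still carries three items of the old 0.09859 weight while the base bucket already holds the new 0.666.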
Updated by Mykola Golub over 3 years ago
- File 48065.tar.gz 48065.tar.gz added
I have eventually got approval from the customer to publish their data.

I have attached a tarball that includes `ceph report`, `ceph osd dump`, and `ceph osd df tree` output, collected after two nodes (carl and hugo) had been redeployed. It also includes an extract from the audit log listing the operations executed during the redeploy.

The osds were deployed with an initial weight of 0 (osd_crush_initial_weight=0 in ceph.conf), and after all osds had been deployed their crush weight was updated to the target value with the `ceph osd crush set` command. From `ceph osd df tree` you can see that, although it reports a non-zero weight for these osds, they are not used. And in the crush map (found in the `ceph report` output) you can see the expected non-zero weight for the osds in the "carl" and "hugo" buckets but a zero weight in the "carl~hdd" and "hugo~hdd" buckets.

Eventually, after some other hosts were redeployed, the weights in the "carl~hdd" and "hugo~hdd" buckets were updated and these osds started to be used all right (though I don't claim that the second was a consequence of the first, because I was not able to reproduce the situation on a simple test setup).
ceph version 14.2.10-408-gdd63475ce0
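To make the expected behavior concrete, here is a minimal sketch (my own illustration with made-up bucket contents and weights, not the actual Ceph patch) of what keeping the trees consistent would look like: when an item's weight changes in a base bucket, the change is mirrored into every "<base>~<class>" shadow bucket that holds the same item.

```python
# Sketch of mirroring a base-bucket weight change into the matching
# device-class shadow buckets, so class-filtered rules see the new
# weight too. Hypothetical names and weights for illustration only.

def set_item_weight(buckets, base_name, osd_id, weight):
    # Update the item in the base bucket.
    buckets[base_name][osd_id] = weight
    # Shadow buckets are named "<base>~<class>"; update any that
    # contain the same osd.
    for name, items in buckets.items():
        if name.startswith(base_name + "~") and osd_id in items:
            items[osd_id] = weight

# Freshly deployed osds with osd_crush_initial_weight=0:
buckets = {
    "carl":     {3: 0.0, 4: 0.0},
    "carl~hdd": {3: 0.0, 4: 0.0},
}

set_item_weight(buckets, "carl", 3, 3.63869)

print(buckets["carl"][3])      # 3.63869
print(buckets["carl~hdd"][3])  # 3.63869 -- shadow stays in sync
```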
Updated by Mykola Golub over 3 years ago
- Status changed from Need More Info to New
Updated by Neha Ojha about 3 years ago
- Assignee set to Sage Weil
- Priority changed from High to Urgent
Updated by Sage Weil about 3 years ago
- Status changed from New to Fix Under Review
- Backport set to pacific,octopus,nautilus
- Pull request ID set to 39629
Updated by Sage Weil about 3 years ago
BTW Mykola I would suggest using 'ceph osd crush reweight osd.N' (which works fine already) instead of the 'ceph osd crush set ...' syntax (which is harder to use and suffers from this bug)
Updated by Mykola Golub about 3 years ago
Sage Weil wrote:
BTW Mykola I would suggest using 'ceph osd crush reweight osd.N' (which works fine already) instead of the 'ceph osd crush set ...' syntax (which is harder to use and suffers from this bug)
Yes, that is what we recommended to the customer, as I wrote in comment #2. And that is why I was not sure it was a bug for those low-level commands. Thank you for fixing this!
Updated by Sage Weil about 3 years ago
- Status changed from Fix Under Review to Pending Backport
pacific backport: https://github.com/ceph/ceph/pull/39736
Updated by Backport Bot about 3 years ago
- Copied to Backport #49528: pacific: "ceph osd crush set|reweight-subtree" commands do not set weight on device class subtree added
Updated by Backport Bot about 3 years ago
- Copied to Backport #49529: nautilus: "ceph osd crush set|reweight-subtree" commands do not set weight on device class subtree added
Updated by Backport Bot about 3 years ago
- Copied to Backport #49530: octopus: "ceph osd crush set|reweight-subtree" commands do not set weight on device class subtree added
Updated by Loïc Dachary about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".