Bug #47361
invalid upmap not getting cleaned
Description
In v14.2.11 we have some invalid upmaps which don't get cleaned (and I presume they were created by the balancer).
Here is the pool:
pool 11 'rbd_ec_data' erasure size 4 min_size 3 crush_rule 6 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 16359 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 8192 target_size_ratio -1 application rbd
The ec profile:
crush-device-class= crush-failure-domain=pod crush-root=default jerasure-per-chunk-alignment=false k=2 m=2 plugin=jerasure technique=reed_sol_van w=8
The crush rule:
{
    "rule_id": 6,
    "rule_name": "volumes_ec_k2m2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 4,
    "steps": [
        { "op": "set_chooseleaf_tries", "num": 5 },
        { "op": "set_choose_tries", "num": 100 },
        { "op": "take", "item": -1, "item_name": "default" },
        { "op": "chooseleaf_indep", "num": 0, "type": "pod" },
        { "op": "emit" }
    ]
}
And a PG breaking the rule:
# ceph pg 11.2ed query | jq .up
[ 34, 0, 70, 6 ]
# ceph osd find 0 | jq .crush_location.pod
"DL7873990-253884"
# ceph osd find 6 | jq .crush_location.pod
"DL7873990-253884"
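The violation can be checked mechanically: with crush-failure-domain=pod, every OSD in a PG's up set must live in a distinct pod. A minimal sketch of that check — the pods for osd.0 and osd.6 come from the `ceph osd find` output above; the pods for osd.34 and osd.70 are placeholders:

```python
# Sketch of the invariant the crush rule enforces: no failure domain
# (pod) may hold more than one chunk of an EC PG.
from collections import Counter

def duplicated_failure_domains(up_set, osd_to_pod):
    """Return failure domains that host more than one chunk of the PG."""
    counts = Counter(osd_to_pod[osd] for osd in up_set)
    return sorted(pod for pod, n in counts.items() if n > 1)

osd_to_pod = {
    34: "pod-A",                 # placeholder
    0:  "DL7873990-253884",      # from `ceph osd find 0`
    70: "pod-B",                 # placeholder
    6:  "DL7873990-253884",      # from `ceph osd find 6`
}
print(duplicated_failure_domains([34, 0, 70, 6], osd_to_pod))
# -> ['DL7873990-253884']
```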
History
#1 Updated by Dan van der Ster over 3 years ago
We deleted all the pg_upmap_items and let the balancer start again. It created bad upmap rules again in the first iteration:
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.ba mappings [{'to': 0L, 'from': 74L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.111 mappings [{'to': 0L, 'from': 68L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.1a0 mappings [{'to': 0L, 'from': 37L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.253 mappings [{'to': 0L, 'from': 23L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.2e8 mappings [{'to': 0L, 'from': 6L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.33c mappings [{'to': 0L, 'from': 6L}]
2020-09-08 15:32:07.664 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.3c2 mappings [{'to': 0L, 'from': 92L}]
2020-09-08 15:32:07.664 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.3ff mappings [{'to': 0L, 'from': 59L}]
# ceph pg 11.3ff query | jq .up
[ 93, 0, 68, 6 ]
# ceph osd find 6 | jq .crush_location.pod
"DL7873990-253884"
# ceph osd find 0 | jq .crush_location.pod
"DL7873990-253884"
#2 Updated by Dan van der Ster over 3 years ago
- File osd.map added
- Affected Versions v16.0.0 added
- Affected Versions deleted (v14.2.11)
This still seems to be broken in master.
osdmap is attached.
#3 Updated by Dan van der Ster over 3 years ago
- Affected Versions v14.2.11, v15.2.4 added
- Component(RADOS) CRUSH added
#4 Updated by Dan van der Ster over 3 years ago
As a workaround for our cluster operations, I removed the unused "rack" level from our osd tree, and now the upmaps generated by the balancer are all valid.
Before we had:
 -1 335.25000 root default
 -5 335.25000     room 0513-R-0060
-14 111.75000         rack BE10
-61  55.87500             pod DL7873990-253885
-13  13.96875                 host i78739906410716
-37  13.96875                 host i78739906472540
-29  13.96875                 host i78739907719701
-51  13.96875                 host i78739909416742
-62  55.87500             pod DL7873990-253968
-27  13.96875                 host i78739902265430
-33  13.96875                 host i78739903943360
-41  13.96875                 host i78739907758338
-25  13.96875                 host i78739908656183
 -4 111.75000         rack BE11
-65  55.87500             pod DL7873990-253884
-11  13.96875                 host i78739906505598
-45  13.96875                 host i78739906777113
-55  13.96875                 host i78739907036976
 -3  13.96875                 host i78739909344294
-67  55.87500             pod DL7873990-253887
 -9  13.96875                 host i78739900279212
-17  13.96875                 host i78739904028937
-23  13.96875                 host i78739906726418
-39  13.96875                 host i78739908512467
-20 111.75000         rack BE13
-59  55.87500             pod DL7873990-253883
-43  13.96875                 host i78739903380336
-53  13.96875                 host i78739903459223
-47  13.96875                 host i78739906460270
-35  13.96875                 host i78739908429178
-69  55.87500             pod DL7873990-253886
-19  13.96875                 host i78739904021991
-31  13.96875                 host i78739905002334
-57  13.96875                 host i78739907517004
-49  13.96875                 host i78739909387898
and now we have
 -1 335.25000 root default
 -5 335.25000     room 0513-R-0060
-59  55.87500         pod DL7873990-253883
-43  13.96875             host i78739903380336
-53  13.96875             host i78739903459223
-47  13.96875             host i78739906460270
-35  13.96875             host i78739908429178
-65  55.87500         pod DL7873990-253884
-11  13.96875             host i78739906505598
-45  13.96875             host i78739906777113
-55  13.96875             host i78739907036976
 -3  13.96875             host i78739909344294
-61  55.87500         pod DL7873990-253885
-13  13.96875             host i78739906410716
-37  13.96875             host i78739906472540
-29  13.96875             host i78739907719701
-51  13.96875             host i78739909416742
-69  55.87500         pod DL7873990-253886
-19  13.96875             host i78739904021991
-31  13.96875             host i78739905002334
-57  13.96875             host i78739907517004
-49  13.96875             host i78739909387898
-67  55.87500         pod DL7873990-253887
 -9  13.96875             host i78739900279212
-17  13.96875             host i78739904028937
-23  13.96875             host i78739906726418
-39  13.96875             host i78739908512467
-62  55.87500         pod DL7873990-253968
-27  13.96875             host i78739902265430
-33  13.96875             host i78739903943360
-41  13.96875             host i78739907758338
-25  13.96875             host i78739908656183
So it seems that grouping failure domains into higher-level buckets that are not used in the crush rule breaks upmap.
#5 Updated by Neha Ojha over 3 years ago
- Assignee set to David Zafman
#6 Updated by David Zafman over 3 years ago
- Status changed from New to Rejected
I diagnosed this issue by running the following against the supplied osdmap:
CEPH_ARGS=" --debug_osd=30" osdmaptool --upmap-cleanup - osd.map
This problem is caused by a crush map error:
{ "type_id": 3, "name": "rack" },
{ "type_id": 4, "name": "row" },
{ "type_id": 5, "name": "pdu" },
{ "type_id": 6, "name": "pod" },
A "rack" cannot have a type_id lower than "pod", since it sits higher in the hierarchy. The crush code searches for the OSDs under buckets of the "pod" type (type_id 6); when it reached "rack" and saw type_id 3, it simply stopped looking. Internally, the following log message was generated for every OSD:
2020-09-21T16:54:01.471-0700 7faec3060c40 1 verify_upmap unable to get parent of osd.57, skipping for now
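The failure mode described above can be illustrated with a toy model (this is not the actual CrushWrapper code): collect all buckets of the failure-domain type by descending from the root, pruning any subtree whose root already has a type_id below the target, on the assumption that type ids only shrink toward the leaves. With "rack" mis-numbered as 3, the pods below it are never found. The rack=3 and pod=6 ids come from the crush map above; the root and room type ids (10 and 7) and the miniature tree shape are assumptions.

```python
# Toy model of the search described above: find buckets of the
# failure-domain type top-down, pruning subtrees whose root's type_id
# is already below the target. The pruning assumes type ids strictly
# shrink toward the leaves, which the mis-numbered "rack" violates.
def buckets_of_type(node, children, bucket_type, target_type_id):
    t = bucket_type[node]
    if t == target_type_id:
        return [node]                 # found a failure-domain bucket
    if t < target_type_id:
        return []                     # prune: assume none of this type below
    found = []
    for child in children.get(node, []):
        found += buckets_of_type(child, children, bucket_type, target_type_id)
    return found

# Miniature of the broken tree: root -> room -> rack(3) -> pod(6) -> host(1).
bucket_type = {-1: 10, -5: 7, -14: 3, -61: 6, -13: 1}
broken = {-1: [-5], -5: [-14], -14: [-61], -61: [-13]}
print(buckets_of_type(-1, broken, bucket_type, 6))   # -> []  (rack prunes the pod)

# After removing the "rack" level, as in the workaround from note #4:
fixed = {-1: [-5], -5: [-61], -61: [-13]}
print(buckets_of_type(-1, fixed, bucket_type, 6))    # -> [-61]
```

This matches the observed behavior: with the rack level present the search finds no pods (hence "unable to get parent"), and removing the rack level makes the pods visible again.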