Bug #47361

invalid upmap not getting cleaned

Added by Dan van der Ster over 3 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: Yes
Severity: 2 - major
Reviewed:
ceph-qa-suite:
Component(RADOS): CRUSH
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In v14.2.11 we have some invalid upmaps which don't get cleaned up. (I presume they were created by the balancer.)

Here is the pool:

pool 11 'rbd_ec_data' erasure size 4 min_size 3 crush_rule 6 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 16359 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 8192 target_size_ratio -1 application rbd

The ec profile:
crush-device-class=
crush-failure-domain=pod
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=2
plugin=jerasure
technique=reed_sol_van
w=8

The crush rule:
    {
        "rule_id": 6,
        "rule_name": "volumes_ec_k2m2",
        "ruleset": 6,
        "type": 3,
        "min_size": 3,
        "max_size": 4,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default" 
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "pod" 
            },
            {
                "op": "emit" 
            }
        ]
    }
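
For reference, a profile and rule of this shape would normally be created with something along these lines (a sketch only; the erasure-code-profile name volumes_ec_k2m2 is an assumption chosen to match the rule name, it is not shown in the outputs above):

# EC profile shown above: k=2, m=2, failure domain = pod
ceph osd erasure-code-profile set volumes_ec_k2m2 \
    k=2 m=2 plugin=jerasure technique=reed_sol_van \
    crush-root=default crush-failure-domain=pod

# matching erasure rule and a pool like pool 11 using it
ceph osd crush rule create-erasure volumes_ec_k2m2 volumes_ec_k2m2
ceph osd pool create rbd_ec_data 1024 1024 erasure volumes_ec_k2m2 volumes_ec_k2m2
ceph osd pool set rbd_ec_data allow_ec_overwrites true
ceph osd pool application enable rbd_ec_data rbd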

And a PG breaking the rule:
# ceph pg 11.2ed query | jq .up
[
  34,
  0,
  70,
  6
]
# ceph osd find 0 | jq .crush_location.pod
"DL7873990-253884" 
# ceph osd find 6 | jq .crush_location.pod
"DL7873990-253884" 

osd.map (47.4 KB) Dan van der Ster, 09/09/2020 10:56 AM

History

#1 Updated by Dan van der Ster over 3 years ago

We deleted all the pg_upmap_items and let the balancer start again. It created bad upmap rules again in the first iteration:

2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.ba mappings [{'to': 0L, 'from': 74L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.111 mappings [{'to': 0L, 'from': 68L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.1a0 mappings [{'to': 0L, 'from': 37L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.253 mappings [{'to': 0L, 'from': 23L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.2e8 mappings [{'to': 0L, 'from': 6L}]
2020-09-08 15:32:07.663 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.33c mappings [{'to': 0L, 'from': 6L}]
2020-09-08 15:32:07.664 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.3c2 mappings [{'to': 0L, 'from': 92L}]
2020-09-08 15:32:07.664 7f7cc50ec700  4 mgr[balancer] ceph osd pg-upmap-items 11.3ff mappings [{'to': 0L, 'from': 59L}]

# ceph pg 11.3ff query | jq .up
[
  93,
  0,
  68,
  6
]
# ceph osd find 6 | jq .crush_location.pod
"DL7873990-253884" 
# ceph osd find 0 | jq .crush_location.pod
"DL7873990-253884" 

#2 Updated by Dan van der Ster over 3 years ago

  • File osd.map added
  • Affected Versions v16.0.0 added
  • Affected Versions deleted (v14.2.11)

This still seems to be broken in master.

osdmap is attached.

#3 Updated by Dan van der Ster over 3 years ago

  • Affected Versions v14.2.11, v15.2.4 added
  • Component(RADOS) CRUSH added

#4 Updated by Dan van der Ster over 3 years ago

As a workaround for our cluster operations I have removed the unused "rack" level from our osd tree, and now the upmaps generated by the balancer are all valid.

Before we had:

 -1       335.25000 root default                                                 
 -5       335.25000     room 0513-R-0060                                         
-14       111.75000         rack BE10                                            
-61        55.87500             pod DL7873990-253885                             
-13        13.96875                 host i78739906410716                         
-37        13.96875                 host i78739906472540                         
-29        13.96875                 host i78739907719701                         
-51        13.96875                 host i78739909416742                         
-62        55.87500             pod DL7873990-253968                             
-27        13.96875                 host i78739902265430                         
-33        13.96875                 host i78739903943360                         
-41        13.96875                 host i78739907758338                         
-25        13.96875                 host i78739908656183                         
 -4       111.75000         rack BE11                                            
-65        55.87500             pod DL7873990-253884                             
-11        13.96875                 host i78739906505598                         
-45        13.96875                 host i78739906777113                         
-55        13.96875                 host i78739907036976                         
 -3        13.96875                 host i78739909344294                         
-67        55.87500             pod DL7873990-253887                             
 -9        13.96875                 host i78739900279212                         
-17        13.96875                 host i78739904028937                         
-23        13.96875                 host i78739906726418                         
-39        13.96875                 host i78739908512467                         
-20       111.75000         rack BE13                                            
-59        55.87500             pod DL7873990-253883                             
-43        13.96875                 host i78739903380336                         
-53        13.96875                 host i78739903459223                         
-47        13.96875                 host i78739906460270                         
-35        13.96875                 host i78739908429178                         
-69        55.87500             pod DL7873990-253886                             
-19        13.96875                 host i78739904021991                         
-31        13.96875                 host i78739905002334                         
-57        13.96875                 host i78739907517004                         
-49        13.96875                 host i78739909387898                         

and now we have

 -1       335.25000 root default                                             
 -5       335.25000     room 0513-R-0060                                     
-59        55.87500         pod DL7873990-253883                             
-43        13.96875             host i78739903380336                         
-53        13.96875             host i78739903459223                         
-47        13.96875             host i78739906460270                         
-35        13.96875             host i78739908429178                         
-65        55.87500         pod DL7873990-253884                             
-11        13.96875             host i78739906505598                         
-45        13.96875             host i78739906777113                         
-55        13.96875             host i78739907036976                         
 -3        13.96875             host i78739909344294                         
-61        55.87500         pod DL7873990-253885                             
-13        13.96875             host i78739906410716                         
-37        13.96875             host i78739906472540                         
-29        13.96875             host i78739907719701                         
-51        13.96875             host i78739909416742                         
-69        55.87500         pod DL7873990-253886                             
-19        13.96875             host i78739904021991                         
-31        13.96875             host i78739905002334                         
-57        13.96875             host i78739907517004                         
-49        13.96875             host i78739909387898                         
-67        55.87500         pod DL7873990-253887                             
 -9        13.96875             host i78739900279212                         
-17        13.96875             host i78739904028937                         
-23        13.96875             host i78739906726418                         
-39        13.96875             host i78739908512467                         
-62        55.87500         pod DL7873990-253968                             
-27        13.96875             host i78739902265430                         
-33        13.96875             host i78739903943360                         
-41        13.96875             host i78739907758338                         
-25        13.96875             host i78739908656183                         

So it seems that grouping failure domains into higher-level buckets that are not used in the crush rule breaks upmap.
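
For anyone needing the same workaround, the reparenting above amounts to something like the following, repeated for each pod and rack (a sketch using the bucket names from this cluster):

# re-parent each pod bucket directly under the room ...
ceph osd crush move DL7873990-253884 room=0513-R-0060
ceph osd crush move DL7873990-253887 room=0513-R-0060
# ... and so on for the other pods, then remove the now-empty rack buckets
ceph osd crush remove BE10
ceph osd crush remove BE11
ceph osd crush remove BE13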

#5 Updated by Neha Ojha over 3 years ago

  • Assignee set to David Zafman

#6 Updated by David Zafman over 3 years ago

  • Status changed from New to Rejected

I diagnosed this issue by running the following against the supplied osdmap.

CEPH_ARGS=" --debug_osd=30" osdmaptool --upmap-cleanup - osd.map

This problem is caused by a crush map error:

        {
            "type_id": 3,
            "name": "rack" 
        },
        {
            "type_id": 4,
            "name": "row" 
        },
        {
            "type_id": 5,
            "name": "pdu" 
        },
        {
            "type_id": 6,
            "name": "pod" 
        },

A "rack" can not have a type_id lower than "pod" since it is higher in the hierarchy. This caused the crush code which is looking for the OSDs under "pod" type 6. When it got to rack and saw type 3 it just stopped looking. Internally we generated the following log message" for every OSD.

2020-09-21T16:54:01.471-0700 7faec3060c40  1 verify_upmap unable to get parent of osd.57, skipping for now
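
For anyone hitting the same verify_upmap warning, the type ordering can be inspected by pulling and decompiling the crush map; as explained above, a bucket type must have a larger type_id than any type nested beneath it. A sketch (the consistent numbering in the comments is hypothetical, not this cluster's actual map):

# grab the current maps and decompile the crush map to text
ceph osd getmap -o osd.map
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# re-run the cleanup check that produced the verify_upmap message above
CEPH_ARGS="--debug_osd=30" osdmaptool --upmap-cleanup - osd.map

# in crush.txt the "# types" section should increase up the hierarchy, e.g.:
#   type 0 osd
#   type 1 host
#   type 2 pod     <- failure domain used by the rule
#   type 3 rack    <- parent of pod, so it needs the larger type_id
#   type 4 room
#   type 5 root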
