Bug #22346

closed

OSD_ORPHAN issues after jewel->luminous upgrade, but orphaned osds not in crushmap

Added by Graham Allan over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
ceph cli
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Just updated a fairly long-lived (originally firefly) cluster from jewel to luminous 12.2.2.

One of the issues I see is a new health warning:

OSD_ORPHAN 3 osds exist in the crush map but not in the osdmap
osd.2 exists in crush map but not in osdmap
osd.14 exists in crush map but not in osdmap
osd.19 exists in crush map but not in osdmap

Seemed reasonable enough; these low-numbered OSDs were on long-decommissioned hardware. I thought I had removed them completely, though, and it seems I had:

$ ceph osd crush ls osd.2
Error ENOENT: node 'osd.2' does not exist
$ ceph osd crush remove osd.2
device 'osd.2' does not appear in the crush map

So where is it getting this warning from, and if it's erroneous, how can I clear it?

Dump of osd map attached...
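
For context, the OSD_ORPHAN check compares the OSD ids in the osdmap against the crush map's raw device list rather than the bucket tree that ceph osd crush ls walks, so one rough way to see both sides on a live cluster is something like this (a sketch, assuming a Luminous ceph CLI; the exact JSON layout may differ):

$ ceph osd ls                        # OSD ids the osdmap knows about
$ ceph osd crush dump > crush.json   # full crush map as JSON; look for stray entries in
                                     # its "devices" array that have no matching id above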


Files

osdmap.dump.20171207 (105 KB) osdmap.dump.20171207 "ceph osd dump" output Graham Allan, 12/07/2017 10:38 PM
osdmap.20171207 (127 KB) osdmap.20171207 "ceph osd getmap" Graham Allan, 12/07/2017 11:23 PM
Actions #1

Updated by Greg Farnum over 6 years ago

  • Project changed from rgw to RADOS
  • Category set to Administration/Usability
  • Assignee set to Brad Hubbard
  • Component(RADOS) ceph cli added
Actions #3

Updated by Brad Hubbard over 6 years ago

Thanks Graham,

I'll be taking a look into this. I can confirm that I can reproduce the issue locally with osdmaptool, and I'll investigate the cause and update here when I have some findings.

$ bin/osdmaptool --health osdmap.20171207
bin/osdmaptool: osdmap file 'osdmap.20171207'
{
    "OSD_ORPHAN": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "3 osds exist in the crush map but not in the osdmap" 
        },
        "detail": [
            {
                "message": "osd.2 exists in crush map but not in osdmap" 
            },
            {
                "message": "osd.14 exists in crush map but not in osdmap" 
            },
            {
                "message": "osd.19 exists in crush map but not in osdmap" 
            }
        ]
    }
}
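
For anyone following along: per the file description above, the binary map fed to osdmaptool here was captured from the cluster with something like the following.

$ ceph osd getmap -o osdmap.20171207    # write the current binary osdmap to a file
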
Actions #4

Updated by Brad Hubbard over 6 years ago

  • Status changed from New to 12
Actions #5

Updated by Brad Hubbard over 6 years ago

So this is happening because "device2", "device14", and "device19" still have entries in the "name_map" section of the crushmap. This appears to indicate they were only partially removed from the crushmap.

$ osdmaptool --health osdmap.20171207
osdmaptool: osdmap file 'osdmap.20171207'
{
    "OSD_ORPHAN": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "3 osds exist in the crush map but not in the osdmap" 
        },
        "detail": [
            {
                "message": "osd.2 exists in crush map but not in osdmap" 
            },
            {
                "message": "osd.14 exists in crush map but not in osdmap" 
            },
            {
                "message": "osd.19 exists in crush map but not in osdmap" 
            }
        ]
    }
}
$ osdmaptool --export-crush crushmap osdmap.20171207
osdmaptool: osdmap file 'osdmap.20171207'
osdmaptool: exported crush map to crushmap
$ crushtool -i crushmap --remove-item device2 -o crushmap
crushtool removing item device2
$ crushtool -i crushmap --remove-item device14 -o crushmap
crushtool removing item device14
$ crushtool -i crushmap --remove-item device19 -o crushmap
crushtool removing item device19
$ osdmaptool --import-crush crushmap osdmap.20171207
osdmaptool: osdmap file 'osdmap.20171207'
osdmaptool: imported 17117 byte crush map from crushmap
osdmaptool: writing epoch 493339 to osdmap.20171207
$ osdmaptool --health osdmap.20171207
osdmaptool: osdmap file 'osdmap.20171207'
{}

I suspect the solution here is to perform the same removal on your cluster's crushmap, but I'd like to discuss with some of my colleagues how this might have come about and what the best course of action is; I'll advise when I've done that.
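
For reference, the leftover name_map entries can also be spotted in the decompiled text form of the crush map, since orphaned devices keep their deviceN placeholder names; a quick sketch (the grep pattern is only illustrative):

$ osdmaptool --export-crush crushmap osdmap.20171207
$ crushtool -d crushmap -o crushmap.txt
$ grep ' device[0-9]*$' crushmap.txt    # orphans appear as "device N deviceN" lines
device 2 device2
device 14 device14
device 19 device19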

Actions #6

Updated by Graham Allan over 6 years ago

Interesting! It seems like we probably removed 30 osds from the old retired hardware, so it's curious that just 3 had these traces left behind.

Thanks, and I'll be ready to try whatever you suggest.

Actions #7

Updated by Brad Hubbard over 6 years ago

Hi Graham,

The consensus is that this was caused by a bug in a previous release that failed to remove the devices in question completely. The solution, therefore, should be to get the crushmap, remove the devices as above, and reinject the modified crushmap.
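
In concrete terms, something like the following should work on the live cluster (a sketch using the device names from this report; keep a copy of the extracted map in case you need to roll back):

$ ceph osd getcrushmap -o crushmap             # extract the current binary crush map
$ cp crushmap crushmap.backup                  # untouched copy, just in case
$ crushtool -i crushmap --remove-item device2 -o crushmap
$ crushtool -i crushmap --remove-item device14 -o crushmap
$ crushtool -i crushmap --remove-item device19 -o crushmap
$ ceph osd setcrushmap -i crushmap             # reinject the cleaned-up map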

Actions #8

Updated by Graham Allan over 6 years ago

That did clean it up, thanks.

It is curious, though, that when I decompile the crushmap to text, it appears the same both before and after using crushtool to remove the devices.

Actions #9

Updated by Brad Hubbard over 6 years ago

  • Status changed from 12 to Resolved

Not for me.

$ crushtool -d crushmap.bad -o crushmap.bad.txt
$ crushtool -d crushmap.good -o crushmap.good.txt
$ diff crushmap.bad.txt crushmap.good.txt
10,12d9
< device 2 device2
< device 14 device14
< device 19 device19

Actions #10

Updated by huang jun about 6 years ago

Brad Hubbard wrote:

Hi Graham,

The consensus is that this was caused by a bug in a previous release which failed to remove the devices in question completely. Therefore the solution should be to get the crushmap, remove the devices as above and reinject the modified crushmap.

Hi Brad, has this bug already been fixed in Luminous? Can you post the PR number?

Actions #11

Updated by Brad Hubbard about 6 years ago

Hi Jun,

It's not really possible to pinpoint an exact PR at this stage, as it's possible there was more than one, and it's also possible that this state was arrived at by an unusual sequence of steps, or by unusual circumstances or events. Basically, if you can reproduce this state with current tools and can provide the steps you followed to reach it, we'd like to hear about it.
