Bug #47580

cephadm: "Error ENOENT: Module not found": TypeError: type object argument after ** must be a mapping, not NoneType

Added by Ashley Merrick 7 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am currently running a 15.2.4 cluster. While trying to upgrade to 15.2.5, I hit the bug https://tracker.ceph.com/issues/46748

Sebastian suggested manually changing the docker image in the mgr entry and restarting the mgr, which worked around that bug.

However, running "ceph orch upgrade start --ceph-version 15.2.5"

now shows the following error: "Error ENOENT: Module not found"

I have attached the mgr logs for the last day to the ticket.

mgr.zip (573 KB) Ashley Merrick, 09/22/2020 11:14 AM

mgrrestart.zip (977 KB) Ashley Merrick, 09/22/2020 03:33 PM

mgr.zip - 29th Logs (695 KB) Ashley Merrick, 09/29/2020 10:49 AM

Archiv.zip - logs of both mgrs (875 KB) Tobias Fischer, 10/06/2020 07:45 AM


Related issues

Related to Orchestrator - Bug #47684: cephadm: auth get failed: failed to find osd.27 in keyring retval: -2 Resolved

History

#1 Updated by Sebastian Wagner 7 months ago

  • Project changed from Ceph to Orchestrator
  • Affected Versions v15.2.5 added
  • Affected Versions deleted (v15.2.4)

#2 Updated by Sebastian Wagner 7 months ago

  • Subject changed from mgr cephadm upgrade to cephadm: "Error ENOENT: Module not found"

#3 Updated by Sebastian Wagner 7 months ago

hm. according to

grep -v -e 'finish mon failed to return metadata' -e '/metrics' -e 'pgmap v' -e "no module 'cephadm'" mgr.txt 

the log is pretty much uninteresting.
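
For anyone who prefers to filter the log from a script, the grep above can be sketched in Python roughly as follows. The helper name `interesting_lines` is made up for illustration; the four patterns are the ones from the grep command, and none of them contain regex metacharacters, so plain substring matching should behave the same way.

```python
from typing import Iterable, Iterator

# Substrings taken from the grep -v command in this comment.
NOISE = (
    "finish mon failed to return metadata",
    "/metrics",
    "pgmap v",
    "no module 'cephadm'",
)


def interesting_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that contain none of the NOISE substrings."""
    for line in lines:
        if not any(pattern in line for pattern in NOISE):
            yield line
```

It could be used as `for line in interesting_lines(open("mgr.txt")): print(line, end="")`, assuming the log was saved as mgr.txt as in the command above.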

would you mind restarting the MGRs and then adding the logs again?

#4 Updated by Ashley Merrick 7 months ago

I have restarted the mgr server itself to make sure the docker container fully reloaded, re-ran the upgrade command, and uploaded an updated log file covering the whole of today.

I did notice lots of "mgr finish mon failed to return metadata for mds.cephfs.sn-s01.snkhfd", and no stats are showing in Ceph Dashboard -> Filesystems, where the normal performance graphs and request data for cephfs would usually appear.
Not sure if this is related or due to the currently partially complete upgrade.

#5 Updated by Ashley Merrick 7 months ago

I tried running "ceph orch status" and then checked the last few lines of the log and see the following, not sure if this is of any help:

Sep 25 05:08:18 sn-m01 bash[1694]: debug 2020-09-25T05:08:18.025+0000 7f9e41ec8700 -1 mgr.server reply reply (2) No such file or directory Module not found
Sep 25 05:09:13 sn-m01 bash[1694]: debug 2020-09-25T05:09:13.037+0000 7f9e416c7700  0 log_channel(audit) log [DBG] : from='client.146756643 -' entity='client.admin' cmd=[{                                              "prefix": "orch status", "target": ["mon-mgr", ""]}]: dispatch
Sep 25 05:09:13 sn-m01 bash[1694]: debug 2020-09-25T05:09:13.037+0000 7f9e41ec8700 -1 mgr.server reply reply (2) No such file or directory Module not found

#6 Updated by Sebastian Wagner 7 months ago

  • Subject changed from cephadm: "Error ENOENT: Module not found" to cephadm: "Error ENOENT: Module not found": TypeError: type object argument after ** must be a mapping, not NoneType
  • Priority changed from Normal to Urgent

Sep 22 15:20:25 sn-m01 bash[1694]: debug 2020-09-22T15:20:25.586+0000 7f710dc7f700 -1 mgr load Traceback (most recent call last):
Sep 22 15:20:25 sn-m01 bash[1694]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 312, in __init__
Sep 22 15:20:25 sn-m01 bash[1694]:     self.upgrade = CephadmUpgrade(self)
Sep 22 15:20:25 sn-m01 bash[1694]:   File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 59, in __init__
Sep 22 15:20:25 sn-m01 bash[1694]:     self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
Sep 22 15:20:25 sn-m01 bash[1694]:   File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 50, in from_json
Sep 22 15:20:25 sn-m01 bash[1694]:     return cls(**data)
Sep 22 15:20:25 sn-m01 bash[1694]: TypeError: type object argument after ** must be a mapping, not NoneType
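
The traceback above can be reproduced in isolation. This is only a minimal sketch: the `UpgradeState` class below is a hypothetical stand-in with made-up fields, not the real class from cephadm/upgrade.py, and the `None` guard is one possible way to handle the case, not necessarily what the actual fix does. It assumes the stored `mgr/cephadm/upgrade_state` value is the JSON literal `null`, which `json.loads` decodes to `None`.

```python
import json
from typing import Optional


class UpgradeState:
    """Hypothetical stand-in for the real class in cephadm/upgrade.py."""

    def __init__(self, target_name: str, progress_id: str) -> None:
        self.target_name = target_name
        self.progress_id = progress_id

    @classmethod
    def from_json(cls, data: Optional[dict]) -> Optional["UpgradeState"]:
        # Guard: a stored "null" decodes to None, which cannot be unpacked with **.
        if data is None:
            return None
        return cls(**data)


stored = "null"               # assumed content of the config key
decoded = json.loads(stored)  # JSON null becomes Python None

try:
    UpgradeState(**decoded)   # unguarded call, fails like the traceback
    unguarded_failed = False
except TypeError:
    unguarded_failed = True

# The guarded path returns None instead of raising.
state = UpgradeState.from_json(decoded)
```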

#7 Updated by Sebastian Wagner 7 months ago

  • Priority changed from Urgent to Immediate

#8 Updated by Sebastian Wagner 7 months ago

  • Pull request ID set to 37432

#9 Updated by Sebastian Wagner 7 months ago

workaround should be something like

ceph config-key get mgr/cephadm/upgrade_state

make sure it returns "null". Then:

ceph config-key rm mgr/cephadm/upgrade_state
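
The two steps above could be wrapped in a small script so that the key is only removed when it really holds "null". This is a sketch, not a tested tool: the helper names `run_ceph` and `remove_key_if_null` are invented here, it assumes the `ceph` CLI is on the PATH and prints the stored value on stdout, and it only uses the two `config-key` subcommands shown above.

```python
import json
import subprocess
from typing import Callable, List


def run_ceph(args: List[str]) -> str:
    """Run the ceph CLI and return its stdout; raises if the command fails."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def remove_key_if_null(
    key: str, ceph: Callable[[List[str]], str] = run_ceph
) -> bool:
    """Remove `key` only when its stored value decodes to JSON null."""
    raw = ceph(["config-key", "get", key])
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not valid JSON: leave the key alone
    if value is not None:
        return False  # real state is stored: do not delete it
    ceph(["config-key", "rm", key])
    return True
```

Calling `remove_key_if_null("mgr/cephadm/upgrade_state")` would then return True only if the key was removed. The `ceph` parameter exists so the logic can be exercised without a cluster.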

#10 Updated by Ashley Merrick 7 months ago

The above two commands worked and allowed the upgrade to start. It progressed to 2% (upgraded the MON and a few OSDs to 15.2.5), but now the cephadm module has crashed again with the first error, "Module 'cephadm' has failed: auth get failed: failed to find osd.27 in keyring retval: -2", even though the mgr is running 15.2.5, which should have this error fixed?

#11 Updated by Sebastian Wagner 7 months ago

do you have a chance to upload new MGR logs?

#12 Updated by Ashley Merrick 7 months ago

Attached.

Thanks

#13 Updated by Sebastian Wagner 7 months ago

  • Related to Bug #47684: cephadm: auth get failed: failed to find osd.27 in keyring retval: -2 added

#15 Updated by Sebastian Wagner 7 months ago

  • Status changed from New to Fix Under Review

#16 Updated by Nathan Cutler 6 months ago

  • Status changed from Fix Under Review to Pending Backport

#17 Updated by Tobias Fischer 6 months ago

I am having the same problem here:
I added a new host and OSDs yesterday evening. While the cluster was still rebalancing, I removed all OSDs from another host via "for i in {51..58}; do ceph orch osd rm $i; done".
Today I wanted to check the status with "ceph orch osd rm status" but got "Error ENOENT: Module not found".

A mgr restart did not help. Checking the logs, I found entries like

Oct 06 08:56:37 one1-ceph1 bash71526: debug 2020-10-06T06:56:37.368+0000 7f6049a68700 -1 mgr load Failed to construct class in 'cephadm'
Oct 06 08:56:37 one1-ceph1 bash71526: debug 2020-10-06T06:56:37.368+0000 7f6049a68700 -1 mgr load Traceback (most recent call last):
Oct 06 08:56:37 one1-ceph1 bash71526:   File "/usr/share/ceph/mgr/cephadm/module.py", line 325, in __init__
Oct 06 08:56:37 one1-ceph1 bash71526:     self.rm_util.load_from_store()
Oct 06 08:56:37 one1-ceph1 bash71526:   File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 465, in load_from_store
Oct 06 08:56:37 one1-ceph1 bash71526:     osd_obj = OSD.from_json(json.loads(osd), ctx=self)
Oct 06 08:56:37 one1-ceph1 bash71526:   File "/lib64/python3.6/json/__init__.py", line 348, in loads
Oct 06 08:56:37 one1-ceph1 bash71526:     'not {!r}'.format(s.__class__.__name__))
Oct 06 08:56:37 one1-ceph1 bash71526: TypeError: the JSON object must be str, bytes or bytearray, not 'dict'
Oct 06 08:56:37 one1-ceph1 bash71526: debug 2020-10-06T06:56:37.368+0000 7f6049a68700 -1 mgr operator() Failed to run module in active mode ('cephadm')
...
Oct 06 08:56:38 one1-ceph1 bash71526: debug 2020-10-06T06:56:38.097+0000 7f6049a68700 -1 no module 'cephadm'
Oct 06 08:56:38 one1-ceph1 bash71526: debug 2020-10-06T06:56:38.097+0000 7f6049a68700 -1 mgr.server reply reply (2) No such file or directory Module not found
...
Oct 06 09:04:10 one1-ceph1 bash71526: debug 2020-10-06T07:04:10.708+0000 7f604b26b700 0 mgr set_store `config-key set mgr/insights/health_history/2020-10-06_07 {"checks": {"BLUEFS_SPILLOVER": {"HEALTH_WARN": {"summary": ["1 OSD experiencing BlueFS spillover"], "detail": [" osd.55 spilled over 166 MiB metada>
Oct 06 09:04:10 one1-ceph1 bash71526: undersized for 65s, current state active+undersized+degraded, last acting [9,19]", "pg 20.1fb is active+undersized+degraded, acting [14,19]", "pg 20.1d0 is stuck undersized for 81s, current state active+undersized+degraded, last acting [23,10]", "pg 20.1da is stuck undersize>
Oct 06 09:04:10 one1-ceph1 bash71526: debug 2020-10-06T07:04:10.708+0000 7f604b26b700 0 mgr set_store mon returned -27: error: entry size limited to 65536 bytes. Use 'mon config key max entry size' to manually adjust

I checked the config key for OSD removal (ceph config-key get mgr/cephadm/osd_remove_queue):

obtained 'mgr/cephadm/osd_remove_queue'
[{"osd_id": 51, "started": true, "draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:41.888866", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:35.315613"}, {"osd_id": 52, "started": true, "draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:42.949972", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:36.090281"}, {"osd_id": 53, "started": true, "draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:43.976747", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:36.869471"}, {"osd_id": 54, "started": true, "draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:44.730852", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:37.724410"}, {"osd_id": 55, "started": true, "draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:45.764194", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:38.471945"}, {"osd_id": 56, "started": true, "draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:46.793051", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:39.214927"}, {"osd_id": 57, "started": true, "draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:47.813943", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:39.957469"}, {"osd_id": 58, "started": true, 
"draining": true, "stopped": false, "replace": false, "force": false, "nodename": "one1-ceph3", "drain_started_at": "2020-10-05T16:56:48.845405", "drain_stopped_at": null, "drain_done_at": null, "process_started_at": "2020-10-05T16:56:40.703363"}]
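
The stored value above is a JSON list of objects, and the traceback shows `json.loads` being handed a dict. The mismatch can be sketched in isolation as below. The `stored` string is a shortened, hypothetical copy of the queue, and this only illustrates the failure mode; it is not taken from the cephadm source.

```python
import json

# Shortened, hypothetical copy of the stored queue: a JSON list of objects.
stored = '[{"osd_id": 51, "draining": true}, {"osd_id": 52, "draining": true}]'

# Decoding once already yields a list of dicts.
queue = json.loads(stored)

# Decoding an element a second time fails, as in the traceback.
try:
    json.loads(queue[0])
    double_decode_failed = False
except TypeError:
    double_decode_failed = True

# Using the decoded elements directly works.
osd_ids = [entry["osd_id"] for entry in queue]
```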

Is there any workaround? This is a production cluster. Thanks.

#18 Updated by Tobias Fischer 6 months ago

Running the latest version:
"overall": "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 39

#19 Updated by Tobias Fischer 6 months ago

Running "ceph config-key rm mgr/cephadm/osd_remove_queue" and restarting the active mgr fixed the issue; "ceph orch" works again.

#20 Updated by Sebastian Wagner 4 months ago

  • Status changed from Pending Backport to Resolved
