Bug #51176
Module 'cephadm' has failed: 'MegaSAS'
Status: Closed
Description
This problem happened when I added OSDs to the cluster.
`ceph health detail` shows:
HEALTH_ERR Module 'cephadm' has failed: 'MegaSAS'
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'MegaSAS'
Module 'cephadm' has failed: 'MegaSAS'
`ceph orch ls` shows:
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1204, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
raise e
AssertionError: MegaSAS
And `ceph orch ps` shows a daemon named 'MegaSAS.log.bak' on host `node4`. I can neither find such a Docker container nor delete the service with `ceph orch daemon rm MegaSAS.log.bak`.
I have tried `ceph mgr module disable cephadm` followed by `ceph mgr module enable cephadm`, but it was useless.
My Ceph version is v15.2.13, installed with cephadm.
Updated by Loïc Dachary almost 3 years ago
- Target version deleted (v15.2.13)
- Affected Versions v15.2.13 added
Updated by Sebastian Wagner almost 3 years ago
- Status changed from New to Need More Info
Could you please attach the MGR log file?
Updated by Ke Xiao almost 3 years ago
Sebastian Wagner wrote:
Could you please attach the MGR log file?
Since June 3rd there have been no new logs. The last one is:
root@wh-control01:/var/log/ceph/a234ac9a-ae23-11eb-8b05-374bf4f061c1# zcat ceph-mgr.wh-control01.aspott.cephadm.log.1.gz
2021-06-03 02:27:11,524 [Dummy-1] [DEBUG] [root] setting log level based on debug_mgr: WARNING (1/5)
2021-06-03 02:33:26,948 [Dummy-1] [ERROR] [orchestrator._interface] _Promise failed
Traceback (most recent call last):
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
next_result = self._on_complete(self._value)
File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
File "/usr/share/ceph/mgr/cephadm/module.py", line 1333, in describe_service
hosts=[dd.hostname]
File "/lib/python3.6/site-packages/ceph/deployment/service_spec.py", line 429, in __init__
assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
AssertionError: MegaSAS
2021-06-03 02:45:46,800 [Dummy-1] [ERROR] [orchestrator._interface] _Promise failed
Traceback (most recent call last):
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
next_result = self._on_complete(self._value)
File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
File "/usr/share/ceph/mgr/cephadm/module.py", line 1333, in describe_service
hosts=[dd.hostname]
File "/lib/python3.6/site-packages/ceph/deployment/service_spec.py", line 429, in __init__
assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
AssertionError: MegaSAS
Updated by Ke Xiao almost 3 years ago
- File screenshot-20210622-171410.png screenshot-20210622-171410.png added
- File screenshot-20210622-171345.png screenshot-20210622-171345.png added
I have looked at the code and found that `ceph orch ls` works with a service name, for example `ceph orch ls crash`.
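The assertion in the traceback above can be reproduced in isolation. Below is a minimal sketch of the failing check, assuming cephadm's convention that a daemon name is `<type>.<id>` split on the first dot; the `KNOWN_SERVICE_TYPES` set here is an illustrative subset I wrote for the example, not the real list from `service_spec.py`:

```python
# Sketch of the check that produces "AssertionError: MegaSAS".
# Illustrative subset of service types; the authoritative list is
# ceph.deployment.service_spec.ServiceSpec.KNOWN_SERVICE_TYPES.
KNOWN_SERVICE_TYPES = {
    "mon", "mgr", "osd", "mds", "rgw", "nfs", "crash",
    "prometheus", "grafana", "alertmanager", "node-exporter",
}

def service_type_of(daemon_name: str) -> str:
    # cephadm names daemons "<type>.<id>"; everything before the
    # first dot is treated as the service type.
    return daemon_name.split(".", 1)[0]

def validate(daemon_name: str) -> str:
    service_type = service_type_of(daemon_name)
    # Mirrors the assert in ServiceSpec.__init__ shown in the traceback.
    assert service_type in KNOWN_SERVICE_TYPES, service_type
    return service_type
```

A stray file name such as `MegaSAS.log.bak` therefore parses to the unknown type `MegaSAS` and trips the assertion, while a real daemon name like `crash.wh-node04` parses cleanly.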
Updated by Sebastian Wagner almost 3 years ago
I think something erroneous ended up in the config-key store. Could you give us the export of the config-key dump?
ceph config-key dump mgr/cephadm
(please make sure you don't expose certificates and passwords)
Updated by Ke Xiao almost 3 years ago
Sebastian Wagner wrote:
I think something erroneous ended up in the config-key store. Could you give us the export of the config-key dump?
[...]
(please make sure you don't expose certificates and passwords)
Yes, you are right; I found 'MegaSAS.log.bak' in the config-key store.
Can I solve this problem by editing the config-key?
If so, how should I do it?
The command `ceph config-key dump mgr/cephadm` shows JSON data like this:
{
"mgr/cephadm/grafana_crt": "",
"mgr/cephadm/grafana_key": "",
"mgr/cephadm/host.wh-cmpt01": "",
"mgr/cephadm/host.wh-cmpt02": "",
"mgr/cephadm/host.wh-cmpt03": "",
"mgr/cephadm/host.wh-control01": "",
"mgr/cephadm/host.wh-control02": "",
"mgr/cephadm/host.wh-control03": "",
"mgr/cephadm/host.wh-node01": "",
"mgr/cephadm/host.wh-node02": "",
"mgr/cephadm/host.wh-node03": "",
"mgr/cephadm/host.wh-node04": {
"daemons": {
"osd.36": {
"daemon_type": "osd",
"daemon_id": "36",
"hostname": "wh-node04",
"container_id": "4666ea31cee6",
"container_image_id": "2cf504fded3980c76b59a354fca8f301941f86e369215a08752874d1ddb69b73",
"container_image_name": "docker.io/ceph/ceph:v15",
"version": "15.2.13",
"status": 1,
"status_desc": "running",
"osdspec_affinity": "all-available-devices",
"is_active": false,
"last_refresh": "2021-06-22T08:41:10.535579Z",
"created": "2021-06-02T07:06:05.369626Z",
"started": "2021-06-03T02:25:24.280830Z"
},
"osd.35": {
"daemon_type": "osd",
"daemon_id": "35",
"hostname": "wh-node04",
"container_id": "7071ae61bd70",
"container_image_id": "2cf504fded3980c76b59a354fca8f301941f86e369215a08752874d1ddb69b73",
"container_image_name": "docker.io/ceph/ceph:v15",
"version": "15.2.13",
"status": 1,
"status_desc": "running",
"osdspec_affinity": "all-available-devices",
"is_active": false,
"last_refresh": "2021-06-22T08:41:10.537535Z",
"created": "2021-06-02T07:06:04.145570Z",
"started": "2021-06-03T02:25:24.092646Z"
},
"node-exporter.wh-node04": {
"daemon_type": "node-exporter",
"daemon_id": "wh-node04",
"hostname": "wh-node04",
"container_id": "62e6d029030e",
"container_image_id": "e5a616e4b9cf68dfcad7782b78e118be4310022e874d52da85c55923fb615f87",
"container_image_name": "docker.io/prom/node-exporter:v0.18.1",
"version": "0.18.1",
"status": 1,
"status_desc": "running",
"is_active": false,
"last_refresh": "2021-06-22T08:41:10.538896Z",
"created": "2021-06-02T07:05:41.700674Z",
"started": "2021-06-03T02:25:22.666932Z"
},
"osd.34": {
"daemon_type": "osd",
"daemon_id": "34",
"hostname": "wh-node04",
"container_id": "41b40318dd0b",
"container_image_id": "2cf504fded3980c76b59a354fca8f301941f86e369215a08752874d1ddb69b73",
"container_image_name": "docker.io/ceph/ceph:v15",
"version": "15.2.13",
"status": 1,
"status_desc": "running",
"osdspec_affinity": "all-available-devices",
"is_active": false,
"last_refresh": "2021-06-22T08:41:10.539094Z",
"created": "2021-06-02T07:06:03.017518Z",
"started": "2021-06-03T02:25:24.294656Z"
},
"crash.wh-node04": {
"daemon_type": "crash",
"daemon_id": "wh-node04",
"hostname": "wh-node04",
"container_id": "26d7a4f54158",
"container_image_id": "2cf504fded3980c76b59a354fca8f301941f86e369215a08752874d1ddb69b73",
"container_image_name": "docker.io/ceph/ceph:v15",
"version": "15.2.13",
"status": 1,
"status_desc": "running",
"is_active": false,
"last_refresh": "2021-06-22T08:41:10.540328Z",
"created": "2021-06-02T07:05:40.828638Z",
"started": "2021-06-03T02:25:22.641109Z"
},
"MegaSAS.log.bak": {
"daemon_type": "MegaSAS",
"daemon_id": "log.bak",
"hostname": "wh-node04",
"status": 0,
"status_desc": "stopped",
"is_active": false,
"last_refresh": "2021-06-22T08:41:10.540521Z"
},
"osd.33": {
"daemon_type": "osd",
"daemon_id": "33",
"hostname": "wh-node04",
"container_id": "65b0145e7686",
"container_image_id": "2cf504fded3980c76b59a354fca8f301941f86e369215a08752874d1ddb69b73",
"container_image_name": "docker.io/ceph/ceph:v15",
"version": "15.2.13",
"status": 1,
"status_desc": "running",
"osdspec_affinity": "all-available-devices",
"is_active": false,
"last_refresh": "2021-06-22T08:41:10.540539Z",
"created": "2021-06-02T07:06:01.833469Z",
"started": "2021-06-03T02:25:24.314728Z"
},
"osd.45": {
"daemon_type": "osd",
"daemon_id": "45",
"hostname": "wh-node04",
"status": 1,
"status_desc": "starting",
"is_active": false
},
"osd.54": {
"daemon_type": "osd",
"daemon_id": "54",
"hostname": "wh-node04",
"status": 1,
"status_desc": "starting",
"is_active": false
}
},
"devices": [{
"rejected_reasons": ["Insufficient space (<10 extents) on vgs", "LVM detected", "locked"],
"available": false,
"path": "/dev/sdb",
"sys_api": {
"removable": "0",
"ro": "0",
"vendor": "DELL",
"model": "PERC H710P",
"rev": "3.13",
"sas_address": "",
"sas_device_handle": "",
"support_discard": "0",
"rotational": "1",
"nr_requests": "256",
"scheduler_mode": "mq-deadline",
"partitions": {},
"sectors": 0,
"sectorsize": "512",
"size": 2399812976640.0,
"human_readable_size": "2.18 TB",
"path": "/dev/sdb",
"locked": 1
},
"lvs": [{
"name": "osd-block-0d5d507d-e242-45bd-b5d0-009d6d2d0322",
"osd_id": "33",
"cluster_name": "ceph",
"type": "block",
"osd_fsid": "0d5d507d-e242-45bd-b5d0-009d6d2d0322",
"cluster_fsid": "a234ac9a-ae23-11eb-8b05-374bf4f061c1",
"osdspec_affinity": "all-available-devices",
"block_uuid": "cQabEW-PyJG-z6ek-w7bR-hK68-C1Is-uGCbjc"
}],
"human_readable_type": "hdd",
"device_id": "DELL_PERC_H710P_00a3d1f78dfd37dc2400bb55f960f681",
"lsm_data": {}
}, {
"rejected_reasons": ["Insufficient space (<10 extents) on vgs", "LVM detected", "locked"],
"available": false,
"path": "/dev/sdc",
"sys_api": {
"removable": "0",
"ro": "0",
"vendor": "DELL",
"model": "PERC H710P",
"rev": "3.13",
"sas_address": "",
"sas_device_handle": "",
"support_discard": "0",
"rotational": "1",
"nr_requests": "256",
"scheduler_mode": "mq-deadline",
"partitions": {},
"sectors": 0,
"sectorsize": "512",
"size": 2399812976640.0,
"human_readable_size": "2.18 TB",
"path": "/dev/sdc",
"locked": 1
},
"lvs": [{
"name": "osd-block-880c472b-8326-485a-8b86-33fe1d8db044",
"osd_id": "34",
"cluster_name": "ceph",
"type": "block",
"osd_fsid": "880c472b-8326-485a-8b86-33fe1d8db044",
"cluster_fsid": "a234ac9a-ae23-11eb-8b05-374bf4f061c1",
"osdspec_affinity": "all-available-devices",
"block_uuid": "fUDIJw-RfeJ-T5UX-H5km-JVzo-9737-cmk03Y"
}],
"human_readable_type": "hdd",
"device_id": "DELL_PERC_H710P_0066bd548e0338dc2400bb55f960f681",
"lsm_data": {}
}, {
"rejected_reasons": ["Insufficient space (<10 extents) on vgs", "LVM detected", "locked"],
"available": false,
"path": "/dev/sdd",
"sys_api": {
"removable": "0",
"ro": "0",
"vendor": "DELL",
"model": "PERC H710P",
"rev": "3.13",
"sas_address": "",
"sas_device_handle": "",
"support_discard": "0",
"rotational": "1",
"nr_requests": "256",
"scheduler_mode": "mq-deadline",
"partitions": {},
"sectors": 0,
"sectorsize": "512",
"size": 999653638144.0,
"human_readable_size": "931.00 GB",
"path": "/dev/sdd",
"locked": 1
},
"lvs": [{
"name": "osd-block-64512979-ada8-4098-9860-5d0597b086d2",
"osd_id": "35",
"cluster_name": "ceph",
"type": "block",
"osd_fsid": "64512979-ada8-4098-9860-5d0597b086d2",
"cluster_fsid": "a234ac9a-ae23-11eb-8b05-374bf4f061c1",
"osdspec_affinity": "all-available-devices",
"block_uuid": "Kdw40g-csQn-Lzq2-pBpm-5GRf-kpLS-NUCVad"
}],
"human_readable_type": "hdd",
"device_id": "DELL_PERC_H710P_000849ea8e0d38dc2400bb55f960f681",
"lsm_data": {}
}, {
"rejected_reasons": ["Insufficient space (<10 extents) on vgs", "LVM detected", "locked"],
"available": false,
"path": "/dev/sde",
"sys_api": {
"removable": "0",
"ro": "0",
"vendor": "DELL",
"model": "PERC H710P",
"rev": "3.13",
"sas_address": "",
"sas_device_handle": "",
"support_discard": "0",
"rotational": "1",
"nr_requests": "256",
"scheduler_mode": "mq-deadline",
"partitions": {},
"sectors": 0,
"sectorsize": "512",
"size": 2399812976640.0,
"human_readable_size": "2.18 TB",
"path": "/dev/sde",
"locked": 1
},
"lvs": [{
"name": "osd-block-d553e4d8-a7db-4bd1-a3be-85a36f42a9ec",
"osd_id": "36",
"cluster_name": "ceph",
"type": "block",
"osd_fsid": "d553e4d8-a7db-4bd1-a3be-85a36f42a9ec",
"cluster_fsid": "a234ac9a-ae23-11eb-8b05-374bf4f061c1",
"osdspec_affinity": "all-available-devices",
"block_uuid": "jna9la-ckPY-ZoAv-r57d-XouX-Jbyo-sapljF"
}],
"human_readable_type": "hdd",
"device_id": "DELL_PERC_H710P_008d294f9534960b2600bb55f960f681",
"lsm_data": {}
}],
"osdspec_previews": [],
"daemon_config_deps": {
"crash.wh-node04": {
"deps": [],
"last_config": "2021-06-02T07:05:39.197034Z"
},
"node-exporter.wh-node04": {
"deps": [],
"last_config": "2021-06-02T07:05:40.991697Z"
},
"osd.33": {
"deps": [],
"last_config": "2021-06-02T07:06:00.815063Z"
},
"osd.34": {
"deps": [],
"last_config": "2021-06-02T07:06:01.922781Z"
},
"osd.35": {
"deps": [],
"last_config": "2021-06-02T07:06:03.119950Z"
},
"osd.36": {
"deps": [],
"last_config": "2021-06-02T07:06:04.269442Z"
},
"osd.45": {
"deps": [],
"last_config": "2021-06-22T09:44:35.993549Z"
},
"osd.54": {
"deps": [],
"last_config": "2021-06-22T09:44:56.665786Z"
}
},
"networks": {},
"last_host_check": "2021-06-22T08:41:09.028416Z",
"scheduled_daemon_actions": {}
},
"mgr/cephadm/host.wh-node05": "",
"mgr/cephadm/host.wh-node06": "",
"mgr/cephadm/host.wh-node07": "",
"mgr/cephadm/host.wh-node08": "",
"mgr/cephadm/host.wh-node09": "",
"mgr/cephadm/inventory": "",
"mgr/cephadm/osd_remove_queue": "[]",
"mgr/cephadm/spec.alertmanager": "",
"mgr/cephadm/spec.crash": "",
"mgr/cephadm/spec.grafana": "",
"mgr/cephadm/spec.mds.cephfs": "",
"mgr/cephadm/spec.mgr": "",
"mgr/cephadm/spec.mon": "",
"mgr/cephadm/spec.node-exporter": "",
"mgr/cephadm/spec.osd.all-available-devices": "",
"mgr/cephadm/spec.prometheus": "",
"mgr/cephadm/ssh_identity_key": "***",
"mgr/cephadm/ssh_identity_pub": "***",
"mgr/cephadm/ssh_user": "root"
}
The value of "mgr/cephadm/host.wh-node04" is originally a string; I changed it to an object for easier reading, and I hid the other information.
osd.45 and osd.54 are the new OSDs I added yesterday. It seems that some queued tasks were blocked by this bug while I was applying the new OSDs. I can only work around it temporarily by running `ceph mgr module disable/enable cephadm`.
I also can't upgrade the Ceph cluster with cephadm.
Hope to get your reply soon. Thanks.
Updated by Ke Xiao almost 3 years ago
Sebastian Wagner wrote:
I think something erroneous ended up in the config-key store. Could you give us the export of the config-key dump?
[...]
(please make sure you don't expose certificates and passwords)
Do you need more information? Looking forward to your reply soon. Best wishes to you.
Updated by Sebastian Wagner almost 3 years ago
- Status changed from Need More Info to Fix Under Review
- Assignee set to Sebastian Wagner
- Backport set to pacific
- Pull request ID set to 42177
You probably want to do:
1. Remove the MegaSAS.log.bak file on node-04
2. Run:
$ ceph config-key set mgr/cephadm/pause true
$ ceph config-key rm mgr/cephadm/host.wh-node04
$ ceph mgr fail .....
3. Then wait a bit until `ceph -s` no longer shows stray daemons, then:
$ ceph orch resume
Updated by Ke Xiao almost 3 years ago
Sebastian Wagner wrote:
you probably want to do:
1. Remove the MegaSAS.log.bak file on node-04
2. run
[...]
3. Then wait a bit until `ceph -s` no longer shows stray daemons, then:
[...]
I followed the steps you suggested. The errors disappeared for a few seconds, then appeared again.
These are the steps I performed:
1. Renamed the MegaSAS.log file to ___MegaSAS.log__ on node-04
2. ceph config-key set mgr/cephadm/pause true
3. ceph config-key get mgr/cephadm/host.wh-node04
4. Deleted the contents about "MegaSAS.log.bak" from the JSON data of the config-key mgr/cephadm/host.wh-node04
5. ceph config-key set mgr/cephadm/host.wh-node04 with the modified contents
6. ceph mgr fail on the active mgr daemon
7. ceph orch resume
I also tried `ceph config-key rm mgr/cephadm/host.wh-node04` instead of `ceph config-key set mgr/cephadm/host.wh-node04`, but it still had no effect.
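Editing the host config-key by hand, as in step 4, can also be done mechanically. Below is a hypothetical sketch (the helper name and the reduced `KNOWN_SERVICE_TYPES` set are my own, not cephadm code) that takes the JSON value of a `mgr/cephadm/host.*` key and drops any daemon entry whose type is unknown, producing a value that could be written back with `ceph config-key set`:

```python
import json

# Illustrative subset; the authoritative list is
# ServiceSpec.KNOWN_SERVICE_TYPES inside the mgr.
KNOWN_SERVICE_TYPES = {
    "mon", "mgr", "osd", "mds", "rgw", "nfs", "crash",
    "prometheus", "grafana", "alertmanager", "node-exporter",
}

def prune_unknown_daemons(host_value: str) -> str:
    """Return the host config-key JSON with unknown daemon types removed."""
    data = json.loads(host_value)
    daemons = data.get("daemons", {})
    for name in list(daemons):
        # Fall back to parsing the daemon name if daemon_type is absent.
        dtype = daemons[name].get("daemon_type") or name.split(".", 1)[0]
        if dtype not in KNOWN_SERVICE_TYPES:
            del daemons[name]   # e.g. drops "MegaSAS.log.bak"
    return json.dumps(data)
```

As the thread goes on to show, pruning the config-key alone is not enough: the stray entry reappears on the next host refresh as long as the stray file itself remains on disk.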
I found some new tracebacks in the cephadm log, like this:
2021-07-06T02:20:48.512821+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 43 : cephadm [ERR] _Promise failed
Traceback (most recent call last):
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
next_result = self._on_complete(self._value)
File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
File "/usr/share/ceph/mgr/cephadm/module.py", line 1333, in describe_service
hosts=[dd.hostname]
File "/lib/python3.6/site-packages/ceph/deployment/service_spec.py", line 429, in __init__
assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
AssertionError: MegaSAS
2021-07-06T02:20:48.904172+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 44 : cephadm [INF] Reconfiguring osd.36 (unknown last config time)...
2021-07-06T02:20:48.914054+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 45 : cephadm [INF] Reconfiguring daemon osd.36 on wh-node04
2021-07-06T02:20:49.550420+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 46 : cephadm [INF] Reconfiguring osd.54 (unknown last config time)...
2021-07-06T02:20:49.564741+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 47 : cephadm [INF] Reconfiguring daemon osd.54 on wh-node04
2021-07-06T02:20:50.208777+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 49 : cephadm [INF] Reconfiguring osd.35 (unknown last config time)...
2021-07-06T02:20:50.221335+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 50 : cephadm [INF] Reconfiguring daemon osd.35 on wh-node04
2021-07-06T02:20:50.934262+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 51 : cephadm [INF] Reconfiguring osd.45 (unknown last config time)...
2021-07-06T02:20:50.949035+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 52 : cephadm [INF] Reconfiguring daemon osd.45 on wh-node04
2021-07-06T02:20:51.592860+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 53 : cephadm [INF] Reconfiguring node-exporter.wh-node04 (unknown last config time)...
2021-07-06T02:20:51.593198+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 54 : cephadm [INF] Reconfiguring daemon node-exporter.wh-node04 on wh-node04
2021-07-06T02:20:52.121357+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 56 : cephadm [INF] Reconfiguring osd.34 (unknown last config time)...
2021-07-06T02:20:52.135538+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 57 : cephadm [INF] Reconfiguring daemon osd.34 on wh-node04
2021-07-06T02:20:52.846721+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 58 : cephadm [INF] Reconfiguring crash.wh-node04 (unknown last config time)...
2021-07-06T02:20:52.851470+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 59 : cephadm [INF] Reconfiguring daemon crash.wh-node04 on wh-node04
2021-07-06T02:20:53.459776+0000 mgr.wh-control03.ejzhzu (mgr.18481098) 60 : cephadm [INF] Removing orphan daemon MegaSAS.log.bak...
What should I do next?
Looking forward to your suggestions, thanks!
Updated by Sebastian Wagner almost 3 years ago
Please make sure there are NO stray files next to the daemon directories in /var/lib/ceph/<cluster-fsid>.
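This matches the behaviour in the log: the host refresh lists the entries under `/var/lib/ceph/<cluster-fsid>` and parses each name as `<daemon_type>.<daemon_id>`, so any stray file there gets reported as a daemon. A rough sketch of that parsing (my own simplification, not the actual cephadm scanner):

```python
from pathlib import Path

def scan_daemon_dir(cluster_dir: str):
    """Parse every entry name under the cluster dir as <type>.<id>."""
    for entry in sorted(Path(cluster_dir).iterdir()):
        # partition splits on the first dot only, so IDs may contain dots.
        daemon_type, _, daemon_id = entry.name.partition(".")
        yield daemon_type, daemon_id
```

A leftover file named `MegaSAS.log.bak` thus yields `("MegaSAS", "log.bak")`, which later fails the `KNOWN_SERVICE_TYPES` assertion in the mgr, so deleting (or moving) the file is what removes the root cause.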
Updated by Ke Xiao almost 3 years ago
Yes, it works after deleting that file.
Thank you very much.
Updated by Kefu Chai almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sebastian Wagner over 2 years ago
- Status changed from Pending Backport to Resolved