Bug #51239
closed
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed:
Description
Hi
I'm not sure what the problem is, but even if I made a mistake, the error message is lacking.
I have errors like this in the log:
"
Jun 15 09:44:22 dcn-ceph-01 bash3278: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify devicehealth.notify:
Jun 15 09:44:22 dcn-ceph-01 bash3278: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify Traceback (most recent call last):
Jun 15 09:44:22 dcn-ceph-01 bash3278: File "/usr/share/ceph/mgr/devicehealth/module.py", line 229, in notify
Jun 15 09:44:22 dcn-ceph-01 bash3278: self.create_device_pool()
Jun 15 09:44:22 dcn-ceph-01 bash3278: File "/usr/share/ceph/mgr/devicehealth/module.py", line 254, in create_device_pool
Jun 15 09:44:22 dcn-ceph-01 bash3278: assert r == 0
Jun 15 09:44:22 dcn-ceph-01 bash3278: AssertionError
"
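The traceback above ends in a bare `AssertionError` with no text, which is why the health message after "has failed:" is empty. The following is a minimal sketch of that effect, not the actual Ceph module code: `bare_check` and `descriptive_check` are hypothetical helpers, and the return code and status string are simulated values.

```python
import errno


def bare_check(r):
    # Mirrors the style seen in the traceback: no message is attached,
    # so str(exception) is an empty string.
    assert r == 0


def descriptive_check(r, outs, what):
    # Hypothetical alternative: include the return code, its errno name
    # (assuming a negative errno-style code) and the status string.
    if r != 0:
        name = errno.errorcode.get(-r, "unknown")
        raise RuntimeError(f"{what} failed: r={r} ({name}): {outs}")


if __name__ == "__main__":
    simulated_r = -errno.EPERM          # simulated failure code
    simulated_outs = "simulated status"  # simulated status string

    try:
        bare_check(simulated_r)
    except AssertionError as exc:
        print(f"bare assert message: {str(exc)!r}")

    try:
        descriptive_check(simulated_r, simulated_outs, "pool setup step")
    except RuntimeError as exc:
        print(f"descriptive message: {exc}")
```

With the descriptive variant, the failing step and return code would appear in the log instead of an empty message.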
I believe it used to work when I originally installed ceph, and I have the pool:
"# ceph osd dump | grep pool
pool 9 'device_health_metrics' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2630 flags hashpspool stripe_width 0 compression_algorithm snappy compression_mode aggressive application health_metrics
"
I'll be happy to provide any information needed. The cluster is not in production.
Best regards,
Torkil
Updated by Torkil Svensgaard almost 3 years ago
Lacking as in:
"# ceph -s
cluster:
id: 183ae4ba-9ced-11eb-9444-3cecef467984
health: HEALTH_ERR
mons are allowing insecure global_id reclaim
Module 'devicehealth' has failed:
333 pgs not deep-scrubbed in time
334 pgs not scrubbed in time
services:
mon: 3 daemons, quorum dcn-ceph-01,dcn-ceph-03,dcn-ceph-02 (age 8d)
mgr: dcn-ceph-01.oifsfa(active, since 8d), standbys: dcn-ceph-02.oojihd
mds: cephfs:1 {0=cephfs.dcn-ceph-01.tjwkyl=up:active} 2 up:standby
osd: 37 osds: 37 up (since 8d), 37 in (since 5w); 29 remapped pgs
data:
pools: 3 pools, 576 pgs
objects: 60.23M objects, 34 TiB
usage: 71 TiB used, 80 TiB / 150 TiB avail
pgs: 12265510/238245544 objects misplaced (5.148%)
547 active+clean
27 active+remapped+backfill_wait
2 active+remapped+backfilling
io:
recovery: 14 MiB/s, 26 objects/s
"
Updated by Neha Ojha almost 3 years ago
- Has duplicate Bug #48670: Unhandled exception from module 'devicehealth' added
Updated by Michael Wodniok over 2 years ago
This also affects v16.2.0 and v16.2.5 after an upgrade from v15. I assume this is a bug in the mgr module. Since it happens during the health check, it should not be that urgent (however, it first appeared in our production environment).
Updated by yite gu 6 months ago
I had the same problem on v16.2.14. The traceback from the mgr log is below:
2023-10-31T10:37:26.194+0000 7f6069e81700 -1 mgr notify devicehealth.notify:
2023-10-31T10:37:26.194+0000 7f6069e81700 -1 mgr notify Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 250, in notify
    self.maybe_create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 267, in maybe_create_device_pool
    self.create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 294, in create_device_pool
    assert r == 0
AssertionError
2023-10-31T10:37:27.195+0000 7f6069e81700 -1 mgr notify devicehealth.notify:
2023-10-31T10:37:27.195+0000 7f6069e81700 -1 mgr notify Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 250, in notify
    self.maybe_create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 267, in maybe_create_device_pool
    self.create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 294, in create_device_pool
    assert r == 0
AssertionError
I think the problem occurs when the module sets the pool application. My Ceph administrator had manually changed the application to mgr:
# ceph osd pool ls detail
pool 3 'replicapool-ssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 20824 lfor 0/0/14112 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 19871 lfor 0/19871/19869 flags hashpspool stripe_width 0 application mgr
This was a human error. To fix it, you only need to disable the application on the device_health_metrics pool and restart the mgr daemon.
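To see which application tag a pool carries before changing anything, the `pool N ...` lines quoted in this thread can be parsed. This is a minimal sketch, assuming the applications follow the `application` keyword as a comma-separated list; `parse_pool_line` is a hypothetical helper, not part of Ceph, and the sample line is copied from the output above.

```python
import re


def parse_pool_line(line):
    """Return (pool_name, [applications]) for one 'pool N ...' line, else None."""
    match = re.match(r"pool \d+ '([^']+)'", line.strip())
    if match is None:
        return None
    tokens = line.split()
    apps = []
    # Assumption: the token right after 'application' holds the tag(s).
    if "application" in tokens:
        idx = tokens.index("application")
        if idx + 1 < len(tokens):
            apps = tokens[idx + 1].split(",")
    return match.group(1), apps


if __name__ == "__main__":
    sample = (
        "pool 4 'device_health_metrics' replicated size 3 min_size 2 "
        "crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 "
        "autoscale_mode off last_change 19871 lfor 0/19871/19869 "
        "flags hashpspool stripe_width 0 application mgr"
    )
    name, apps = parse_pool_line(sample)
    print(name, apps)
```

A tag on device_health_metrics that someone set by hand, like `mgr` here, is the thing to look for.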