Bug #51239
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed:
Description
Hi
I'm not sure what the problem is, but even if I made a mistake, the error message is lacking.
I have errors like this in the log:
"
Jun 15 09:44:22 dcn-ceph-01 bash3278: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify devicehealth.notify:
Jun 15 09:44:22 dcn-ceph-01 bash3278: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify Traceback (most recent call last):
Jun 15 09:44:22 dcn-ceph-01 bash3278: File "/usr/share/ceph/mgr/devicehealth/module.py", line 229, in notify
Jun 15 09:44:22 dcn-ceph-01 bash3278: self.create_device_pool()
Jun 15 09:44:22 dcn-ceph-01 bash3278: File "/usr/share/ceph/mgr/devicehealth/module.py", line 254, in create_device_pool
Jun 15 09:44:22 dcn-ceph-01 bash3278: assert r == 0
Jun 15 09:44:22 dcn-ceph-01 bash3278: AssertionError
"
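A minimal sketch of why the health message may end with nothing after the colon. It assumes the text after "has failed:" is derived from the string form of the raised exception, which is not confirmed from the mgr source. The helper names below (`describe_failure`, `create_pool_bare`, `create_pool_verbose`, `run`) are hypothetical and only for illustration. A bare `assert r == 0` raises an `AssertionError` whose string form is empty, so the return code is lost.

```python
def describe_failure(module: str, exc: BaseException) -> str:
    # Hypothetical formatter that mirrors the shape of the health message.
    return f"Module '{module}' has failed: {exc}"


def create_pool_bare(r: int) -> None:
    # Same pattern as the traceback: a bare assert carries no message.
    assert r == 0


def create_pool_verbose(r: int, err: str) -> None:
    # Alternative pattern: keep the return code and error text in the exception.
    if r != 0:
        raise RuntimeError(f"creating device pool failed with r={r}: {err}")


def run(fn, *args) -> str:
    # Call fn and turn any exception into a health-style message.
    try:
        fn(*args)
    except Exception as exc:
        return describe_failure("devicehealth", exc)
    return "ok"


bare = run(create_pool_bare, -1)
verbose = run(create_pool_verbose, -1, "example error text")
print(repr(bare))
print(repr(verbose))
```

With the bare assert, the formatted message stops right after the colon. With an explicit exception message, the return code and error text would be visible in the health output.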
I believe it used to work when I originally installed Ceph, and the pool exists:
"- ceph osd dump | grep pool
pool 9 'device_health_metrics' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2630 flags hashpspool stripe_width 0 compression_algorithm snappy compression_mode aggressive application health_metrics
"
I'll be happy to provide any information needed. The cluster is not in production.
Best regards,
Torkil
Related issues
History
#1 Updated by Torkil Svensgaard almost 2 years ago
Lacking as in no reason is given after the colon in the health output:
"- ceph -s
cluster:
id: 183ae4ba-9ced-11eb-9444-3cecef467984
health: HEALTH_ERR
mons are allowing insecure global_id reclaim
Module 'devicehealth' has failed:
333 pgs not deep-scrubbed in time
334 pgs not scrubbed in time
services:
mon: 3 daemons, quorum dcn-ceph-01,dcn-ceph-03,dcn-ceph-02 (age 8d)
mgr: dcn-ceph-01.oifsfa(active, since 8d), standbys: dcn-ceph-02.oojihd
mds: cephfs:1 {0=cephfs.dcn-ceph-01.tjwkyl=up:active} 2 up:standby
osd: 37 osds: 37 up (since 8d), 37 in (since 5w); 29 remapped pgs
data:
pools: 3 pools, 576 pgs
objects: 60.23M objects, 34 TiB
usage: 71 TiB used, 80 TiB / 150 TiB avail
pgs: 12265510/238245544 objects misplaced (5.148%)
547 active+clean
27 active+remapped+backfill_wait
2 active+remapped+backfilling
io:
recovery: 14 MiB/s, 26 objects/s
"
#2 Updated by Neha Ojha almost 2 years ago
- Duplicated by Bug #48670: Unhandled exception from module 'devicehealth' added
#3 Updated by Michael Wodniok over 1 year ago
This also affects v16.2.0 and v16.2.5 after an upgrade from v15. I assume this is a bug in the mgr module. Since it happens during the health check, it should not be that urgent, although it first appeared in our production environment.
#4 Updated by Neha Ojha over 1 year ago
- Status changed from New to Duplicate