Bug #51239

[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed:

Added by Torkil Svensgaard almost 2 years ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi

I'm not sure what the problem is, but even if I made some mistake, the error message is lacking.

I have errors like this in the log:

"
Jun 15 09:44:22 dcn-ceph-01 bash3278: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify devicehealth.notify:
Jun 15 09:44:22 dcn-ceph-01 bash3278: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify Traceback (most recent call last):
Jun 15 09:44:22 dcn-ceph-01 bash3278: File "/usr/share/ceph/mgr/devicehealth/module.py", line 229, in notify
Jun 15 09:44:22 dcn-ceph-01 bash3278: self.create_device_pool()
Jun 15 09:44:22 dcn-ceph-01 bash3278: File "/usr/share/ceph/mgr/devicehealth/module.py", line 254, in create_device_pool
Jun 15 09:44:22 dcn-ceph-01 bash3278: assert r == 0
Jun 15 09:44:22 dcn-ceph-01 bash3278: AssertionError
"

I believe it used to work when I originally installed ceph, and I have the pool:

"
# ceph osd dump | grep pool
pool 9 'device_health_metrics' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2630 flags hashpspool stripe_width 0 compression_algorithm snappy compression_mode aggressive application health_metrics
"

I'll be happy to provide any information needed. The cluster is not in production.

Best regards,

Torkil


Related issues

Duplicated by mgr - Bug #48670: Unhandled exception from module 'devicehealth' New

History

#1 Updated by Torkil Svensgaard almost 2 years ago

Lacking as in:

"
# ceph -s
  cluster:
    id:     183ae4ba-9ced-11eb-9444-3cecef467984
    health: HEALTH_ERR
            mons are allowing insecure global_id reclaim
            Module 'devicehealth' has failed:
            333 pgs not deep-scrubbed in time
            334 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum dcn-ceph-01,dcn-ceph-03,dcn-ceph-02 (age 8d)
    mgr: dcn-ceph-01.oifsfa(active, since 8d), standbys: dcn-ceph-02.oojihd
    mds: cephfs:1 {0=cephfs.dcn-ceph-01.tjwkyl=up:active} 2 up:standby
    osd: 37 osds: 37 up (since 8d), 37 in (since 5w); 29 remapped pgs

  data:
    pools:   3 pools, 576 pgs
    objects: 60.23M objects, 34 TiB
    usage:   71 TiB used, 80 TiB / 150 TiB avail
    pgs:     12265510/238245544 objects misplaced (5.148%)
             547 active+clean
             27 active+remapped+backfill_wait
             2 active+remapped+backfilling

  io:
    recovery: 14 MiB/s, 26 objects/s
"
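
For illustration only, here is a minimal Python sketch of why the health line ends at the colon. A bare `assert r == 0`, as in the traceback from module.py, raises an AssertionError with no message, so anything that formats the exception as text gets an empty string. This is not the Ceph mgr code: `run_command_stub`, `describe_failure` and both `create_pool_*` functions are hypothetical names, and the way the mgr actually builds the health message is an assumption here.

```python
def run_command_stub():
    """Hypothetical stand-in for a monitor command; returns (rc, out, err)."""
    return -1, "", "simulated failure"


def create_pool_bare_assert():
    # Same pattern as the traceback: a bare assert carries no message.
    r, out, err = run_command_stub()
    assert r == 0


def create_pool_with_message():
    # Alternative: raise an exception that says what actually went wrong.
    r, out, err = run_command_stub()
    if r != 0:
        raise RuntimeError(f"pool creation failed: rc={r}, err={err!r}")


def describe_failure(func):
    # Assumed formatting: module name followed by the exception text.
    try:
        func()
    except Exception as exc:
        return f"Module 'devicehealth' has failed: {exc}"
    return "ok"


if __name__ == "__main__":
    print(describe_failure(create_pool_bare_assert))
    print(describe_failure(create_pool_with_message))
```

With the bare assert, nothing follows the colon. With the explicit exception, the return code and error text appear in the message, which is the kind of detail missing from the report above.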

#2 Updated by Neha Ojha almost 2 years ago

  • Duplicated by Bug #48670: Unhandled exception from module 'devicehealth' added

#3 Updated by Michael Wodniok over 1 year ago

This also affects v16.2.0 and v16.2.5 after an upgrade from v15. I assume this is a bug in the mgr module. Since it happens during the health check, it should not be that urgent (although it first appeared in our production environment).

#4 Updated by Neha Ojha over 1 year ago

  • Status changed from New to Duplicate
