Bug #51239 (closed)

[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed:

Added by Torkil Svensgaard almost 3 years ago. Updated 6 months ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Hi

I'm not sure what the problem is, but even if I made some mistake, the error message is lacking.

I have errors like this in the log:

"
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify devicehealth.notify:
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify Traceback (most recent call last):
Jun 15 09:44:22 dcn-ceph-01 bash[3278]:   File "/usr/share/ceph/mgr/devicehealth/module.py", line 229, in notify
Jun 15 09:44:22 dcn-ceph-01 bash[3278]:     self.create_device_pool()
Jun 15 09:44:22 dcn-ceph-01 bash[3278]:   File "/usr/share/ceph/mgr/devicehealth/module.py", line 254, in create_device_pool
Jun 15 09:44:22 dcn-ceph-01 bash[3278]:     assert r == 0
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: AssertionError
"

I believe it used to work when I originally installed ceph, and I have the pool:

"
  1. ceph osd dump | grep pool
    pool 9 'device_health_metrics' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2630 flags hashpspool stripe_width 0 compression_algorithm snappy compression_mode aggressive application health_metrics
    "

I'll be happy to provide any information needed. The cluster is not in production.

Best regards,

Torkil


Related issues: 1 (1 open, 0 closed)

Has duplicate: mgr - Bug #48670: Unhandled exception from module 'devicehealth' (status: New, assignee: Yaarit Hatuka)

#1

Updated by Torkil Svensgaard almost 3 years ago

Lacking as in:

"
  1. ceph -s
    cluster:
    id: 183ae4ba-9ced-11eb-9444-3cecef467984
    health: HEALTH_ERR
    mons are allowing insecure global_id reclaim
    Module 'devicehealth' has failed:
    333 pgs not deep-scrubbed in time
    334 pgs not scrubbed in time
services:
mon: 3 daemons, quorum dcn-ceph-01,dcn-ceph-03,dcn-ceph-02 (age 8d)
mgr: dcn-ceph-01.oifsfa(active, since 8d), standbys: dcn-ceph-02.oojihd
mds: cephfs:1 {0=cephfs.dcn-ceph-01.tjwkyl=up:active} 2 up:standby
osd: 37 osds: 37 up (since 8d), 37 in (since 5w); 29 remapped pgs
data:
pools: 3 pools, 576 pgs
objects: 60.23M objects, 34 TiB
usage: 71 TiB used, 80 TiB / 150 TiB avail
pgs: 12265510/238245544 objects misplaced (5.148%)
547 active+clean
27 active+remapped+backfill_wait
2 active+remapped+backfilling
io:
recovery: 14 MiB/s, 26 objects/s
"
#2

Updated by Neha Ojha almost 3 years ago

  • Has duplicate Bug #48670: Unhandled exception from module 'devicehealth' added
#3

Updated by Michael Wodniok over 2 years ago

This also affects v16.2.0 and v16.2.5 after an upgrade from v15. I assume this is a bug in the mgr module. Since it happens during the health check, it should not be that urgent (it did, however, first show up in our production environment).

#4

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Duplicate
#5

Updated by yite gu 6 months ago

I had the same problem on v16.2.14. The traceback from the mgr log is below:

2023-10-31T10:37:26.194+0000 7f6069e81700 -1 mgr notify devicehealth.notify:
2023-10-31T10:37:26.194+0000 7f6069e81700 -1 mgr notify Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 250, in notify
    self.maybe_create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 267, in maybe_create_device_pool
    self.create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 294, in create_device_pool
    assert r == 0
AssertionError

2023-10-31T10:37:27.195+0000 7f6069e81700 -1 mgr notify devicehealth.notify:
2023-10-31T10:37:27.195+0000 7f6069e81700 -1 mgr notify Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 250, in notify
    self.maybe_create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 267, in maybe_create_device_pool
    self.create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 294, in create_device_pool
    assert r == 0
AssertionError

I think the problem occurs when setting the pool application. My Ceph administrator manually modified the application to mgr:
# ceph osd pool ls detail
pool 3 'replicapool-ssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 20824 lfor 0/0/14112 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 19871 lfor 0/19871/19869 flags hashpspool stripe_width 0 application mgr

This was human error. To fix it, just disable the device_health_metrics pool's application and restart the mgr daemon.
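
In concrete commands, a sketch of that fix might look like the following; the `--yes-i-really-mean-it` flag and the bare `ceph mgr fail` form are assumptions based on recent releases (older releases need the active mgr's name as an argument to `ceph mgr fail`):

# Check which application is currently enabled on the pool
ceph osd pool application get device_health_metrics
# Remove the manually set application so the module can re-apply its own
ceph osd pool application disable device_health_metrics mgr --yes-i-really-mean-it
# Restart the active mgr so the devicehealth module retries cleanly
ceph mgr fail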
