Bug #51239: [ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed:

closed

Added by Torkil Svensgaard almost 3 years ago. Updated 6 months ago.

Status: Duplicate
Priority: Normal
Severity: 3 - minor
Affected Versions: v15.2.13
Description

I'm not sure what the problem is, but even if I made a mistake, the error message is lacking.

I have errors like this in the log:

"
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify devicehealth.notify:
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: debug 2021-06-15T09:44:22.507+0000 7f704e4b3700 -1 mgr notify Traceback (most recent call last):
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 229, in notify
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: self.create_device_pool()
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 254, in create_device_pool
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: assert r == 0
Jun 15 09:44:22 dcn-ceph-01 bash[3278]: AssertionError
"

I believe it used to work when I originally installed Ceph, and the pool exists:

"
ceph osd dump | grep pool
pool 9 'device_health_metrics' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2630 flags hashpspool stripe_width 0 compression_algorithm snappy compression_mode aggressive application health_metrics
"

I'll be happy to provide any information needed. The cluster is not in production.

Best regards,

Torkil

Related issues 1 (1 open — 0 closed)

Updated by Torkil Svensgaard almost 3 years ago

By "lacking" I mean that the health summary reports the module failure with no reason after the colon:

"
ceph -s
cluster:
    id:     183ae4ba-9ced-11eb-9444-3cecef467984
    health: HEALTH_ERR
            mons are allowing insecure global_id reclaim
            Module 'devicehealth' has failed:
            333 pgs not deep-scrubbed in time
            334 pgs not scrubbed in time

services:
    mon: 3 daemons, quorum dcn-ceph-01,dcn-ceph-03,dcn-ceph-02 (age 8d)
    mgr: dcn-ceph-01.oifsfa(active, since 8d), standbys: dcn-ceph-02.oojihd
    mds: cephfs:1 {0=cephfs.dcn-ceph-01.tjwkyl=up:active} 2 up:standby
    osd: 37 osds: 37 up (since 8d), 37 in (since 5w); 29 remapped pgs

data:
    pools:   3 pools, 576 pgs
    objects: 60.23M objects, 34 TiB
    usage:   71 TiB used, 80 TiB / 150 TiB avail
    pgs:     12265510/238245544 objects misplaced (5.148%)
             547 active+clean
             27  active+remapped+backfill_wait
             2   active+remapped+backfilling

io:
    recovery: 14 MiB/s, 26 objects/s
"
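The empty text after "has failed:" is consistent with the bare `assert r == 0` in the traceback: an AssertionError raised without a message has an empty string form. The following is a minimal sketch of that effect, not the module's actual code. It assumes the health line is built by appending the exception's string form; `health_detail`, `check_bare`, `check_with_message`, and the sample return code and error text are all hypothetical names and values used for illustration.

```python
def health_detail(module_name, exc):
    # Assumption: the summary line appends str(exc) after the colon.
    return f"Module '{module_name}' has failed: {exc}"


def check_bare(r):
    # Same shape as the assert in the traceback: no message attached.
    assert r == 0


def check_with_message(r, outs):
    # Hypothetical variant that carries the return code and error text.
    assert r == 0, f"mon command returned {r}: {outs}"


def run(check, *args):
    """Run a check and return the health line it would produce."""
    try:
        check(*args)
    except AssertionError as exc:
        return health_detail("devicehealth", exc)
    return "ok"


if __name__ == "__main__":
    print(repr(run(check_bare, -1)))
    print(repr(run(check_with_message, -1, "example error text")))
```

With the bare assert the line ends right after the colon, which matches what `ceph -s` shows above; attaching a message to the assert is one way the reason could be surfaced.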

Updated by Neha Ojha almost 3 years ago

Has duplicate Bug #48670: Unhandled exception from module 'devicehealth' added

Updated by Michael Wodniok over 2 years ago

This also affects v16.2.0 and v16.2.5 after upgrading from v15. I assume this is a bug in the mgr module. Since it happens during the health check, it should not be that urgent, although it first appeared in our production environment.

Updated by Neha Ojha over 2 years ago

Status changed from New to Duplicate

Updated by yite gu 6 months ago

I had the same problem on v16.2.14. The traceback from the mgr log is below:

2023-10-31T10:37:26.194+0000 7f6069e81700 -1 mgr notify devicehealth.notify:
2023-10-31T10:37:26.194+0000 7f6069e81700 -1 mgr notify Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 250, in notify
    self.maybe_create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 267, in maybe_create_device_pool
    self.create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 294, in create_device_pool
    assert r == 0
AssertionError

2023-10-31T10:37:27.195+0000 7f6069e81700 -1 mgr notify devicehealth.notify:
2023-10-31T10:37:27.195+0000 7f6069e81700 -1 mgr notify Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 250, in notify
    self.maybe_create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 267, in maybe_create_device_pool
    self.create_device_pool()
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 294, in create_device_pool
    assert r == 0
AssertionError

I think the problem occurs when setting the pool application. My Ceph administrator had manually changed the application to mgr:

# ceph osd pool ls detail
pool 3 'replicapool-ssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 20824 lfor 0/0/14112 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 19871 lfor 0/19871/19869 flags hashpspool stripe_width 0 application mgr

This was human error. To recover, you only need to disable the application on the device_health_metrics pool and restart the mgr daemon.
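To check for this state quickly, the application tag of each pool can be pulled out of the `ceph osd pool ls detail` text output. The sketch below is a rough helper under stated assumptions: `pool_applications` and `SAMPLE` are hypothetical names, the sample lines are shortened from the output above, and the parser assumes the `application` field holds a comma-separated list with no spaces.

```python
import re

# Matches lines such as: pool 4 'device_health_metrics' replicated ...
POOL_LINE = re.compile(r"^pool\s+\d+\s+'(?P<name>[^']+)'(?P<rest>.*)$")
# Assumption: tags follow the word "application" as one comma-separated token.
APP_FIELD = re.compile(r"\bapplication\s+(?P<apps>\S+)")

# Shortened copies of the two pool lines quoted in this comment.
SAMPLE = (
    "pool 3 'replicapool-ssd' replicated size 3 min_size 2 "
    "flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd\n"
    "pool 4 'device_health_metrics' replicated size 3 min_size 2 "
    "flags hashpspool stripe_width 0 application mgr\n"
)


def pool_applications(ls_detail_text):
    """Map each pool name to the list of application tags found on its line."""
    result = {}
    for line in ls_detail_text.splitlines():
        match = POOL_LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a pool line
        app = APP_FIELD.search(match.group("rest"))
        result[match.group("name")] = app.group("apps").split(",") if app else []
    return result


if __name__ == "__main__":
    for name, apps in pool_applications(SAMPLE).items():
        print(name, apps)
```

For the fix itself, the commands I would expect to use are `ceph osd pool application disable device_health_metrics mgr --yes-i-really-mean-it` followed by restarting or failing over the active mgr (for example with `ceph mgr fail`); please verify the exact syntax against the documentation for your release before running them.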
