
Bug #61180

Ceph version 15.2.17 (octopus stable) - HEALTH_ERR 4 mgr modules have failed

Added by Duy Nguyen Hong 11 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi Team,

We have a problem with a Ceph health check error:

HEALTH_ERR 4 mgr modules have failed
[ERR] MGR_MODULE_ERROR: 4 mgr modules have failed
Module 'devicehealth' has failed: Not found or unloadable
Module 'pg_autoscaler' has failed: Not found or unloadable
Module 'telemetry' has failed: 'NoneType' object has no attribute 'items'
Module 'volumes' has failed: Not found or unloadable

We tried disabling and re-enabling the diskprediction_local module; afterwards this alert appeared in the output of ceph -s.
How can we clear this alert?
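A common first step (a sketch only, not a confirmed fix for this cluster) is to inspect the module state and fail over the active mgr so the modules are reloaded; note that devicehealth, pg_autoscaler and volumes are always-on modules in Octopus and cannot simply be disabled:

```shell
# List mgr modules, including any reported error states
ceph mgr module ls

# Show the full health message for the failed modules
ceph health detail

# Fail over to a standby mgr so all modules are reloaded;
# substitute your active mgr daemon name (here: cephnode-121.jsgurc)
ceph mgr fail cephnode-121.jsgurc

# For "Not found or unloadable" errors, check that the module code is
# actually present on the mgr host (default path on most installs)
ls /usr/share/ceph/mgr/
```

If the failover does not clear the errors, the "Not found or unloadable" messages usually point at missing or broken mgr module packages on the host running the active mgr.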

Kernel: 5.4.0-124-generic

My production cluster environment:

  cluster:
    id:     cf0e8a4a-9c0a-11eb-966b-6fb1f36da8cd
    health: HEALTH_ERR
            4 mgr modules have failed

  services:
    mon: 5 daemons, quorum cephnode-120,cephnode-121,cephnode-124,cephnode-123,cephnode-122 (age 7M)
    mgr: cephnode-121.jsgurc(active, since 22h), standbys: cephnode-120.kxyhfa
    mds: fplay:2 {0=fplay.cephnode-127.baegiz=up:active,1=fplay.cephnode-128.qfebeb=up:active} 2 up:standby
    osd: 342 osds: 342 up (since 2h), 342 in (since 5w); 17 remapped pgs
    rgw: 15 daemons active (btsx.hcm.cephnode-120.pwxelj, btsx.hcm.cephnode-121.pvfttc, btsx.hcm.cephnode-122.grlyac, btsx.hcm.cephnode-123.aqfbgm, btsx.hcm.cephnode-124.ajeqqn, btsx.hcm.cephnode-125.osyamm, btsx.hcm.cephnode-126.wjlrlu, btsx.hcm.cephnode-127.qjsnme, btsx.hcm.cephnode-128.ndxmbx, btsx.hcm.cephnode-129.xvuhjm, btsx.hcm.cephnode-131.rsegwd, btsx.hcm.cephnode-132.fdeygv, btsx.hcm.cephnode-133.hhxvoi, btsx.hcm.cephnode-134.rrnbwv, btsx.hcm.cephnode-135.kobjbm)

  task status:

  data:
    pools:   11 pools, 1321 pgs
    objects: 406.11M objects, 1.8 PiB
    usage:   2.5 PiB used, 1.5 PiB / 4.0 PiB avail
    pgs:     2745648/6496535596 objects misplaced (0.042%)
             1302 active+clean
             15   active+remapped+backfilling
             2    active+clean+scrubbing+deep
             2    active+remapped+backfill_wait

  io:
    client:   44 MiB/s rd, 69 MiB/s wr, 56 op/s rd, 33 op/s wr
    recovery: 1.1 GiB/s, 230 objects/s
