Bug #48230

nautilus: cluster [ERR] mgr modules have failed (MGR_MODULE_ERROR)

Added by Neha Ojha over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-11-12T22:33:15.916 INFO:tasks.ceph.mon.a.smithi038.stderr:2020-11-12 22:33:15.917 7fc763b5c700 -1 log_channel(cluster) log [ERR] : Health check failed: 3 mgr modules have failed (MGR_MODULE_ERROR)

Looking at the mon log:

2020-11-12 22:33:15.917 7fc763b5c700 20 mon.a@0(leader).mgrstat health checks:
{
    "MGR_MODULE_ERROR": {
        "severity": "HEALTH_ERR",
        "summary": {
            "message": "3 mgr modules have failed" 
        },
        "detail": [
            {
                "message": "Module 'rbd_support' has failed: Not found or unloadable" 
            },
            {
                "message": "Module 'status' has failed: Not found or unloadable" 
            },
            {
                "message": "Module 'volumes' has failed: Not found or unloadable" 
            }
        ]
    },
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 6 pgs peering" 
        },
        "detail": [
            {
                "message": "pg 1.0 is stuck peering for 123.626751, current state peering, last acting [1,0]" 
            },
            {
                "message": "pg 1.2 is stuck peering for 123.624747, current state peering, last acting [0]" 
            },
            {
                "message": "pg 1.3 is stuck peering for 123.626252, current state peering, last acting [1]" 
            },
            {
                "message": "pg 1.4 is stuck peering for 123.627984, current state peering, last acting [1,0]" 
            },
            {
                "message": "pg 1.6 is stuck peering for 123.627208, current state peering, last acting [1,0]" 
            },
            {
                "message": "pg 1.7 is stuck peering for 123.625035, current state peering, last acting [1]" 
            }
        ]
    }
}
.
.
.
2020-11-12 22:33:24.591 7fc766361700  0 log_channel(cluster) log [INF] : Health check cleared: MGR_MODULE_ERROR (was: 3 mgr modules have failed)

I think this can be ignored.

/a/yuriw-2020-11-12_20:34:09-rados-nautilus-distro-basic-smithi/5617251


Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #46224: Health check failed: 4 mgr modules have failed (MGR_MODULE_ERROR) - Resolved

Actions #1

Updated by Dan Mick over 3 years ago

It's odd, because the mgr log for the job cited above shows a lot of what look like normal status messages from rbd_support (at least).

Actions #2

Updated by Neha Ojha over 3 years ago

This seems to be due to those 3 modules not being present in "modules" when get_health_checks() is called.
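
As an illustration of that pattern (a minimal sketch, not the actual Ceph code; the function and variable names below are made up): a health-check builder that walks the list of enabled modules and flags anything missing from the loaded set will report "Not found or unloadable" both for modules that genuinely don't exist and for modules that simply haven't finished loading yet.

def module_health_detail(enabled_modules, loaded_modules):
    detail = []
    for name in enabled_modules:
        mod = loaded_modules.get(name)
        if mod is None:
            # not registered yet -- indistinguishable from "does not exist"
            detail.append(f"Module '{name}' has failed: Not found or unloadable")
        elif mod.get("error"):
            detail.append(f"Module '{name}' has failed: {mod['error']}")
    return detail

enabled = ["balancer", "rbd_support", "status", "volumes"]

# 22:33:22 -- only some modules have registered, so three get flagged
print(module_health_detail(enabled, {"balancer": {}}))

# 22:33:24 -- everything has loaded and the check clears
print(module_health_detail(enabled, {name: {} for name in enabled}))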

2020-11-12 22:33:22.397 7f79c57f3700 15 mgr get_health_checks getting health checks forbalancer
2020-11-12 22:33:22.397 7f79c57f3700 15 mgr get_health_checks getting health checks forcrash
2020-11-12 22:33:22.397 7f79c57f3700 15 mgr get_health_checks getting health checks fordevicehealth
2020-11-12 22:33:22.397 7f79c57f3700 15 mgr get_health_checks getting health checks foriostat
2020-11-12 22:33:22.397 7f79c57f3700 15 mgr get_health_checks getting health checks fororchestrator_cli
2020-11-12 22:33:22.397 7f79c57f3700 15 mgr get_health_checks getting health checks forprogress
2020-11-12 22:33:22.397 7f79c57f3700 10 mgr update_delta_stats  v15
2020-11-12 22:33:22.397 7f79c57f3700 10 mgr.server operator() 8 pgs: 6 active+clean, 2 peering; 0 B data, 548 KiB used, 267 GiB / 270 GiB avail
2020-11-12 22:33:22.397 7f79c57f3700 10 mgr.server operator() 2 health checks
2020-11-12 22:33:22.397 7f79c57f3700 20 mgr.server operator() health checks:
{
    "MGR_MODULE_ERROR": {
        "severity": "HEALTH_ERR",
        "summary": {
            "message": "3 mgr modules have failed" 
        },
        "detail": [
            {
                "message": "Module 'rbd_support' has failed: Not found or unloadable" 
            },
            {
                "message": "Module 'status' has failed: Not found or unloadable" 
            },
            {
                "message": "Module 'volumes' has failed: Not found or unloadable" 
            }
        ]
    },

Later, they are present:

2020-11-12 22:33:24.397 7f79c57f3700 10 mgr.server tick
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forbalancer
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forcrash
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks fordevicehealth
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks foriostat
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks fororchestrator_cli
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forprogress
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forrbd_support
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forrestful
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forstatus
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forvolumes
2020-11-12 22:33:24.397 7f79c57f3700 10 mgr update_delta_stats  v17
2020-11-12 22:33:24.397 7f79c57f3700 10 mgr.server operator() 24 pgs: 16 unknown, 6 active+clean, 2 peering; 0 B data, 548 KiB used, 267 GiB / 270 GiB avail
2020-11-12 22:33:24.397 7f79c57f3700 10 mgr.server operator() 1 health checks
2020-11-12 22:33:24.397 7f79c57f3700 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 2 pgs peering" 
        },

Just looking at rbd_support:

2020-11-12 22:33:04.902 7f5f738eb700 15 mgr get_health_checks getting health checks forrbd_support
2020-11-12 22:33:05.898 7f5f6f0a2700 20 mgr[rbd_support] TaskHandler: tick
2020-11-12 22:33:05.906 7f5f6f8a3700 20 mgr[rbd_support] PerfHandler: tick
2020-11-12 22:33:06.134 7f5f8d6f4700 15 mgr notify_all queuing notify to rbd_support
2020-11-12 22:33:07.455 7f79f360ae40  1 mgr[py] Loading python module 'rbd_support'
2020-11-12 22:33:07.489 7f79f360ae40  4 mgr[py] load_subclass_of: found class: 'rbd_support.Module'
2020-11-12 22:33:07.489 7f79f360ae40  4 mgr[py] Standby mode not provided by module 'rbd_support'
2020-11-12 22:33:08.405 7f79c6ff6700  4 mgr[py] Starting rbd_support
2020-11-12 22:33:08.414 7f79c17ab700  4 mgr[rbd_support] PerfHandler: starting
2020-11-12 22:33:10.917 7f79c6ff6700  4 mgr[rbd_support] load_task_task: rbd, start_after=
2020-11-12 22:33:13.414 7f79c17ab700 20 mgr[rbd_support] PerfHandler: tick
                "message": "Module 'rbd_support' has failed: Not found or unloadable" 
                "message": "Module 'rbd_support' has failed: Not found or unloadable" 
                "message": "Module 'rbd_support' has failed: Not found or unloadable" 
2020-11-12 22:33:18.415 7f79c17ab700 20 mgr[rbd_support] PerfHandler: tick
                "message": "Module 'rbd_support' has failed: Not found or unloadable" 
                "message": "Module 'rbd_support' has failed: Not found or unloadable" 
2020-11-12 22:33:23.415 7f79c17ab700 20 mgr[rbd_support] PerfHandler: tick
2020-11-12 22:33:23.429 7f79c6ff6700 20 mgr[rbd_support] sequence=0, tasks_by_sequence={}, tasks_by_id={}
2020-11-12 22:33:23.430 7f79bcfa2700  4 mgr[rbd_support] TaskHandler: starting
2020-11-12 22:33:23.430 7f79c6ff6700  1 mgr load Constructed class from module: rbd_support
2020-11-12 22:33:23.430 7f79c6ff6700  4 mgr operator() Starting thread for rbd_support
2020-11-12 22:33:23.430 7f79bc7a1700  4 mgr entry Entering thread for rbd_support
2020-11-12 22:33:23.575 7f79df5fc700 15 mgr notify_all queuing notify to rbd_support
2020-11-12 22:33:23.591 7f79df5fc700 15 mgr notify_all queuing notify (clog) to rbd_support
2020-11-12 22:33:23.591 7f79df5fc700 15 mgr notify_all queuing notify (clog) to rbd_support
2020-11-12 22:33:24.397 7f79c57f3700 15 mgr get_health_checks getting health checks forrbd_support
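
For what it's worth, just working from the timestamps quoted above (a rough sketch, not an automated analysis), the failing check lands squarely inside rbd_support's load window:

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
load_start  = datetime.strptime("2020-11-12 22:33:07.455", fmt)  # "Loading python module 'rbd_support'"
check_ran   = datetime.strptime("2020-11-12 22:33:22.397", fmt)  # health check reporting 3 failed modules
constructed = datetime.strptime("2020-11-12 22:33:23.430", fmt)  # "Constructed class from module: rbd_support"

print(check_ran - load_start)   # ~15 s into the load
print(constructed - check_ran)  # class finished constructing ~1 s after the check ran
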
Actions #4

Updated by Neha Ojha over 3 years ago

  • Related to Bug #46224: Health check failed: 4 mgr modules have failed (MGR_MODULE_ERROR) added
Actions #5

Updated by Neha Ojha over 3 years ago

  • Status changed from New to Resolved
  • Pull request ID set to 38069