Bug #62578


mon: osd pg-upmap-items command causes PG_DEGRADED warnings

Added by Patrick Donnelly 8 months ago. Updated 8 months ago.

Status: New
Priority: Normal
Category: Administration/Usability
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: reef,quincy,pacific
Regression: No
Severity: 4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2023-08-22T15:33:52.003+0000 7ffbe2517700  7 mon.a@0(leader).osd e214 prepare_update mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "31.23", "id": [0, 7]} v 0) v1 from mgr.4100 172.21.15.70:0/33077
...
2023-08-22T15:33:53.187+0000 7ffbe2517700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 1 pg degraded",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 31.23 is active+recovering+degraded, acting [7,3]" 
            }
        ]
    }
}

From: /teuthology/yuriw-2023-08-22_14:48:56-fs-pacific-release-distro-default-smithi/7376330/remote/smithi070/log/ceph-mon.a.log.gz
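For reference, the logged mon_command corresponds to a CLI invocation along the lines of the following, assuming the usual pg-upmap-items syntax where the id pair reads as "remap from osd.0 to osd.7":

    ceph osd pg-upmap-items 31.23 0 7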

I am loath to silence PG_DEGRADED across the board. Is there a way to avoid this warning otherwise?
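One narrower option, if muting is acceptable in the test harness, might be a time-limited mute of just this check rather than silencing it globally, e.g.:

    ceph health mute PG_DEGRADED 5m

Since ceph health mute takes an optional TTL, the warning would resurface if the PG stays degraded past the window.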

#1

Updated by Radoslaw Zarzynski 8 months ago

From the bug scrub: the message mentions acting [7,3], but the osd pg-upmap-items command requested [0, 7].
Doesn't this look like a race condition in, e.g., the balancer?
Do we have a compare-and-swap based on, e.g., the epoch used for the calculation of [0, 7]?

There is a time window between calculating the requested set and processing the command in which an OSD crash can happen. That would explain the recovery state.
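A minimal sketch of that compare-and-swap idea (all names here are hypothetical stand-ins, not real Ceph APIs): the balancer records the osdmap epoch its plan was computed against, and the apply path rejects the plan if the map has since advanced, e.g. because an OSD crashed in the interim.

    # Hypothetical sketch; OsdMapStub and apply_pg_upmap_items are
    # illustrative stand-ins, not actual Ceph code.
    class OsdMapStub:
        """Stand-in for the monitor's view of the osdmap."""
        def __init__(self, epoch):
            self.epoch = epoch
            self.pg_upmap_items = {}

    def apply_pg_upmap_items(osdmap, pgid, mapping, planned_epoch):
        # CAS guard: a plan is only valid for the epoch it was computed
        # against; an OSD crash between planning and application bumps
        # the epoch and should invalidate the plan.
        if planned_epoch != osdmap.epoch:
            return False  # caller recomputes against the new map
        osdmap.pg_upmap_items[pgid] = mapping
        return True

    # Plan computed at epoch 214 (as in the log above), but an OSD
    # failure has since moved the map to epoch 215: the stale plan is
    # refused instead of producing a degraded acting set.
    osdmap = OsdMapStub(epoch=215)
    assert apply_pg_upmap_items(osdmap, "31.23", [0, 7], 214) is False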
