
Bug #52624

qa: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)"

Added by Patrick Donnelly over 1 year ago. Updated 5 months ago.

Status: New
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/ceph/teuthology-archive/pdonnell-2021-09-14_01:17:08-fs-wip-pdonnell-testing-20210910.181451-distro-basic-smithi/6388614/teuthology.log

The test wasn't really doing anything to the cluster at the time; it may have stumbled on a bug of some kind.


Related issues

Duplicated by CephFS - Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" Duplicate

History

#1 Updated by Neha Ojha over 1 year ago

2021-09-14T02:04:30.392+0000 7f75fd230700  1 -- 172.21.15.134:0/17688 <== mon.0 v2:172.21.15.134:3300/0 162 ==== mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.2e", "id": [0, 5]}]=0 set 7.2e pg_upmap_items mapping to [0->5] v44) v1 ==== 155+0+0 (secure 0 0 0) 0x55d65a8c29c0 con 0x55d655fbd400
...
2021-09-14T02:04:32.312+0000 7f75eb753700 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 7.2e is stuck peering for 62s, current state peering, last acting [4,5]" 
            }
        ]
    }
}

Looks like peering induced by mapping change by the balancer. How often does this happen?

#2 Updated by Patrick Donnelly over 1 year ago

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.
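
For reference, counts like this can be pulled straight from a run archive by scanning each job's teuthology.log for the warning. A rough sketch, assuming the <run>/<job-id>/teuthology.log layout seen in the paths above (gzipped logs such as teuthology.log.gz are not handled):

#!/usr/bin/env python3
# Rough sketch: count the jobs in a teuthology run archive whose log contains
# the PG_AVAILABILITY "pgs peering" warning. Not part of the test suite.
import re
from pathlib import Path

WARNING = re.compile(
    r"Health check failed: Reduced data availability: \d+ pgs? peering \(PG_AVAILABILITY\)"
)

def jobs_with_peering_warning(archive_root: str) -> list[str]:
    hits = []
    # Assumed layout: <archive_root>/<job-id>/teuthology.log
    for log in Path(archive_root).glob("*/teuthology.log"):
        if WARNING.search(log.read_text(errors="replace")):
            hits.append(log.parent.name)
    return hits

if __name__ == "__main__":
    # Hypothetical run directory; substitute a real archive path.
    run = "/ceph/teuthology-archive/<run-name>"
    jobs = jobs_with_peering_warning(run)
    print(f"{len(jobs)} job(s) hit the warning: {sorted(jobs)}")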

#3 Updated by Patrick Donnelly over 1 year ago

  • Duplicated by Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" added

#4 Updated by Patrick Donnelly over 1 year ago

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

#5 Updated by Neha Ojha over 1 year ago

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

#6 Updated by Patrick Donnelly over 1 year ago

Neha Ojha wrote:

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

Oldest occurrence I can find is

https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/

but there may be some I missed.

#7 Updated by Patrick Donnelly over 1 year ago

Patrick Donnelly wrote:

Neha Ojha wrote:

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

Oldest occurrence I can find is

https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/

but there may be some I missed.

Scratch that. Looks like the earliest I can grep for is:

/a/pdonnell-2021-05-12_04:01:31-fs-wip-pdonnell-testing-20210511.232042-distro-basic-smithi/6110639/teuthology.log.gz

This is before the .mgr pool change was merged (in June).

#10 Updated by Radoslaw Zarzynski 9 months ago

To judge how severe the problem really is, we need to know whether the stall is permanent (the PG gets stuck and makes no progress) or just a (slightly) delayed operation.
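
One rough way to answer that from the cluster log is to pair the mon's "Health check failed ... (PG_AVAILABILITY)" and "Health check cleared: PG_AVAILABILITY" lines (the same format quoted in comment #17 below) and compute how long the warning stayed raised; a window that never gets a matching "cleared" line would be the permanent-stall case. A sketch under those assumptions (the log path is hypothetical):

#!/usr/bin/env python3
# Sketch: measure how long PG_AVAILABILITY stayed raised by pairing the mon's
# "Health check failed" / "Health check cleared" cluster-log lines.
import re
from datetime import datetime

TS = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
FAILED = re.compile(TS + r".*Health check failed: .*\(PG_AVAILABILITY\)")
CLEARED = re.compile(TS + r".*Health check cleared: PG_AVAILABILITY")

def parse_ts(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")

def availability_windows(path: str):
    """Yield (raised_at, cleared_at, seconds) per warning window; cleared_at is
    None if the warning was never cleared, i.e. a permanent stall."""
    raised = None
    with open(path, errors="replace") as f:
        for line in f:
            m = FAILED.search(line)
            if m and raised is None:
                raised = parse_ts(m.group(1))
                continue
            m = CLEARED.search(line)
            if m and raised is not None:
                cleared = parse_ts(m.group(1))
                yield raised, cleared, (cleared - raised).total_seconds()
                raised = None
    if raised is not None:
        yield raised, None, None

if __name__ == "__main__":
    # Hypothetical path; point this at the job's copy of ceph.log.
    for start, end, secs in availability_windows("/var/log/ceph/ceph.log"):
        print(start, "->", end or "never cleared",
              f"({secs}s)" if secs is not None else "")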

#15 Updated by Radoslaw Zarzynski 6 months ago

  • Assignee set to Aishwarya Mathuria

#17 Updated by Aishwarya Mathuria 5 months ago

I have been going through the failure logs mentioned above and I see that the health check does pass eventually:

2022-08-19T23:14:08.032190+0000 mon.a (mon.0) 1636 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-08-19T23:14:09.980141+0000 mgr.y (mgr.9288) 36 : cluster [DBG] pgmap v37: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 1.1 KiB/s wr, 0 op/s
2022-08-19T23:14:11.980868+0000 mgr.y (mgr.9288) 37 : cluster [DBG] pgmap v38: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s
2022-08-19T23:14:13.981488+0000 mgr.y (mgr.9288) 38 : cluster [DBG] pgmap v39: 137 pgs: 137 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s; 0 B/s, 0 objects/s recovering
2022-08-19T23:14:14.032388+0000 mon.a (mon.0) 1637 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-08-19T23:14:14.032413+0000 mon.a (mon.0) 1638 : cluster [INF] Cluster is now healthy

But in qa/tasks/ceph.py (https://github.com/ceph/ceph/blob/main/qa/tasks/ceph.py#L1138) we check for the first occurrence of WRN-level lines in the cluster log.
So the test fails because it finds the following log:

2022-08-19T23:15:59.841 INFO:tasks.ceph:Checking cluster log for badness...
2022-08-19T23:15:59.841 DEBUG:teuthology.orchestra.run.smithi138:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2022-08-19T23:15:59.955 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-08-19T23:15:59.956 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-19T23:16:00.284 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)

If we look at the timestamp, that log entry was already about 25 minutes old when the log check ran; at 2022-08-19T23:16:00 the cluster was actually healthy again.
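
For illustration, a simplified approximation of what that check does: take the first [ERR]/[WRN]/[SEC] line in ceph.log that is not excluded by an ignorelist pattern and fail the job on it. The real logic is the egrep pipeline quoted above (driven by qa/tasks/ceph.py and the suite's log-ignorelist); the ignore patterns below are only a subset of the ones visible in that command:

#!/usr/bin/env python3
# Simplified approximation of the "Checking cluster log for badness" step;
# not the actual qa/tasks/ceph.py code.
import re

MATCH = re.compile(r"\[ERR\]|\[WRN\]|\[SEC\]")
# Subset of the exclusions visible in the egrep -v chain above.
IGNORE = [re.compile(p) for p in (
    r"\(MDS_ALL_DOWN\)",
    r"\(MDS_UP_LESS_THAN_MAX\)",
    r"overall HEALTH_",
    r"\(OSD_",
    r"but it is still running",
    r"is not responding",
)]

def first_bad_line(path: str):
    """Return the first un-ignored ERR/WRN/SEC line, or None if the log is clean."""
    with open(path, errors="replace") as f:
        for line in f:
            if MATCH.search(line) and not any(p.search(line) for p in IGNORE):
                return line.rstrip()
    return None

if __name__ == "__main__":
    bad = first_bad_line("/var/log/ceph/ceph.log")
    print("clean" if bad is None else f"Found errors (ERR|WRN|SEC) in cluster log: {bad}")

Because the whole log is scanned after the run, a PG_AVAILABILITY warning that appeared and cleared much earlier still produces a match and fails the job.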

#18 Updated by Aishwarya Mathuria 5 months ago

Took a look at why peering was happening in the first place. Looking at the PG 7.16 logs below, we can see that the balancer remapped the PG with pg-upmap-items (0 -> 5), after which the acting set became [4,5]; this rebalancing of PGs is what starts the peering (see the sketch after the log excerpt below).
Maybe the test failures could be avoided by adding 'Reduced data availability' to the log-ignorelist in ignorelist_health.yaml?
I am still looking into why peering was taking so long.

2022-08-19T22:51:25.522+0000 7f3fefbf2700  1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] <== mgr.6216 172.21.15.138:0/40300 118 ==== mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 ==== 122+0+0 (secure 0 0 0) 0x5558b7630000 con 0x5558b69aa400
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).osd e66 preprocess_query mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700  7 mon.a@0(leader).osd e66 prepare_update mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).log v530  logging 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:25.578+0000 7f3fee3ef700  0 log_channel(audit) log [INF] : from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]': finished
2022-08-19T22:51:25.578+0000 7f3fee3ef700  2 mon.a@0(leader) e1 send_reply 0x5558b6558c30 0x5558b6bd1860 mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1
2022-08-19T22:51:25.578+0000 7f3fee3ef700  1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] --> 172.21.15.138:0/40300 -- mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1 -- 0x5558b6bd1860 con 0x5558b69aa400
2022-08-19T22:51:25.582+0000 7f3fee3ef700  7 mon.a@0(leader).log v531 update_from_paxos applying incremental log 531 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:27.530+0000 7f3fefbf2700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 2 pgs peering",
            "count": 2
        },
        "detail": [
            {
                "message": "pg 7.16 is stuck peering for 61s, current state peering, last acting [4,5]" 
            },
            {
                "message": "pg 8.25 is stuck peering for 61s, current state peering, last acting [1,6]" 
            }
        ]
    }
}
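
To double-check which PGs the balancer remapped right before peering began, the pg-upmap-items lines above can be parsed out of the mon log directly. A small sketch, assuming the audit-line format quoted above (the mon log path is hypothetical):

#!/usr/bin/env python3
# Sketch: extract the balancer's "osd pg-upmap-items" commands (pgid plus
# from->to OSD pairs) from a mon log, as in the audit lines quoted above.
import json
import re

# Matches e.g. cmd=[{"prefix": "osd pg-upmap-items", ..., "pgid": "7.16", "id": [0, 5]}]
UPMAP = re.compile(r'cmd=\[(\{"prefix": "osd pg-upmap-items".*?\})\]')

def upmap_commands(path: str):
    with open(path, errors="replace") as f:
        for line in f:
            m = UPMAP.search(line)
            if not m:
                continue
            cmd = json.loads(m.group(1))
            ts = line.split()[0]  # leading timestamp field of the mon log line
            ids = cmd.get("id", [])
            # "id" is a flat [from, to, from, to, ...] list of OSD pairs.
            yield ts, cmd.get("pgid"), list(zip(ids[0::2], ids[1::2]))

if __name__ == "__main__":
    # Hypothetical mon log path.
    for ts, pgid, mapping in upmap_commands("ceph-mon.a.log"):
        print(f"{ts} pg {pgid} pg-upmap-items {mapping}")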

#19 Updated by Radoslaw Zarzynski 5 months ago

Let's move back to it next week.
