
Bug #52624

qa: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)"

Added by Patrick Donnelly over 2 years ago. Updated 24 days ago.

Status:
New
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
reef, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/ceph/teuthology-archive/pdonnell-2021-09-14_01:17:08-fs-wip-pdonnell-testing-20210910.181451-distro-basic-smithi/6388614/teuthology.log

The test wasn't really doing anything to the cluster at the time; it may have stumbled on a bug of some kind.


Related issues

Duplicated by CephFS - Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" Duplicate

History

#1 Updated by Neha Ojha over 2 years ago

2021-09-14T02:04:30.392+0000 7f75fd230700  1 -- 172.21.15.134:0/17688 <== mon.0 v2:172.21.15.134:3300/0 162 ==== mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.2e", "id": [0, 5]}]=0 set 7.2e pg_upmap_items mapping to [0->5] v44) v1 ==== 155+0+0 (secure 0 0 0) 0x55d65a8c29c0 con 0x55d655fbd400
...
2021-09-14T02:04:32.312+0000 7f75eb753700 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 7.2e is stuck peering for 62s, current state peering, last acting [4,5]" 
            }
        ]
    }
}

Looks like peering induced by mapping change by the balancer. How often does this happen?
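The health-check payload above has a regular shape, so the stuck PGs can be pulled out programmatically. A minimal sketch (the function name and the assumption that you have the mgr "health checks" JSON as a string are mine, not from the tracker):

```python
import json
import re


def stuck_peering_pgs(health_checks_json):
    """Extract PG ids from a PG_AVAILABILITY health-check payload.

    Expects a dict shaped like the mgr "health checks" output above.
    """
    checks = json.loads(health_checks_json)
    pgs = []
    for entry in checks.get("PG_AVAILABILITY", {}).get("detail", []):
        # Detail messages look like:
        # "pg 7.2e is stuck peering for 62s, current state peering, ..."
        m = re.match(r"pg (\S+) is stuck peering", entry["message"])
        if m:
            pgs.append(m.group(1))
    return pgs


payload = '''{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {"message": "Reduced data availability: 1 pg peering", "count": 1},
        "detail": [
            {"message": "pg 7.2e is stuck peering for 62s, current state peering, last acting [4,5]"}
        ]
    }
}'''

print(stuck_peering_pgs(payload))  # ['7.2e']
```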

#2 Updated by Patrick Donnelly over 2 years ago

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

#3 Updated by Patrick Donnelly over 2 years ago

  • Duplicated by Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" added

#4 Updated by Patrick Donnelly over 2 years ago

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

#5 Updated by Neha Ojha over 2 years ago

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

#6 Updated by Patrick Donnelly over 2 years ago

Neha Ojha wrote:

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

Oldest occurrence I can find is

https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/

but there may be some I missed.

#7 Updated by Patrick Donnelly over 2 years ago

Patrick Donnelly wrote:

Neha Ojha wrote:

Patrick Donnelly wrote:

Patrick Donnelly wrote:

Neha Ojha wrote:

[...]

Looks like peering induced by mapping change by the balancer. How often does this happen?

Pretty rare, once in ~250 jobs so far.

3/262 jobs: https://pulpito.ceph.com/pdonnell-2021-09-17_20:49:46-fs-wip-pdonnell-testing-20210917.174826-distro-basic-smithi/

Did this start happening fairly recently? I'll take a look at the logs to see what I can find.

Oldest occurrence I can find is

https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/

but there may be some I missed.

Scratch that. Looks like the earliest I can grep for is:

/a/pdonnell-2021-05-12_04:01:31-fs-wip-pdonnell-testing-20210511.232042-distro-basic-smithi/6110639/teuthology.log.gz

This is before the .mgr pool change was merged (in June).

#10 Updated by Radoslaw Zarzynski almost 2 years ago

To judge how severe this problem really is, we need to know whether the stall is permanent (the PG gets stuck and makes no progress) or just a (slightly) delayed operation.

#15 Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee set to Aishwarya Mathuria

#17 Updated by Aishwarya Mathuria over 1 year ago

I have been going through the failure logs mentioned above and I see that the health check does pass eventually:

2022-08-19T23:14:08.032190+0000 mon.a (mon.0) 1636 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-08-19T23:14:09.980141+0000 mgr.y (mgr.9288) 36 : cluster [DBG] pgmap v37: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 1.1 KiB/s wr, 0 op/s
2022-08-19T23:14:11.980868+0000 mgr.y (mgr.9288) 37 : cluster [DBG] pgmap v38: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s
2022-08-19T23:14:13.981488+0000 mgr.y (mgr.9288) 38 : cluster [DBG] pgmap v39: 137 pgs: 137 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s; 0 B/s, 0 objects/s recovering
2022-08-19T23:14:14.032388+0000 mon.a (mon.0) 1637 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-08-19T23:14:14.032413+0000 mon.a (mon.0) 1638 : cluster [INF] Cluster is now healthy

But in tasks/ceph.py, here: https://github.com/ceph/ceph/blob/main/qa/tasks/ceph.py#L1138, we check for the first occurrence of WRN-level logs, so the test fails because it finds the following log:

2022-08-19T23:15:59.841 INFO:tasks.ceph:Checking cluster log for badness...
2022-08-19T23:15:59.841 DEBUG:teuthology.orchestra.run.smithi138:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2022-08-19T23:15:59.955 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-08-19T23:15:59.956 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-19T23:16:00.284 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)

If we look at the timestamp, that log line is from about 25 minutes before the cluster became healthy again; by 2022-08-19T23:16:00, when the check ran, the cluster was actually healthy.
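For reference, the check in qa/tasks/ceph.py effectively greps the whole cluster log for ERR/WRN/SEC lines, drops ignorelisted patterns, and fails on the first surviving match, regardless of whether the warning was later cleared. A simplified sketch of that first-match logic (the function name and the ignorelist subset are illustrative, not the actual QA code):

```python
import re

# Patterns the QA run is told to tolerate (a small subset, for illustration).
IGNORELIST = [
    r"\(MDS_ALL_DOWN\)",
    r"\(MDS_UP_LESS_THAN_MAX\)",
    r"overall HEALTH_",
    r"\(OSD_",
]


def first_badness(cluster_log_lines):
    """Return the first ERR/WRN/SEC line not covered by the ignorelist.

    Mirrors the first-match semantics of the egrep pipeline in the QA
    task: a warning that was later cleared still fails the run.
    """
    level = re.compile(r"\[ERR\]|\[WRN\]|\[SEC\]")
    for line in cluster_log_lines:
        if level.search(line) and not any(re.search(p, line) for p in IGNORELIST):
            return line
    return None


log = [
    "cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)",
    "cluster [INF] Health check cleared: PG_AVAILABILITY",
    "cluster [INF] Cluster is now healthy",
]
# Prints the PG_AVAILABILITY warning even though it cleared two lines later.
print(first_badness(log))
```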

#18 Updated by Aishwarya Mathuria over 1 year ago

Took a look at why peering was happening in the first place. Looking at the PG 7.16 logs below, we can see that the balancer changed the acting set from [0,5] to [4,5]; this rebalancing of PGs is what starts the peering.
Maybe the test failures can be avoided by adding 'Reduced data availability' to the log-ignorelist in ignorelist_health.yaml?
I am still looking into why peering took so long.

2022-08-19T22:51:25.522+0000 7f3fefbf2700  1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] <== mgr.6216 172.21.15.138:0/40300 118 ==== mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 ==== 122+0+0 (secure 0 0 0) 0x5558b7630000 con 0x5558b69aa400
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).osd e66 preprocess_query mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700  7 mon.a@0(leader).osd e66 prepare_update mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).log v530  logging 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:25.578+0000 7f3fee3ef700  0 log_channel(audit) log [INF] : from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]': finished
2022-08-19T22:51:25.578+0000 7f3fee3ef700  2 mon.a@0(leader) e1 send_reply 0x5558b6558c30 0x5558b6bd1860 mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1
2022-08-19T22:51:25.578+0000 7f3fee3ef700  1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] --> 172.21.15.138:0/40300 -- mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1 -- 0x5558b6bd1860 con 0x5558b69aa400
2022-08-19T22:51:25.582+0000 7f3fee3ef700  7 mon.a@0(leader).log v531 update_from_paxos applying incremental log 531 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:27.530+0000 7f3fefbf2700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 2 pgs peering",
            "count": 2
        },
        "detail": [
            {
                "message": "pg 7.16 is stuck peering for 61s, current state peering, last acting [4,5]" 
            },
            {
                "message": "pg 8.25 is stuck peering for 61s, current state peering, last acting [1,6]" 
            }
        ]
    }
}
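The ignorelist change suggested above would look roughly like this. This is a sketch following the usual teuthology override convention; the exact file location and surrounding keys in ignorelist_health.yaml are assumptions. Entries are regular expressions matched against cluster log lines:

```yaml
# Hypothetical addition to ignorelist_health.yaml; key placement is assumed
# from the standard teuthology ceph-task override layout.
overrides:
  ceph:
    log-ignorelist:
      - Reduced data availability
      - \(PG_AVAILABILITY\)
```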

#19 Updated by Radoslaw Zarzynski over 1 year ago

Let's move back to it next week.

#28 Updated by Radoslaw Zarzynski 9 months ago

Aishwarya, it started showing again. Could you please take a look?

#31 Updated by Radoslaw Zarzynski 9 months ago

  • Backport set to reef, quincy

#33 Updated by Aishwarya Mathuria 9 months ago

After taking a look at the logs, I think this is related to the following tracker: https://tracker.ceph.com/issues/51688

From monitor logs, the health check that caused the test to fail:


2023-06-14T02:35:19.373+0000 7f07df565700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 7.3e is stuck peering for 62s, current state peering, last acting [5,4]" 
            }
        ]
    }
}

However, according to the osd.5 logs, peering for PG 7.3e was over by then, and the PG was in the active state:

2023-06-14T02:35:19.020+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] remove_stray_recovery_sources remove osd 0 from missing_loc
2023-06-14T02:35:19.020+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] update_heartbeat_peers 0,4,5 -> 4,5
2023-06-14T02:35:19.020+0000 7fb5b3b79700 20 osd.5 47 need_heartbeat_peer_update
2023-06-14T02:35:19.020+0000 7fb5b3b79700 20 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] prepare_stats_for_publish reporting purged_snaps []

Peering for PG 7.3e took around one second:

2023-06-14T02:35:17.916+0000 7fb5b3b79700  5 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] enter Started/Primary/Peering/GetInfo
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior all_probe 0,5
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior maybe_rw interval:42, acting: 0,5
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior final: probe 0,4,5 down  blocked_by {}
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] up_thru 42 < same_since 46, must notify monitor
.
.
.
2023-06-14T02:35:18.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.759792328s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] state<Started/Primary/Peering>: Leaving Peering
2023-06-14T02:35:18.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.759792328s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 unknown pruub 161.579864502s@ mbc={}] state<Started/Primary/Active>: In Active, about to call activate
2023-06-14T02:35:18.916+0000 7fb5b3b79700 20 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 activating mbc={}] update_calc_stats no peer_missing found for 0

There is a PR out for review for the tracker I mentioned; I will do a test run with it and update this tracker if it fixes the issue.

#35 Updated by Radoslaw Zarzynski 8 months ago

In addition to the suggestion that this is a duplicate, an alternative hypothesis is a ceph-mgr problem. My understanding of Aishwarya's last comment is that the issue isn't an actual availability problem: the PG was active, and only the mgr's report was inaccurate.

#37 Updated by Radoslaw Zarzynski 5 months ago

In response to #33: this seems to be something different, as the fix for https://tracker.ceph.com/issues/51688 was merged in July.

#39 Updated by Laura Flores 5 months ago

/a/yuriw-2023-10-02_20:49:32-rados-wip-yuri5-testing-2023-10-02-1105-distro-default-smithi/7409028

#40 Updated by Laura Flores 5 months ago

  • Tags set to test-failure

#48 Updated by Radoslaw Zarzynski 24 days ago

https://github.com/ceph/ceph/pull/49332 got merged to main on Jul 19 2023.

#49 Updated by Laura Flores 24 days ago

Hey Shreyansh, can you take a look at this and see if it's related to what was going on in https://tracker.ceph.com/issues/51688?
