Bug #52624
openqa: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)"
Description
/ceph/teuthology-archive/pdonnell-2021-09-14_01:17:08-fs-wip-pdonnell-testing-20210910.181451-distro-basic-smithi/6388614/teuthology.log
Test wasn't really doing anything to the cluster at the time. May have stumbled on a bug of some kind.
Updated by Neha Ojha over 2 years ago
2021-09-14T02:04:30.392+0000 7f75fd230700 1 -- 172.21.15.134:0/17688 <== mon.0 v2:172.21.15.134:3300/0 162 ==== mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.2e", "id": [0, 5]}]=0 set 7.2e pg_upmap_items mapping to [0->5] v44) v1 ==== 155+0+0 (secure 0 0 0) 0x55d65a8c29c0 con 0x55d655fbd400
...
2021-09-14T02:04:32.312+0000 7f75eb753700 20 mgr.server operator() health checks:
{
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 1 pg peering",
"count": 1
},
"detail": [
{
"message": "pg 7.2e is stuck peering for 62s, current state peering, last acting [4,5]"
}
]
}
}
Looks like peering induced by mapping change by the balancer. How often does this happen?
Updated by Patrick Donnelly over 2 years ago
Neha Ojha wrote:
[...]
Looks like peering induced by mapping change by the balancer. How often does this happen?
Pretty rare, once in ~250 jobs so far.
Updated by Patrick Donnelly over 2 years ago
- Has duplicate Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" added
Updated by Patrick Donnelly over 2 years ago
Patrick Donnelly wrote:
Neha Ojha wrote:
[...]
Looks like peering induced by mapping change by the balancer. How often does this happen?
Pretty rare, once in ~250 jobs so far.
Updated by Neha Ojha over 2 years ago
Patrick Donnelly wrote:
Patrick Donnelly wrote:
Neha Ojha wrote:
[...]
Looks like peering induced by mapping change by the balancer. How often does this happen?
Pretty rare, once in ~250 jobs so far.
Did this start happening fairly recently? I'll take a look at the logs to see what I can find.
Updated by Patrick Donnelly over 2 years ago
Neha Ojha wrote:
Patrick Donnelly wrote:
Patrick Donnelly wrote:
Neha Ojha wrote:
[...]
Looks like peering induced by mapping change by the balancer. How often does this happen?
Pretty rare, once in ~250 jobs so far.
Did this start happening fairly recently? I'll take a look at the logs to see what I can find.
Oldest occurrence I can find is
https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/
but there may be some I missed.
Updated by Patrick Donnelly over 2 years ago
Patrick Donnelly wrote:
Neha Ojha wrote:
Patrick Donnelly wrote:
Patrick Donnelly wrote:
Neha Ojha wrote:
[...]
Looks like peering induced by mapping change by the balancer. How often does this happen?
Pretty rare, once in ~250 jobs so far.
Did this start happening fairly recently? I'll take a look at the logs to see what I can find.
Oldest occurrence I can find is
https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/
but there may be some I missed.
Scratch that. Looks like the earliest I can grep for is:
/a/pdonnell-2021-05-12_04:01:31-fs-wip-pdonnell-testing-20210511.232042-distro-basic-smithi/6110639/teuthology.log.gz
This is before the .mgr pool change was merged (in June).
Updated by Kotresh Hiremath Ravishankar over 2 years ago
Seen in this pacific run as well.
Updated by Radoslaw Zarzynski almost 2 years ago
To judge how severe the problem really is, we need to know whether the stall is permanent (the PG gets stuck and makes no progress) or just a (slightly) delayed operation.
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Seen in these pacific runs
1. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959069
2. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959102
3. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959168
4. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959193
Updated by Radoslaw Zarzynski over 1 year ago
- Assignee set to Aishwarya Mathuria
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Seen in these recent pacific runs:
1. https://pulpito.ceph.com/yuriw-2022-08-18_23:16:33-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6979462
2. https://pulpito.ceph.com/yuriw-2022-08-18_23:16:33-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6979495
3. https://pulpito.ceph.com/yuriw-2022-08-19_21:01:11-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6981264
4. https://pulpito.ceph.com/yuriw-2022-08-19_21:01:11-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6981270
Updated by Aishwarya Mathuria over 1 year ago
I have been going through the failure logs mentioned above and I see that the health check does pass eventually:
2022-08-19T23:14:08.032190+0000 mon.a (mon.0) 1636 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-08-19T23:14:09.980141+0000 mgr.y (mgr.9288) 36 : cluster [DBG] pgmap v37: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 1.1 KiB/s wr, 0 op/s
2022-08-19T23:14:11.980868+0000 mgr.y (mgr.9288) 37 : cluster [DBG] pgmap v38: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s
2022-08-19T23:14:13.981488+0000 mgr.y (mgr.9288) 38 : cluster [DBG] pgmap v39: 137 pgs: 137 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s; 0 B/s, 0 objects/s recovering
2022-08-19T23:14:14.032388+0000 mon.a (mon.0) 1637 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-08-19T23:14:14.032413+0000 mon.a (mon.0) 1638 : cluster [INF] Cluster is now healthy
But in qa/tasks/ceph.py, here: https://github.com/ceph/ceph/blob/main/qa/tasks/ceph.py#L1138, we check for the first occurrence of WRN-level logs.
So the test fails because it finds the following log:
2022-08-19T23:15:59.841 INFO:tasks.ceph:Checking cluster log for badness...
2022-08-19T23:15:59.841 DEBUG:teuthology.orchestra.run.smithi138:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2022-08-19T23:15:59.955 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-08-19T23:15:59.956 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-19T23:16:00.284 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
If we look at the timestamps, that log line is from roughly 25 minutes earlier; by 2022-08-19T23:16:00, when the cluster log was scanned, the cluster was actually healthy again.
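For illustration, here is a minimal Python sketch of that behaviour (an approximation, not the actual qa/tasks/ceph.py code; the ignorelist and the sample log lines are abridged): the check scans the entire cluster log for any [WRN]/[ERR]/[SEC] line not covered by the ignorelist, so a warning that was subsequently cleared still fails the run.

import re

# Abridged, illustrative ignorelist -- the real one is assembled from the
# job's yaml fragments and is much longer.
IGNORELIST = [r"\(MDS_ALL_DOWN\)", r"\(OSD_DOWN\)", r"overall HEALTH_"]

def first_badness(cluster_log_lines):
    """Return the first non-ignored WRN/ERR/SEC line, or None."""
    for line in cluster_log_lines:
        if not re.search(r"\[(ERR|WRN|SEC)\]", line):
            continue
        if any(re.search(pattern, line) for pattern in IGNORELIST):
            continue
        # The first hit fails the whole run, regardless of later recovery.
        return line
    return None

log = [
    "2022-08-19T22:51:27 mon.a [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)",
    "2022-08-19T23:14:14 mon.a [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)",
    "2022-08-19T23:14:14 mon.a [INF] Cluster is now healthy",
]
print(first_badness(log))  # still reports the long-since-cleared PG_AVAILABILITY warning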
Updated by Aishwarya Mathuria over 1 year ago
Took a look at why peering was happening in the first place. Looking at the PG 7.16 logs below, we can see that the balancer changed the OSD mapping from [0,5] to [4,5]; this rebalancing of PGs is what triggers the peering.
Maybe the test failures can be avoided by adding 'Reduced data availability' to the log-ignorelist in ignorelist_health.yaml? (See the sketch after the log excerpt below.)
I am still looking into why peering took so long.
2022-08-19T22:51:25.522+0000 7f3fefbf2700 1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] <== mgr.6216 172.21.15.138:0/40300 118 ==== mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 ==== 122+0+0 (secure 0 0 0) 0x5558b7630000 con 0x5558b69aa400
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).osd e66 preprocess_query mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700 7 mon.a@0(leader).osd e66 prepare_update mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).log v530 logging 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:25.578+0000 7f3fee3ef700 0 log_channel(audit) log [INF] : from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]': finished
2022-08-19T22:51:25.578+0000 7f3fee3ef700 2 mon.a@0(leader) e1 send_reply 0x5558b6558c30 0x5558b6bd1860 mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1
2022-08-19T22:51:25.578+0000 7f3fee3ef700 1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] --> 172.21.15.138:0/40300 -- mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1 -- 0x5558b6bd1860 con 0x5558b69aa400
2022-08-19T22:51:25.582+0000 7f3fee3ef700 7 mon.a@0(leader).log v531 update_from_paxos applying incremental log 531 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:27.530+0000 7f3fefbf2700 20 mon.a@0(leader).mgrstat health checks:
{
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 2 pgs peering",
"count": 2
},
"detail": [
{
"message": "pg 7.16 is stuck peering for 61s, current state peering, last acting [4,5]"
},
{
"message": "pg 8.25 is stuck peering for 61s, current state peering, last acting [1,6]"
}
]
}
}
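A standalone sketch of the workaround suggested above follows. The pattern list here is hypothetical -- the real ignorelist lives in the qa suite's ignorelist_health.yaml and is not reproduced verbatim. The point is simply that once 'Reduced data availability' is on the log-ignorelist, the cluster log scraper no longer turns this transient warning into a test failure.

import re

# Hypothetical excerpt of a log-ignorelist, with the proposed addition.
log_ignorelist = [r"\(MDS_ALL_DOWN\)", r"Reduced data availability"]

warning = ("2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] "
           "Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)")

ignored = any(re.search(pattern, warning) for pattern in log_ignorelist)
print("ignored" if ignored else "would fail the run")  # -> ignored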
Updated by Kotresh Hiremath Ravishankar 11 months ago
Updated by Radoslaw Zarzynski 11 months ago
Aishwarya, it started showing up again. Could you please take a look?
Updated by Milind Changire 10 months ago
quincy:
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-13_23:20:02-fs-wip-yuri3-testing-2023-06-13-1204-quincy-distro-default-smithi/7303222
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-13_23:20:02-fs-wip-yuri3-testing-2023-06-13-1204-quincy-distro-default-smithi/7303379
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-13_23:20:02-fs-wip-yuri3-testing-2023-06-13-1204-quincy-distro-default-smithi/7303257
Updated by Aishwarya Mathuria 10 months ago
After taking a look at the logs, I think this is related to the following tracker: https://tracker.ceph.com/issues/51688
From the monitor logs, this is the health check that caused the test to fail:
2023-06-14T02:35:19.373+0000 7f07df565700 20 mon.a@0(leader).mgrstat health checks:
{
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 1 pg peering",
"count": 1
},
"detail": [
{
"message": "pg 7.3e is stuck peering for 62s, current state peering, last acting [5,4]"
}
]
}
}
However, according to the osd.5 logs, peering for PG 7.3e was over by then, and at that point the PG was already in the active state:
2023-06-14T02:35:19.020+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] remove_stray_recovery_sources remove osd 0 from missing_loc
2023-06-14T02:35:19.020+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] update_heartbeat_peers 0,4,5 -> 4,5
2023-06-14T02:35:19.020+0000 7fb5b3b79700 20 osd.5 47 need_heartbeat_peer_update
2023-06-14T02:35:19.020+0000 7fb5b3b79700 20 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=46/46 les/c/f=47/47/0 sis=46) [5,4] r=0 lpr=46 crt=0'0 mlcod 0'0 active+clean] prepare_stats_for_publish reporting purged_snaps []
Peering for PG 7.3e took around one second (see the timestamp check after the excerpt below):
2023-06-14T02:35:17.916+0000 7fb5b3b79700 5 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] enter Started/Primary/Peering/GetInfo
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior all_probe 0,5
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior maybe_rw interval:42, acting: 0,5
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] build_prior final: probe 0,4,5 down blocked_by {}
2023-06-14T02:35:17.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 46 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.765387535s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] up_thru 42 < same_since 46, must notify monitor
.
.
.
2023-06-14T02:35:18.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.759792328s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 peering pruub 161.579864502s@ mbc={}] state<Started/Primary/Peering>: Leaving Peering
2023-06-14T02:35:18.916+0000 7fb5b3b79700 10 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=42/43 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46 pruub=11.759792328s) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 unknown pruub 161.579864502s@ mbc={}] state<Started/Primary/Active>: In Active, about to call activate
2023-06-14T02:35:18.916+0000 7fb5b3b79700 20 osd.5 pg_epoch: 47 pg[7.3e( empty local-lis/les=46/47 n=0 ec=42/42 lis/c=42/42 les/c/f=43/43/0 sis=46) [5,4] r=0 lpr=46 pi=[42,46)/1 crt=0'0 mlcod 0'0 activating mbc={}] update_calc_stats no peer_missing found for 0
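As a quick sanity check on the one-second figure (a throwaway snippet, not part of the test suite), the gap between the 'enter Started/Primary/Peering/GetInfo' entry and the 'Leaving Peering' entry in the osd.5 log can be computed directly from the timestamps:

from datetime import datetime

# Timestamps copied from the osd.5 log excerpt above.
enter_peering = datetime.fromisoformat("2023-06-14T02:35:17.916000+00:00")
leave_peering = datetime.fromisoformat("2023-06-14T02:35:18.916000+00:00")
print((leave_peering - enter_peering).total_seconds())  # -> 1.0 second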
There is a PR out for review for the tracker mentioned above; I will do a test run with it and update this tracker if it fixes the issue.
Updated by Radoslaw Zarzynski 9 months ago
In addition to the suggestion that this is a duplicate, an alternative hypothesis is a problem in ceph-mgr. My understanding of Aishwarya's last comment is that the issue isn't an actual availability problem – the PG was active; only the mgr's report was inaccurate.
Updated by Radoslaw Zarzynski 7 months ago
In response to #33: this seems to be something different, as the fix for https://tracker.ceph.com/issues/51688 was merged in July.
Updated by Laura Flores 7 months ago
/a/yuriw-2023-10-02_20:49:32-rados-wip-yuri5-testing-2023-10-02-1105-distro-default-smithi/7409028
Updated by Laura Flores 7 months ago
- Tags set to test-failure
Updated by Kotresh Hiremath Ravishankar 5 months ago
Updated by Kotresh Hiremath Ravishankar 4 months ago
https://pulpito.ceph.com/yuriw-2023-12-26_19:48:51-fs-wip-yuri-testing-2023-12-26-0957-quincy-distro-default-smithi/7501710
https://pulpito.ceph.com/yuriw-2023-12-26_19:48:51-fs-wip-yuri-testing-2023-12-26-0957-quincy-distro-default-smithi/7501722
https://pulpito.ceph.com/yuriw-2023-12-26_19:48:51-fs-wip-yuri-testing-2023-12-26-0957-quincy-distro-default-smithi/7501800
Updated by Milind Changire 3 months ago
quincy:
https://pulpito.ceph.com/yuriw-2024-01-26_01:07:29-fs-wip-yuri4-testing-2024-01-25-1331-quincy-distro-default-smithi/7533360
https://pulpito.ceph.com/yuriw-2024-01-26_01:07:29-fs-wip-yuri4-testing-2024-01-25-1331-quincy-distro-default-smithi/7533490
https://pulpito.ceph.com/yuriw-2024-01-26_01:07:29-fs-wip-yuri4-testing-2024-01-25-1331-quincy-distro-default-smithi/7533514
https://pulpito.ceph.com/yuriw-2024-01-26_01:07:29-fs-wip-yuri4-testing-2024-01-25-1331-quincy-distro-default-smithi/7533611
Updated by Milind Changire 2 months ago
reef:
https://pulpito.ceph.com/vshankar-2024-02-02_11:17:23-fs-wip-vshankar-testing1-reef-2024-02-02-1152-testing-default-smithi/7543396
https://pulpito.ceph.com/vshankar-2024-02-02_11:17:23-fs-wip-vshankar-testing1-reef-2024-02-02-1152-testing-default-smithi/7543394
https://pulpito.ceph.com/vshankar-2024-02-02_11:17:23-fs-wip-vshankar-testing1-reef-2024-02-02-1152-testing-default-smithi/7543410
Updated by Radoslaw Zarzynski 2 months ago
https://github.com/ceph/ceph/pull/49332 got merged to main on Jul 19 2023.
Updated by Laura Flores 2 months ago
Hey Shreyansh, can you take a look at this and see if it's related to what was going on in https://tracker.ceph.com/issues/51688?
Updated by Sridhar Seshasayee about 1 month ago
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587455
Updated by Patrick Donnelly about 1 month ago
- Related to Fix #64984: qa: probabilistically ignore PG_AVAILABILITY/PG_DEGRADED added
Updated by Laura Flores 21 days ago
- Related to Bug #65231: upgrade/quincy-x/parallel: "Reduced data availability: 1 pg peering (PG_AVAILABILITY)" in cluster log added