Bug #52624
qa: "Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)"
Description
/ceph/teuthology-archive/pdonnell-2021-09-14_01:17:08-fs-wip-pdonnell-testing-20210910.181451-distro-basic-smithi/6388614/teuthology.log
Test wasn't really doing anything to the cluster at the time. May have stumbled on a bug of some kind.
History
#1 Updated by Neha Ojha over 1 year ago
2021-09-14T02:04:30.392+0000 7f75fd230700 1 -- 172.21.15.134:0/17688 <== mon.0 v2:172.21.15.134:3300/0 162 ==== mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.2e", "id": [0, 5]}]=0 set 7.2e pg_upmap_items mapping to [0->5] v44) v1 ==== 155+0+0 (secure 0 0 0) 0x55d65a8c29c0 con 0x55d655fbd400
...
2021-09-14T02:04:32.312+0000 7f75eb753700 20 mgr.server operator() health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg peering",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 7.2e is stuck peering for 62s, current state peering, last acting [4,5]"
            }
        ]
    }
}
Looks like peering induced by mapping change by the balancer. How often does this happen?
#2 Updated by Patrick Donnelly over 1 year ago
Neha Ojha wrote:
[...]
Looks like peering induced by mapping change by the balancer. How often does this happen?
Pretty rare, once in ~250 jobs so far.
#3 Updated by Patrick Donnelly over 1 year ago
- Duplicated by Bug #52607: qa: "mon.a (mon.0) 1022 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)" added
#4 Updated by Patrick Donnelly over 1 year ago
Patrick Donnelly wrote:
Neha Ojha wrote:
[...]
Looks like peering induced by mapping change by the balancer. How often does this happen?
Pretty rare, once in ~250 jobs so far.
#5 Updated by Neha Ojha over 1 year ago
Patrick Donnelly wrote:
[...]
Pretty rare, once in ~250 jobs so far.
Did this start happening fairly recently? I'll take a look at the logs to see what I can find.
#6 Updated by Patrick Donnelly over 1 year ago
Neha Ojha wrote:
[...]
Did this start happening fairly recently? I'll take a look at the logs to see what I can find.
Oldest occurrence I can find is
https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/
but there may be some I missed.
#7 Updated by Patrick Donnelly over 1 year ago
Patrick Donnelly wrote:
[...]
Oldest occurrence I can find is
https://pulpito.ceph.com/teuthology-2021-08-03_03:15:03-fs-master-distro-basic-gibba/6308125/
but there may be some I missed.
Scratch that. Looks like the earliest I can grep for is:
/a/pdonnell-2021-05-12_04:01:31-fs-wip-pdonnell-testing-20210511.232042-distro-basic-smithi/6110639/teuthology.log.gz
This is before the .mgr pool change was merged (in June).
#8 Updated by Kotresh Hiremath Ravishankar about 1 year ago
Seen in this pacific run as well.
#10 Updated by Radoslaw Zarzynski 9 months ago
To judge how severe the problem really is, we need to know whether the stall is permanent (the PG gets stuck and makes no progress) or whether it is just a (slightly) delayed operation.
#11 Updated by Kotresh Hiremath Ravishankar 6 months ago
#14 Updated by Kotresh Hiremath Ravishankar 6 months ago
Seen in these pacific runs:
1. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959069
2. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959102
3. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959168
4. https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959193
#15 Updated by Radoslaw Zarzynski 6 months ago
- Assignee set to Aishwarya Mathuria
#16 Updated by Kotresh Hiremath Ravishankar 5 months ago
Seen in these recent pacific runs:
1. https://pulpito.ceph.com/yuriw-2022-08-18_23:16:33-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6979462
2. https://pulpito.ceph.com/yuriw-2022-08-18_23:16:33-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6979495
3. https://pulpito.ceph.com/yuriw-2022-08-19_21:01:11-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6981264
4. https://pulpito.ceph.com/yuriw-2022-08-19_21:01:11-fs-wip-yuri10-testing-2022-08-18-1400-pacific-distro-default-smithi/6981270
#17 Updated by Aishwarya Mathuria 5 months ago
I have been going through the failure logs mentioned above and I see that the health check does pass eventually:
2022-08-19T23:14:08.032190+0000 mon.a (mon.0) 1636 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-08-19T23:14:09.980141+0000 mgr.y (mgr.9288) 36 : cluster [DBG] pgmap v37: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 1.1 KiB/s wr, 0 op/s
2022-08-19T23:14:11.980868+0000 mgr.y (mgr.9288) 37 : cluster [DBG] pgmap v38: 137 pgs: 2 peering, 135 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s
2022-08-19T23:14:13.981488+0000 mgr.y (mgr.9288) 38 : cluster [DBG] pgmap v39: 137 pgs: 137 active+clean; 11 KiB data, 170 MiB used, 720 GiB / 720 GiB avail; 127 B/s wr, 0 op/s; 0 B/s, 0 objects/s recovering
2022-08-19T23:14:14.032388+0000 mon.a (mon.0) 1637 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-08-19T23:14:14.032413+0000 mon.a (mon.0) 1638 : cluster [INF] Cluster is now healthy
But in qa/tasks/ceph.py, here: https://github.com/ceph/ceph/blob/main/qa/tasks/ceph.py#L1138, we only report the first occurrence of an ERR/WRN/SEC-level cluster log line.
So the test fails because it finds the following log:
2022-08-19T23:15:59.841 INFO:tasks.ceph:Checking cluster log for badness...
2022-08-19T23:15:59.841 DEBUG:teuthology.orchestra.run.smithi138:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2022-08-19T23:15:59.955 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-08-19T23:15:59.956 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-19T23:16:00.284 INFO:teuthology.orchestra.run.smithi138.stdout:2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
If we look at the timestamp, that log line is from about 25 minutes earlier, before the cluster became healthy again; at 2022-08-19T23:16:00, when the check ran, the cluster was actually healthy again.
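To make the first-match behaviour concrete, here is a minimal Python sketch of the kind of scan described above. It is not the actual qa/tasks/ceph.py implementation (which shells out to the egrep pipeline shown in the log), and the helper name scan_cluster_log plus the trimmed-down ignore list are made up for illustration.

import re

def scan_cluster_log(log_path, ignore_patterns):
    # Hypothetical helper: return the first ERR/WRN/SEC cluster-log line
    # that is not matched by any ignore pattern, or None.
    levels = re.compile(r'\[ERR\]|\[WRN\]|\[SEC\]')
    ignores = [re.compile(p) for p in ignore_patterns]
    with open(log_path) as f:
        for line in f:
            if not levels.search(line):
                continue
            if any(p.search(line) for p in ignores):
                continue
            # Stop at the first offending line -- even if, as in this run,
            # the cluster returned to HEALTH_OK long before the scan ran.
            return line
    return None

first_bad = scan_cluster_log(
    '/var/log/ceph/ceph.log',
    [r'\(MDS_ALL_DOWN\)', r'\(MDS_UP_LESS_THAN_MAX\)',
     r'overall HEALTH_', r'\(OSD_'],
)
if first_bad is not None:
    print('Found errors (ERR|WRN|SEC) in cluster log:', first_bad)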
#18 Updated by Aishwarya Mathuria 5 months ago
I took a look at why peering was happening in the first place. Looking at the PG 7.16 logs below, we can see that the balancer remapped the PG via osd pg-upmap-items (OSD 0 -> OSD 5; last acting is now [4,5]). This rebalancing of PGs is what causes the peering to start.
Maybe the test failures can be avoided by adding 'Reduced data availability' to the log-ignorelist in ignorelist_health.yaml? (A sketch of the effect follows the log excerpt below.)
I am still taking a look at why peering was taking so long.
2022-08-19T22:51:25.522+0000 7f3fefbf2700 1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] <== mgr.6216 172.21.15.138:0/40300 118 ==== mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 ==== 122+0+0 (secure 0 0 0) 0x5558b7630000 con 0x5558b69aa400
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).osd e66 preprocess_query mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700 7 mon.a@0(leader).osd e66 prepare_update mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]} v 0) v1 from mgr.6216 172.21.15.138:0/40300
2022-08-19T22:51:25.523+0000 7f3fefbf2700 10 mon.a@0(leader).log v530 logging 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:25.578+0000 7f3fee3ef700 0 log_channel(audit) log [INF] : from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd='[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]': finished
2022-08-19T22:51:25.578+0000 7f3fee3ef700 2 mon.a@0(leader) e1 send_reply 0x5558b6558c30 0x5558b6bd1860 mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1
2022-08-19T22:51:25.578+0000 7f3fee3ef700 1 -- [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] --> 172.21.15.138:0/40300 -- mon_command_ack([{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]=0 set 7.16 pg_upmap_items mapping to [0->5] v67) v1 -- 0x5558b6bd1860 con 0x5558b69aa400
2022-08-19T22:51:25.582+0000 7f3fee3ef700 7 mon.a@0(leader).log v531 update_from_paxos applying incremental log 531 2022-08-19T22:51:25.523722+0000 mon.a (mon.0) 576 : audit [INF] from='mgr.6216 172.21.15.138:0/40300' entity='mgr.y' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "7.16", "id": [0, 5]}]: dispatch
2022-08-19T22:51:27.530+0000 7f3fefbf2700 20 mon.a@0(leader).mgrstat health checks:
{
"PG_AVAILABILITY": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Reduced data availability: 2 pgs peering",
"count": 2
},
"detail": [
{
"message": "pg 7.16 is stuck peering for 61s, current state peering, last acting [4,5]"
},
{
"message": "pg 8.25 is stuck peering for 61s, current state peering, last acting [1,6]"
}
]
}
}
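To make the ignorelist suggestion concrete, here is a hedged continuation of the hypothetical sketch from #17. The real exclusion patterns come from ignorelist_health.yaml and the egrep -v chain shown in #17; the pattern list below is abbreviated for illustration.

import re

# Exclusion patterns as in the egrep -v chain from #17, plus the proposed entry.
ignore_patterns = [
    r'\(MDS_ALL_DOWN\)',
    r'\(MDS_UP_LESS_THAN_MAX\)',
    r'overall HEALTH_',
    r'Reduced data availability',  # proposed addition from this comment
]

# The line that currently fails the job (from the #17 excerpt):
line = ('2022-08-19T22:51:27.586480+0000 mon.a (mon.0) 590 : cluster [WRN] '
        'Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)')

# With the extra pattern, the first-match scan would skip this line instead of
# reporting it as badness.
assert any(re.search(p, line) for p in ignore_patterns)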
#19 Updated by Radoslaw Zarzynski 5 months ago
Let's come back to this next week.