Bug #45619
Health check failed: Reduced data availability: PG_AVAILABILITY
Description
Multiple RGW tests are failing on different branches, with:
failure_reason: '"2020-05-19T22:16:08.390058+0000 mon.b (mon.0) 275 : cluster [WRN] Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)" in cluster log'
see:
http://pulpito.front.sepia.ceph.com/yuvalif-2020-05-19_14:52:46-rgw:verify-fix-amqp-urls-with-vhosts-distro-basic-smithi/
http://pulpito.front.sepia.ceph.com/yuvalif-2020-05-19_16:05:16-rgw:verify-fix_issue_45586-distro-basic-smithi/
http://pulpito.front.sepia.ceph.com/yuvalif-2020-05-19_16:12:48-rgw:verify-cls_2pc_q_async_api-distro-basic-smithi/
History
#1 Updated by Neha Ojha almost 4 years ago
- Status changed from New to Need More Info
Is this something that has started appearing recently? If not, probably just needs whitelisting.
#2 Updated by Kefu Chai almost 4 years ago
It's a new thing. Also, before whitelisting it, we'd better figure out why we should whitelist it.
#3 Updated by Casey Bodley almost 4 years ago
We've been seeing a lot of this in the rgw suite over the last month or two.
#4 Updated by Neha Ojha almost 4 years ago
- Status changed from Need More Info to New
- Priority changed from Normal to High
Seen in the rados suite: /a/nojha-2020-05-21_19:33:40-rados-wip-32601-distro-basic-smithi/5077159/
#5 Updated by Neha Ojha almost 4 years ago
- Subject changed from teuthology tests are failing with: "Health check failed" to Health check failed: Reduced data availability: PG_AVAILABILITY
Neha Ojha wrote:
Seen in the rados suite: /a/nojha-2020-05-21_19:33:40-rados-wip-32601-distro-basic-smithi/5077159/
Here the PG_AVAILABILITY flag is added due to the balancer making changes to pg 16.4.
from the mgr.x's log:
2020-05-21T22:25:16.913+0000 7fa6a4f58700 0 [balancer DEBUG root] Waking up [active, now 2020-05-21_22:25:16]
2020-05-21T22:25:16.917+0000 7fa6a4f58700 0 [balancer INFO root] ceph osd pg-upmap-items 16.4 mappings [{'from': 4, 'to': 3}]
2020-05-21T22:25:16.917+0000 7fa6a4f58700 1 -- 172.21.15.188:0/12583 --> [v2:172.21.15.188:3300/0,v1:172.21.15.188:6789/0] -- mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.4", "id": [4, 3]} v 0) v1 -- 0x55a7faf62580 con 0x55a8002d6c00
from mon.a's log:
All the PGs were active+clean until here.
2020-05-21T22:25:16.931+0000 7f1540539700 20 mon.a@0(leader).mgrstat pending_digest: "num_pg_by_state": [ { "state": "active+clean", "num": 41 } ],
This is where we see PG_AVAILABILITY
2020-05-21T22:25:18.931+0000 7f1540539700 20 mon.a@0(leader).mgrstat pending_digest: "num_pg_by_state": [ { "state": "active+clean", "num": 40 }, { "state": "peering", "num": 1 } ],
...
2020-05-21T22:25:18.931+0000 7f1540539700 20 mon.a@0(leader).mgrstat health checks: { "PG_AVAILABILITY": { "severity": "HEALTH_WARN", "summary": { "message": "Reduced data availability: 1 pg inactive, 1 pg peering", "count": 2 }, "detail": [ { "message": "pg 16.4 is stuck peering since forever, current state peering, last acting [3,5]" } ]
What we see in between is the pg-upmap-items command being applied:
2020-05-21T22:25:17.087+0000 7f153ed36700 7 mon.a@0(leader).log v90 update_from_paxos applying incremental log 90
2020-05-21T22:25:16.920845+0000 mon.b (mon.1) 94 : audit [INF] from='mgr.4102 172.21.15.188:0/12583' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.4", "id": [4, 3]}]: dispatch
From osd.3's perspective, we create the pg, finish peering, and then go into active+clean.
2020-05-21T22:25:17.119+0000 7ff5beecd700 10 osd.3 66 _make_pg 16.4
...
2020-05-21T22:25:18.223+0000 7ff5beecd700 10 osd.3 pg_epoch: 67 pg[16.4( empty local-lis/les=66/67 n=0 ec=63/63 lis/c=66/66 les/c/f=67/67/0 sis=66) [3,5] r=0 lpr=66 crt=0'0 mlcod 0'0 active+clean] share_pg_info
The later health checks report all PGs active+clean.
2020-05-21T22:25:24.935+0000 7f1540539700 20 mon.a@0(leader).mgrstat pending_digest: "num_pg_by_state": [ { "state": "active+clean", "num": 41 } ],
This warning is expected when the pg is temporarily unavailable. I'm pretty sure we'll see similar symptoms in the rgw tests, but I will look at the logs and confirm.
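The transition above (all active+clean, then one pg peering, then all active+clean again) can be spotted quickly by pulling the "num_pg_by_state" fragment out of the mon's mgrstat pending_digest lines. An illustrative sketch, not part of the tracker or any Ceph tooling; the log format is as shown in the excerpts above:

```python
import json
import re

# Matches the JSON array that follows "num_pg_by_state" in a
# mon.*.mgrstat pending_digest log line (non-greedy: the array
# contains no nested square brackets).
DIGEST_RE = re.compile(r'pending_digest:\s*"num_pg_by_state":\s*(\[.*?\])')

def pg_states(log_line):
    """Return {state: count} parsed from a pending_digest log line,
    or {} if the line carries no digest."""
    m = DIGEST_RE.search(log_line)
    if not m:
        return {}
    return {e["state"]: e["num"] for e in json.loads(m.group(1))}

line = ('2020-05-21T22:25:18.931+0000 7f1540539700 20 mon.a@0(leader).mgrstat '
        'pending_digest: "num_pg_by_state": [ { "state": "active+clean", "num": 40 }, '
        '{ "state": "peering", "num": 1 } ],')
print(pg_states(line))  # {'active+clean': 40, 'peering': 1}
```

Running this over a mon log in timestamp order shows the peering count rise to 1 and drop back to 0 within a couple of seconds, which is exactly the transient unavailability the warning reflects.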
#6 Updated by Neha Ojha almost 4 years ago
- Status changed from New to Triaged
- Assignee set to Neha Ojha
Same symptoms as https://tracker.ceph.com/issues/45619#note-5.
from mgr.x's log:
2020-05-19T22:16:06.479+0000 7f9f3e096700 0 [balancer INFO root] ceph osd pg-upmap-items 16.9 mappings [{'from': 4, 'to': 3}]
2020-05-19T22:16:06.479+0000 7f9f3e096700 1 -- 172.21.15.63:0/25768 --> [v2:172.21.15.63:3300/0,v1:172.21.15.63:6789/0] -- mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.9", "id": [4, 3]} v 0) v1 -- 0x562f63429680 con 0x562f68734800
from mon.b's log:
2020-05-19T22:16:05.794+0000 7f4484506700 20 mon.b@0(leader).mgrstat pending_digest: "num_pg_by_state": [ { "state": "active+clean", "num": 297 } ],
2020-05-19T22:16:07.386+0000 7f4482d03700 7 mon.b@0(leader).log v93 update_from_paxos applying incremental log 93
2020-05-19T22:16:06.481348+0000 mon.b (mon.0) 272 : audit [INF] from='mgr.4113 172.21.15.63:0/25768' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.9", "id": [4, 3]}]: dispatch
2020-05-19T22:16:07.795+0000 7f4484506700 20 mon.b@0(leader).mgrstat pending_digest: "num_pg_by_state": [ { "state": "active+clean", "num": 296 }, { "state": "peering", "num": 1 } ],
2020-05-19T22:16:07.795+0000 7f4484506700 20 mon.b@0(leader).mgrstat health checks: { "PG_AVAILABILITY": { "severity": "HEALTH_WARN", "summary": { "message": "Reduced data availability: 1 pg inactive, 1 pg peering", "count": 2 }, "detail": [ { "message": "pg 16.9 is stuck peering since forever, current state peering, last acting [3,6]" } ] } }
from osd.3's log:
2020-05-19T22:16:07.394+0000 7f9883185700 10 osd.3 60 _make_pg 16.9
...
2020-05-19T22:16:08.405+0000 7f9883185700 10 osd.3 pg_epoch: 61 pg[16.9( empty local-lis/les=60/61 n=0 ec=57/57 lis/c=60/60 les/c/f=61/61/0 sis=60) [3,6] r=0 lpr=60 crt=0'0 mlcod 0'0 active+clean] share_pg_info
This is happening because we turned the balancer on by default in https://github.com/ceph/ceph/pull/34541 and Casey's observation in https://tracker.ceph.com/issues/45619#note-3 about when we started seeing this issue matches the timeframe. It is safe to whitelist this warning.
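A sketch of the kind of teuthology override involved, assuming the log-whitelist yaml convention the suites used at the time; the exact files and entries touched are in PR 35302:

```yaml
# Hypothetical placement: suppress the transient balancer-induced
# peering warning in the cluster-log check.
overrides:
  ceph:
    log-whitelist:
      - \(PG_AVAILABILITY\)
```

Entries are regular expressions matched against cluster log lines, hence the escaped parentheses around the health-check code.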
#7 Updated by Neha Ojha almost 4 years ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 35302
#8 Updated by Kefu Chai almost 4 years ago
- Status changed from Fix Under Review to Resolved
#9 Updated by Casey Bodley almost 4 years ago
- Copied to Bug #45802: Health check failed: Reduced data availability: PG_AVAILABILITY added