Bug #45619

Health check failed: Reduced data availability: PG_AVAILABILITY

Added by Yuval Lifshitz about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee: Neha Ojha
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID: 35302
Crash signature (v1):
Crash signature (v2):

Description

Multiple RGW tests are failing on different branches, with:

failure_reason: '"2020-05-19T22:16:08.390058+0000 mon.b (mon.0) 275 : cluster [WRN]
  Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)" 
  in cluster log'

see:
http://pulpito.front.sepia.ceph.com/yuvalif-2020-05-19_14:52:46-rgw:verify-fix-amqp-urls-with-vhosts-distro-basic-smithi/
http://pulpito.front.sepia.ceph.com/yuvalif-2020-05-19_16:05:16-rgw:verify-fix_issue_45586-distro-basic-smithi/
http://pulpito.front.sepia.ceph.com/yuvalif-2020-05-19_16:12:48-rgw:verify-cls_2pc_q_async_api-distro-basic-smithi/


Related issues (1): 0 open, 1 closed

Copied to RADOS - Bug #45802: Health check failed: Reduced data availability: PG_AVAILABILITY (Resolved; assignee: Neha Ojha)

Actions #1

Updated by Neha Ojha about 4 years ago

  • Status changed from New to Need More Info

Is this something that has started appearing recently? If not, probably just needs whitelisting.

Actions #2

Updated by Kefu Chai about 4 years ago

It's a new thing. Also, before whitelisting things, we'd better figure out why we should whitelist it.

Actions #3

Updated by Casey Bodley about 4 years ago

We've been seeing a lot of this in the rgw suite over the last month or two.

Actions #4

Updated by Neha Ojha about 4 years ago

  • Status changed from Need More Info to New
  • Priority changed from Normal to High

Seen in the rados suite: /a/nojha-2020-05-21_19:33:40-rados-wip-32601-distro-basic-smithi/5077159/

Actions #5

Updated by Neha Ojha almost 4 years ago

  • Subject changed from teuthology tests are failing with: "Health check failed" to Health check failed: Reduced data availability: PG_AVAILABILITY

Neha Ojha wrote:

Seen in the rados suite: /a/nojha-2020-05-21_19:33:40-rados-wip-32601-distro-basic-smithi/5077159/

Here the PG_AVAILABILITY warning is raised because the balancer is remapping pg 16.4.

from mgr.x's log:

2020-05-21T22:25:16.913+0000 7fa6a4f58700  0 [balancer DEBUG root] Waking up [active, now 2020-05-21_22:25:16]
2020-05-21T22:25:16.917+0000 7fa6a4f58700  0 [balancer INFO root] ceph osd pg-upmap-items 16.4 mappings [{'from': 4, 'to': 3}]
2020-05-21T22:25:16.917+0000 7fa6a4f58700  1 -- 172.21.15.188:0/12583 --> [v2:172.21.15.188:3300/0,v1:172.21.15.188:6789/0] -- mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.4", "id": [4, 3]} v 0) v1 -- 0x55a7faf62580 con 0x55a8002d6c00
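For reference, the same remap the balancer issued here can be reproduced by hand on a test cluster with the standard `ceph` CLI (a sketch only; the pg id and osd ids are taken from the log line above):

```shell
# Apply the same explicit mapping the balancer sent: pg 16.4, move from osd.4 to osd.3.
ceph osd pg-upmap-items 16.4 4 3

# The pg briefly goes through peering; watch it return to active+clean.
ceph pg 16.4 query | grep '"state"'

# Remove the explicit mapping again when done.
ceph osd rm-pg-upmap-items 16.4
```

This requires a live cluster, so it is an ops sketch rather than a standalone script.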

from mon.a's log:

All the PGs were active+clean until here.

2020-05-21T22:25:16.931+0000 7f1540539700 20 mon.a@0(leader).mgrstat pending_digest:
    "num_pg_by_state": [
        {
            "state": "active+clean",
            "num": 41
        }
    ],

This is where we see PG_AVAILABILITY:

2020-05-21T22:25:18.931+0000 7f1540539700 20 mon.a@0(leader).mgrstat pending_digest:
    "num_pg_by_state": [
        {
            "state": "active+clean",
            "num": 40
        },
        {
            "state": "peering",
            "num": 1
        }
    ],
.
.

2020-05-21T22:25:18.931+0000 7f1540539700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg inactive, 1 pg peering",
            "count": 2
        },
        "detail": [
            {
                "message": "pg 16.4 is stuck peering since forever, current state peering, last acting [3,5]"
            }
        ]
    }
}

and what we see in between is the pg-upmap-items change being applied:

2020-05-21T22:25:17.087+0000 7f153ed36700  7 mon.a@0(leader).log v90 update_from_paxos applying incremental log 90 2020-05-21T22:25:16.920845+0000 mon.b (mon.1) 94 : audit [INF] from='mgr.4102 172.21.15.188:0/12583' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.4", "id": [4, 3]}]: dispatch

From osd.3's perspective, we create the pg, finish peering, and then go active+clean.

2020-05-21T22:25:17.119+0000 7ff5beecd700 10 osd.3 66 _make_pg 16.4
.
.
2020-05-21T22:25:18.223+0000 7ff5beecd700 10 osd.3 pg_epoch: 67 pg[16.4( empty local-lis/les=66/67 n=0 ec=63/63 lis/c=66/66 les/c/f=67/67/0 sis=66) [3,5] r=0 lpr=66 crt=0'0 mlcod 0'0 active+clean] share_pg_info

The later health checks report all PGs active+clean.

2020-05-21T22:25:24.935+0000 7f1540539700 20 mon.a@0(leader).mgrstat pending_digest:
    "num_pg_by_state": [
        {
            "state": "active+clean",
            "num": 41
        }
    ],

This warning is expected when the pg is temporarily unavailable. I'm pretty sure we'll see similar symptoms in the rgw tests, but I will look at the logs and confirm.
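The pattern above (all PGs active+clean, a brief peering blip, then active+clean again) can be checked mechanically. A minimal sketch, not part of any Ceph tooling, that classifies a sequence of `num_pg_by_state` digests like the mon log lines above as a transient blip:

```python
def is_transient_blip(digests):
    """Return True if the first and last digests are fully active+clean,
    i.e. any peering/inactive states appear only in the middle of the
    sequence. Each digest is a list of {"state", "num"} dicts, matching
    the mon's pending_digest log lines."""
    def all_clean(digest):
        return all(entry["state"] == "active+clean" for entry in digest)
    return len(digests) >= 2 and all_clean(digests[0]) and all_clean(digests[-1])

# The three digests from the mon.a log above:
seq = [
    [{"state": "active+clean", "num": 41}],
    [{"state": "active+clean", "num": 40}, {"state": "peering", "num": 1}],
    [{"state": "active+clean", "num": 41}],
]
print(is_transient_blip(seq))  # → True
```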

Actions #6

Updated by Neha Ojha almost 4 years ago

  • Status changed from New to Triaged
  • Assignee set to Neha Ojha

http://pulpito.front.sepia.ceph.com/yuvalif-2020-05-19_14:52:46-rgw:verify-fix-amqp-urls-with-vhosts-distro-basic-smithi/5069884/

Same symptoms as https://tracker.ceph.com/issues/45619#note-5.

from mgr.x's log:

2020-05-19T22:16:06.479+0000 7f9f3e096700  0 [balancer INFO root] ceph osd pg-upmap-items 16.9 mappings [{'from': 4, 'to': 3}]
2020-05-19T22:16:06.479+0000 7f9f3e096700  1 -- 172.21.15.63:0/25768 --> [v2:172.21.15.63:3300/0,v1:172.21.15.63:6789/0] -- mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.9", "id": [4, 3]} v 0) v1 -- 0x562f63429680 con 0x562f68734800

from mon.b's log:

2020-05-19T22:16:05.794+0000 7f4484506700 20 mon.b@0(leader).mgrstat pending_digest:
    "num_pg_by_state": [
        {
            "state": "active+clean",
            "num": 297
        }
    ],
2020-05-19T22:16:07.386+0000 7f4482d03700  7 mon.b@0(leader).log v93 update_from_paxos applying incremental log 93 2020-05-19T22:16:06.481348+0000 mon.b (mon.0) 272 : audit [INF] from='mgr.4113 172.21.15.63:0/25768' entity='mgr.x' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "16.9", "id": [4, 3]}]: dispatch

2020-05-19T22:16:07.795+0000 7f4484506700 20 mon.b@0(leader).mgrstat pending_digest:
    "num_pg_by_state": [
        {
            "state": "active+clean",
            "num": 296
        },
        {
            "state": "peering",
            "num": 1
        }
    ],
2020-05-19T22:16:07.795+0000 7f4484506700 20 mon.b@0(leader).mgrstat health checks:
{
    "PG_AVAILABILITY": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Reduced data availability: 1 pg inactive, 1 pg peering",
            "count": 2
        },
        "detail": [
            {
                "message": "pg 16.9 is stuck peering since forever, current state peering, last acting [3,6]" 
            }
        ]
    }
}

from osd.3's log:

2020-05-19T22:16:07.394+0000 7f9883185700 10 osd.3 60 _make_pg 16.9
.
.
2020-05-19T22:16:08.405+0000 7f9883185700 10 osd.3 pg_epoch: 61 pg[16.9( empty local-lis/les=60/61 n=0 ec=57/57 lis/c=60/60 les/c/f=61/61/0 sis=60) [3,6] r=0 lpr=60 crt=0'0 mlcod 0'0 active+clean] share_pg_info

This is happening because we turned the balancer on by default in https://github.com/ceph/ceph/pull/34541; Casey's observation in https://tracker.ceph.com/issues/45619#note-3 about when we started seeing this issue matches that timeframe. It is safe to whitelist this warning.
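As a sketch of what whitelisting this warning looks like in a teuthology suite yaml (the actual entries added by the fix live in the pull request; this fragment is illustrative only, using the `log-whitelist` key teuthology used at the time):

```yaml
# Illustrative teuthology override; not the verbatim change from the fix.
overrides:
  ceph:
    log-whitelist:
      - \(PG_AVAILABILITY\)
```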

Actions #7

Updated by Neha Ojha almost 4 years ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 35302

Actions #8

Updated by Kefu Chai almost 4 years ago

  • Status changed from Fix Under Review to Resolved

Actions #9

Updated by Casey Bodley almost 4 years ago

  • Copied to Bug #45802: Health check failed: Reduced data availability: PG_AVAILABILITY added