Bug #51282

pybind/mgr/mgr_util: .mgr pool may be created too early causing spurious PG_DEGRADED warnings

Added by Patrick Donnelly over 1 year ago. Updated 5 months ago.

Status: New
Priority: Normal
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-16T22:22:43.040+0000 7f6e8e779700 20 mon.a@0(leader).mgrstat health checks:
{   
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded",
            "count": 1
        },
        "detail": [
            {   
                "message": "pg 1.0 is active+undersized+degraded, acting [7]" 
            }
        ]
    }
}

From: /ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175605/remote/smithi120/log/ceph-mon.a.log.gz

and a few other tests:

Failure: "2021-06-16T22:22:43.881363+0000 mon.a (mon.0) 143 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log
7 jobs: ['6175605', '6175619', '6175556', '6175591', '6175671', '6175600', '6175639']
suites intersection: ['conf/{client', 'mds', 'mon', 'osd}', 'overrides/{frag_enable', 'whitelist_health', 'whitelist_wrongly_marked_down}']
suites union: ['clusters/1a11s-mds-1c-client-3node', 'clusters/1a3s-mds-1c-client', 'conf/{client', 'distro/{centos_8}', 'distro/{rhel_8}', 'distro/{ubuntu_latest}', 'fs/snaps/{begin', 'fs/workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/fuse', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-comp', 'objectstore-ec/bluestore-comp-ec-root', 'omap_limit/10', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{distro/testing/{flavor/ubuntu_latest', 'overrides/{frag_enable', 'ranks/1', 'ranks/3', 'ranks/5', 'scrub/no', 'scrub/yes', 'session_timeout', 'standby-replay', 'tasks/workunit/snaps}', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}', 'workunit/fs/test_o_trunc}', 'workunit/suites/ffsb}', 'workunit/suites/fsstress}', 'workunit/suites/iogen}', 'workunit/suites/iozone}', 'wsync/{no}}', 'wsync/{yes}}']

I'm thinking this check is not quite right: https://github.com/ceph/ceph/blob/05d7f883a04d230cf17b40a1c7e8044d402c6a30/src/pybind/mgr/mgr_module.py#L972-L981

(I lifted that code from the devicehealth module.)
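
For reference, a minimal sketch of the kind of readiness check in question is below. This is not the actual code behind that link; have_enough_osds, min_size and the osdmap handling are illustrative assumptions. The weakness it is meant to show: counting up+in OSDs in the osdmap says nothing about whether CRUSH can actually place all replicas of the new pool's PG.

# Hypothetical sketch only, NOT the actual mgr_module.py code at the link
# above. Names (have_enough_osds, min_size) are illustrative.
from typing import Any, Dict


def have_enough_osds(osdmap_dump: Dict[str, Any], min_size: int = 2) -> bool:
    """Return True once at least `min_size` OSDs are both up and in.

    `osdmap_dump` is assumed to be the dict a mgr module gets from
    `self.get_osdmap().dump()`; `min_size` stands in for the pool size.
    """
    up_in = [o for o in osdmap_dump.get('osds', [])
             if o.get('up', 0) and o.get('in', 0)]
    return len(up_in) >= min_size

A check of this shape can pass (enough OSDs are up and in) while CRUSH is still unable to pick a full acting set, which is exactly the failure pattern described below.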


Related issues

Duplicated by rgw - Bug #51727: cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log Closed
Duplicated by CephFS - Bug #55825: cluster [WRN] Health check failed: Degraded data redundancy: 1 pg degraded (PG_DEGRADED)" in cluster Duplicate

History

#1 Updated by Brad Hubbard over 1 year ago

  • Subject changed from pybind/mgr/mgr_util: .mgr pool may be created to early causing spurious PG_DEGRADED warnings to pybind/mgr/mgr_util: .mgr pool may be created too early causing spurious PG_DEGRADED warnings

#2 Updated by Brad Hubbard over 1 year ago

  • Project changed from Ceph to mgr

#3 Updated by Neha Ojha over 1 year ago

2021-06-16T22:22:36.116+0000 7f6e8bf74700  0 log_channel(cluster) log [DBG] : osdmap e18: 12 total, 4 up, 12 in
...
2021-06-16T22:22:37.144+0000 7f6e90f7e700 10 mon.a@0(leader).osd e18 update_pending_pgs  pg 1.0 just added,  up [7] p 7 acting [7] p 7 history ec=19/19 lis/c=0/0 les/c/f=0/0/0 sis=19 past_intervals ([0,0] all_participants= intervals=)
...
2021-06-16T22:22:43.040+0000 7f6e8e779700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 1.0 is active+undersized+degraded, acting [7]" 
            }
        ]
    }
}

The 4 osds that were up and in:

2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.1 UP [v2:172.21.15.199:6808/20782,v1:172.21.15.199:6809/20782]
2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.4 UP [v2:172.21.15.199:6800/20784,v1:172.21.15.199:6801/20784]
2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.7 UP [v2:172.21.15.199:6816/20783,v1:172.21.15.199:6817/20783]
2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.10 UP [v2:172.21.15.199:6824/20785,v1:172.21.15.199:6825/20785]
2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.1 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.4 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.7 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700  2 mon.a@0(leader).osd e17  osd.10 IN
2021-06-16T22:10:25.464 INFO:teuthology.task.internal:roles: ubuntu@smithi199.front.sepia.ceph.com - ['mon.b', 'mgr.y', 'mds.b', 'mds.e', 'mds.h', 'mds.k', 'osd.1', 'osd.4', 'osd.7', 'osd.10']

Looks like the have_enough_osds check would have passed because we had [12 total, 4 up, 12 in] OSDs when the pool was created, but for some reason we created the PG with only 1 OSD. It could be something to do with how CRUSH is picking placement targets, since all 4 up OSDs are on the same host.
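
To illustrate that CRUSH angle: with the default replicated rule the failure domain is host, so at most one OSD per host can be chosen for a PG. The toy model below is not CRUSH itself; pick_acting and its inputs are made up for illustration, but with all four up OSDs on smithi199 it can only ever return a single replica.

# Toy model of the host failure-domain constraint (not CRUSH itself).
# pick_acting() and its inputs are illustrative only.
from collections import defaultdict
from typing import Dict, List


def pick_acting(up_osds: Dict[int, str], size: int) -> List[int]:
    """Choose at most one up OSD per host, up to `size` replicas.

    `up_osds` maps osd id -> host name (e.g. as reported by `ceph osd tree`).
    """
    by_host: Dict[str, List[int]] = defaultdict(list)
    for osd, host in up_osds.items():
        by_host[host].append(osd)
    # One pick per host; real CRUSH picks pseudo-randomly, here we just
    # take the lowest id per host.
    picks = sorted(min(osds) for osds in by_host.values())
    return picks[:size]


# All four up OSDs from the log live on smithi199:
print(pick_acting({1: 'smithi199', 4: 'smithi199',
                   7: 'smithi199', 10: 'smithi199'}, size=2))
# -> [1]: only one replica is possible, hence active+undersized+degraded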

#4 Updated by Neha Ojha over 1 year ago

/a/sage-2021-06-29_21:27:07-rados-wip-sage3-testing-2021-06-28-1912-distro-basic-smithi/6244042
/a/rfriedma-2021-07-07_11:26:37-rados-wip-ronenf-list-object-distro-basic-smithi/6255195

These were seen in rados/objectstore on a single host, but with similar symptoms. I will try to see if I can reproduce this reliably.

2021-06-29T23:51:12.764+0000 7f5d530cf700  1 mon.a@0(leader).osd e8 e8: 3 total, 2 up, 3 in
2021-06-29T23:51:14.552+0000 7f5d558d4700  1 -- [v2:172.21.15.60:3300/0,v1:172.21.15.60:6789/0] <== mgr.4100 172.21.15.60:0/13831 63 ==== mon_command({"prefix": "osd pool create", "format": "json", "pool": ".mgr", "pg_num": 1, "pg_num_min": 1} v 0) v1 ==== 135+0+0 (secure 0 0 0) 0x5584574a7b00 con 0x5584574cc400
2021-06-29T23:51:15.536+0000 7f5d580d9700 10 mon.a@0(leader).osd e9 update_pending_pgs  pg 1.0 just added,  up [1] p 1 acting [1] p 1 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)
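
For context, the mon_command in the log above is what the mgr sends when it creates the .mgr pool. A minimal sketch of issuing that command from a module follows; MgrModule.check_mon_command is the real mgr API, but the class and method names are illustrative, not the actual creation code, and the sketch only runs inside a ceph-mgr module context.

# Sketch only: mirrors the mon_command payload captured above. The class and
# method names are hypothetical; check_mon_command is the real MgrModule API.
from mgr_module import MgrModule


class ExampleModule(MgrModule):
    MGR_POOL_NAME = '.mgr'

    def create_mgr_pool(self) -> None:
        # Raises on a non-zero mon return code.
        self.check_mon_command({
            'prefix': 'osd pool create',
            'format': 'json',
            'pool': self.MGR_POOL_NAME,
            'pg_num': 1,
            'pg_num_min': 1,
        })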

#5 Updated by Neha Ojha over 1 year ago

  • Duplicated by Bug #51727: cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log added

#6 Updated by Neha Ojha over 1 year ago

Neha Ojha wrote:

/a/sage-2021-06-29_21:27:07-rados-wip-sage3-testing-2021-06-28-1912-distro-basic-smithi/6244042
/a/rfriedma-2021-07-07_11:26:37-rados-wip-ronenf-list-object-distro-basic-smithi/6255195

These were seen in rados/objectstore on a single host, but with similar symptoms. I will try to see if I can reproduce this reliably.

[...]

Fails 2/40 times: https://pulpito.ceph.com/nojha-2021-07-19_17:59:45-rados:objectstore-master-distro-basic-smithi/

#8 Updated by Casey Bodley over 1 year ago

should the rgw suite be whitelisting these too?

#9 Updated by Patrick Donnelly over 1 year ago

Casey Bodley wrote:

should the rgw suite be whitelisting these too?

I think ignorelisting these errors/warnings is a mistake, as these warnings can catch genuine problems in QA testing. Once this issue is fixed, it's likely we'll forget to remove the ignorelist entry.

As annoying as it is, we should let this keep failing QA to give us the proper motivation to fix it.

#10 Updated by Neha Ojha over 1 year ago

  • Assignee set to Neha Ojha
  • Priority changed from High to Urgent

#11 Updated by Neha Ojha over 1 year ago

I reproduced this issue with additional logging using the same reproducer https://tracker.ceph.com/issues/51282#note-6. This time 1/40 jobs failed - https://pulpito.ceph.com/nojha-2021-08-20_22:23:09-rados:objectstore-wip-51282-distro-basic-smithi/

PASSED

2021-08-20T22:40:48.504+0000 7f2006311700  1 mon.a@0(leader).osd e9 update_pending_pgs osd.0 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6804/28252,v1:172.21.15.23:6806/28252] [v2:172.21.15.23:6808/28252,v1:172.21.15.23:6810/28252] exists,up 83dfbd47-d084-43d9-bf49-fdd016340430
osd.1 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6800/28253,v1:172.21.15.23:6801/28253] [v2:172.21.15.23:6802/28253,v1:172.21.15.23:6803/28253] exists,up ebba65bc-0d7b-4a81-98a6-3ac7376ac599
osd.2 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6816/28254,v1:172.21.15.23:6817/28254] [v2:172.21.15.23:6818/28254,v1:172.21.15.23:6819/28254] exists,up 8bbb1841-c1c3-48a5-96eb-b697db0b6d29

2021-08-20T22:40:48.504+0000 7f2006311700 10 mon.a@0(leader).osd e9 update_pending_pgs  pg 1.0 just added,  up [1,0] p 1 acting [1,0] p 1 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)

vs FAILED

2021-08-20T22:46:59.355+0000 7f6209b49700  1 mon.a@0(leader).osd e9 update_pending_pgs osd.0 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.149:6800/28223,v1:172.21.15.149:6801/28223] [v2:172.21.15.149:6802/28223,v1:172.21.15.149:6803/28223] exists,up 9b0f044d-f1bf-474d-b87e-0ebe954c3e7c
osd.1 down in  weight 1 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0)   exists,new a91c812b-d9b3-4b78-a493-d1309fe6ca62
osd.2 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.149:6816/28226,v1:172.21.15.149:6817/28226] [v2:172.21.15.149:6818/28226,v1:172.21.15.149:6819/28226] exists,up ecb81170-2d1e-4c5e-acd3-91d38d254299

2021-08-20T22:46:59.355+0000 7f6209b49700 10 mon.a@0(leader).osd e9 update_pending_pgs  pg 1.0 just added,  up [0] p 0 acting [0] p 0 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)

Looks like when pg 1.0 was added, even though there were 2 candidate OSDs (osd.0 and osd.2) present in the osdmap, pg_to_up_acting_osds chose only 1 OSD, and we then start seeing PG_DEGRADED warnings once a couple of objects are written to the PG. Need to check whether anything has changed in this logic in master.

{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 1.0 is active+undersized+degraded, acting [0]" 
            }
        ]
    }
}
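
As a sanity check on the numbers in that health output, here is the degraded-object arithmetic worked through. The pool size is not stated in the log, so the example assumes a replicated pool of size 2 holding 2 objects, which is consistent with the "2/4 objects degraded (50.000%)" figure.

# Worked example of the degraded-object arithmetic, assuming pool size 2 and
# 2 objects (an assumption consistent with "2/4 objects degraded (50.000%)").
def degraded(num_objects: int, pool_size: int, acting_len: int):
    expected = num_objects * pool_size                    # copies wanted
    present = num_objects * min(acting_len, pool_size)    # copies available
    missing = expected - present
    return missing, expected, 100.0 * missing / expected


print(degraded(num_objects=2, pool_size=2, acting_len=1))
# -> (2, 4, 50.0), i.e. "2/4 objects degraded (50.000%)" with acting [0]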

#12 Updated by Neha Ojha over 1 year ago

Couple of updates:

1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.

2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.

In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?

#13 Updated by Patrick Donnelly over 1 year ago

Neha Ojha wrote:

Couple of updates:

1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.

2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.

Well, in Pacific the .mgr pool does not exist. The device_health_metrics pool should generate the same warning though. There may be a subtle difference?

In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?

https://pulpito.ceph.com/pdonnell-2021-08-27_17:21:13-fs-wip-51282-revert-distro-basic-smithi/

#14 Updated by Neha Ojha over 1 year ago

Patrick Donnelly wrote:

Neha Ojha wrote:

Couple of updates:

1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.

2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.

Well, in Pacific the .mgr pool does not exist. The device_health_metrics pool should generate the same warning though. There may be a subtle difference?

In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?

https://pulpito.ceph.com/pdonnell-2021-08-27_17:21:13-fs-wip-51282-revert-distro-basic-smithi/

Comparing with the earlier run https://pulpito.ceph.com/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/, there are fewer instances (2, to be exact) of PG_DEGRADED, and these are not during pool creation but are related to recovery, where these warnings are expected.

2021-08-27T19:36:02.842+0000 7f179e14c700 10 mon.a@0(leader).osd e19 update_pending_pgs  pg 1.0 just added,  up [7,10] p 7 acting [7,10] p 7 history ec=20/20 lis/c=0/0 les/c/f=0/0/0 sis=20 past_intervals ([0,0] all_participants= intervals=)

#15 Updated by Deepika Upadhyay over 1 year ago

@Neha, hit this one in the RBD suite:

description: rbd/mirror/{base/install clients/{mirror-extra mirror} cluster/{2-node
  openstack} msgr-failures/few objectstore/filestore-xfs supported-random-distro$/{rhel_8}
  workloads/rbd-mirror-workunit-min-compat-client-octopus}
duration: 2079.4988112449646
failure_reason: '"2021-11-02T14:27:49.280378+0000 mon.a (mon.0) 114 : cluster [WRN]
  Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1
  pg degraded (PG_DEGRADED)" in cluster log'

/ceph/teuthology-archive/ideepika-2021-11-02_12:33:30-rbd-wip-ssd-cache-testing-distro-basic-smithi/6477554/teuthology.log

#16 Updated by Casey Bodley about 1 year ago

Patrick Donnelly wrote:

Casey Bodley wrote:

should the rgw suite be whitelisting these too?

I think ignorelisting these errors/warnings is a mistake, as these warnings can catch genuine problems in QA testing. Once this issue is fixed, it's likely we'll forget to remove the ignorelist entry.

As annoying as it is, we should let this keep failing QA to give us the proper motivation to fix it.

Waiting hasn't helped yet. I'm adding this to rgw's ignorelist so it doesn't show up in Quincy backport testing.

#17 Updated by Sridhar Seshasayee 7 months ago

/a/yuriw-2022-06-29_18:22:37-rados-wip-yuri2-testing-2022-06-29-0820-distro-default-smithi/6906054

#18 Updated by Radoslaw Zarzynski 7 months ago

  • Priority changed from Urgent to Normal

Lowering the priority because: 1) the issue is basically just an irritation, and 2) the reproduction rate is pretty low.

#19 Updated by Kamoltat (Junior) Sirivadhna 6 months ago

/a/yuriw-2022-08-04_11:58:29-rados-wip-yuri3-testing-2022-08-03-0828-pacific-distro-default-smithi/6958108

#21 Updated by Laura Flores 5 months ago

  • Duplicated by Bug #55825: cluster [WRN] Health check failed: Degraded data redundancy: 1 pg degraded (PG_DEGRADED)" in cluster added

#22 Updated by Laura Flores 5 months ago

/a/yuriw-2022-08-22_20:21:58-rados-wip-yuri11-testing-2022-08-22-1005-distro-default-smithi/6986324

Description: rados/monthrash/{ceph clusters/9-mons mon_election/connectivity msgr-failures/few msgr/async objectstore/bluestore-comp-zlib rados supported-random-distro$/{ubuntu_latest} thrashers/sync workloads/rados_5925}

2022-08-22T22:20:05.636 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.659 INFO:teuthology.orchestra.run.smithi050.stdout:2022-08-22T22:17:00.832057+0000 mon.d (mon.6) 246 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)
2022-08-22T22:20:05.660 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-22T22:20:05.660 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.722 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.740 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.815 INFO:teuthology.orchestra.run.smithi050.stdout:2022-08-22T22:17:00.832057+0000 mon.d (mon.6) 246 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)

Passed in a rerun.

#23 Updated by Laura Flores 5 months ago

  • Tags set to test-failure

#24 Updated by Radoslaw Zarzynski 5 months ago

  • Assignee changed from Neha Ojha to Nitzan Mordechai

We would need to judge whether it's expected. If so, we can extend the ignorelist.
