Bug #51282
pybind/mgr/mgr_util: .mgr pool may be created too early causing spurious PG_DEGRADED warnings
Description
2021-06-16T22:22:43.040+0000 7f6e8e779700 20 mon.a@0(leader).mgrstat health checks: { "PG_DEGRADED": { "severity": "HEALTH_WARN", "summary": { "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded", "count": 1 }, "detail": [ { "message": "pg 1.0 is active+undersized+degraded, acting [7]" } ] } }
From: /ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175605/remote/smithi120/log/ceph-mon.a.log.gz
and a few other tests:
Failure: "2021-06-16T22:22:43.881363+0000 mon.a (mon.0) 143 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log 7 jobs: ['6175605', '6175619', '6175556', '6175591', '6175671', '6175600', '6175639'] suites intersection: ['conf/{client', 'mds', 'mon', 'osd}', 'overrides/{frag_enable', 'whitelist_health', 'whitelist_wrongly_marked_down}'] suites union: ['clusters/1a11s-mds-1c-client-3node', 'clusters/1a3s-mds-1c-client', 'conf/{client', 'distro/{centos_8}', 'distro/{rhel_8}', 'distro/{ubuntu_latest}', 'fs/snaps/{begin', 'fs/workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/fuse', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-comp', 'objectstore-ec/bluestore-comp-ec-root', 'omap_limit/10', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{distro/testing/{flavor/ubuntu_latest', 'overrides/{frag_enable', 'ranks/1', 'ranks/3', 'ranks/5', 'scrub/no', 'scrub/yes', 'session_timeout', 'standby-replay', 'tasks/workunit/snaps}', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}', 'workunit/fs/test_o_trunc}', 'workunit/suites/ffsb}', 'workunit/suites/fsstress}', 'workunit/suites/iogen}', 'workunit/suites/iozone}', 'wsync/{no}}', 'wsync/{yes}}']
I'm thinking this check is not quite right: https://github.com/ceph/ceph/blob/05d7f883a04d230cf17b40a1c7e8044d402c6a30/src/pybind/mgr/mgr_module.py#L972-L981
(I lifted that code from the devicehealth module.)
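For context, the linked check is essentially a guard that delays creating the .mgr pool until the cluster has enough usable OSDs. Below is a minimal Python sketch of that kind of guard, assuming the osdmap dict a mgr module gets from self.get('osd_map'); the function name, signature, and threshold handling are illustrative, not the actual mgr_module.py code.

```python
# Hedged sketch only: a purely numeric "enough OSDs up and in" guard,
# loosely modeled on the check referenced above. Names and threshold
# are assumptions for illustration, not the real implementation.
from typing import Any, Dict


def have_enough_osds(osd_map: Dict[str, Any], required: int) -> bool:
    """Return True once at least `required` OSDs are both up and in.

    `osd_map` is assumed to look like the dict a mgr module receives
    from self.get('osd_map'): an 'osds' list whose entries carry the
    integer 'up' and 'in' flags.
    """
    up_and_in = sum(1 for osd in osd_map.get('osds', [])
                    if osd.get('up') and osd.get('in'))
    return up_and_in >= required
```

The failures below suggest a purely numeric guard like this is not sufficient: the up/in counts can look fine while CRUSH still cannot place a full acting set, so the freshly created pg comes up undersized and early writes are immediately reported as degraded.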
History
#1 Updated by Brad Hubbard over 1 year ago
- Subject changed from pybind/mgr/mgr_util: .mgr pool may be created to early causing spurious PG_DEGRADED warnings to pybind/mgr/mgr_util: .mgr pool may be created too early causing spurious PG_DEGRADED warnings
#2 Updated by Brad Hubbard over 1 year ago
- Project changed from Ceph to mgr
#3 Updated by Neha Ojha over 1 year ago
2021-06-16T22:22:36.116+0000 7f6e8bf74700 0 log_channel(cluster) log [DBG] : osdmap e18: 12 total, 4 up, 12 in
...
2021-06-16T22:22:37.144+0000 7f6e90f7e700 10 mon.a@0(leader).osd e18 update_pending_pgs pg 1.0 just added, up [7] p 7 acting [7] p 7 history ec=19/19 lis/c=0/0 les/c/f=0/0/0 sis=19 past_intervals ([0,0] all_participants= intervals=)
...
2021-06-16T22:22:43.040+0000 7f6e8e779700 20 mon.a@0(leader).mgrstat health checks: { "PG_DEGRADED": { "severity": "HEALTH_WARN", "summary": { "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded", "count": 1 }, "detail": [ { "message": "pg 1.0 is active+undersized+degraded, acting [7]" } ] } }
The 4 osds that were up and in:
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.1 UP [v2:172.21.15.199:6808/20782,v1:172.21.15.199:6809/20782]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.4 UP [v2:172.21.15.199:6800/20784,v1:172.21.15.199:6801/20784]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.7 UP [v2:172.21.15.199:6816/20783,v1:172.21.15.199:6817/20783]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.10 UP [v2:172.21.15.199:6824/20785,v1:172.21.15.199:6825/20785]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.1 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.4 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.7 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.10 IN
2021-06-16T22:10:25.464 INFO:teuthology.task.internal:roles: ubuntu@smithi199.front.sepia.ceph.com - ['mon.b', 'mgr.y', 'mds.b', 'mds.e', 'mds.h', 'mds.k', 'osd.1', 'osd.4', 'osd.7', 'osd.10']
Looks like the have_enough_osds check would have passed because we had [4 up, 12 in] OSDs when the pool was created, but for some reason we created the PG with only 1 OSD. It could be something to do with how CRUSH is picking placement targets, since all 4 of those up OSDs are on the same host.
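If the root cause is really that all of the up OSDs share one host (so a host-level CRUSH failure domain can only choose one of them), a stricter guard would also need to look at how the up OSDs are spread across hosts. A hedged sketch of that idea follows, assuming the same osd_map-style input plus an OSD-to-host mapping supplied by the caller; both are illustrative, not an existing mgr API.

```python
from typing import Any, Dict, Mapping


def hosts_with_up_osds(osd_map: Dict[str, Any],
                       osd_to_host: Mapping[int, str]) -> int:
    """Count distinct hosts that have at least one up+in OSD.

    With a replicated pool of size N and a host failure domain, CRUSH
    needs at least N such hosts to fill the acting set; fewer than that
    and a newly created PG comes up undersized/degraded.
    """
    hosts = set()
    for osd in osd_map.get('osds', []):
        if osd.get('up') and osd.get('in') and osd['osd'] in osd_to_host:
            hosts.add(osd_to_host[osd['osd']])
    return len(hosts)
```

In the log above only smithi199's OSDs (1, 4, 7, 10) were up, so a check along these lines would report a single host and could justify waiting before creating the pool, even though the raw up count satisfied a numeric threshold.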
#4 Updated by Neha Ojha over 1 year ago
/a/sage-2021-06-29_21:27:07-rados-wip-sage3-testing-2021-06-28-1912-distro-basic-smithi/6244042
/a/rfriedma-2021-07-07_11:26:37-rados-wip-ronenf-list-object-distro-basic-smithi/6255195
These were seen in rados/objectstore on a single host but similar symptoms. I will try to see if I can reproduce this reliably.
2021-06-29T23:51:12.764+0000 7f5d530cf700 1 mon.a@0(leader).osd e8 e8: 3 total, 2 up, 3 in
2021-06-29T23:51:14.552+0000 7f5d558d4700 1 -- [v2:172.21.15.60:3300/0,v1:172.21.15.60:6789/0] <== mgr.4100 172.21.15.60:0/13831 63 ==== mon_command({"prefix": "osd pool create", "format": "json", "pool": ".mgr", "pg_num": 1, "pg_num_min": 1} v 0) v1 ==== 135+0+0 (secure 0 0 0) 0x5584574a7b00 con 0x5584574cc400
2021-06-29T23:51:15.536+0000 7f5d580d9700 10 mon.a@0(leader).osd e9 update_pending_pgs pg 1.0 just added, up [1] p 1 acting [1] p 1 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)
#5 Updated by Neha Ojha over 1 year ago
- Duplicated by Bug #51727: cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log added
#6 Updated by Neha Ojha over 1 year ago
Neha Ojha wrote:
/a/sage-2021-06-29_21:27:07-rados-wip-sage3-testing-2021-06-28-1912-distro-basic-smithi/6244042
/a/rfriedma-2021-07-07_11:26:37-rados-wip-ronenf-list-object-distro-basic-smithi/6255195
These were seen in rados/objectstore on a single host but similar symptoms. I will try to see if I can reproduce this reliably.
[...]
Fails 2/40 times https://pulpito.ceph.com/nojha-2021-07-19_17:59:45-rados:objectstore-master-distro-basic-smithi/
#7 Updated by Deepika Upadhyay over 1 year ago
lot of these instances in: https://pulpito.ceph.com/ideepika-2021-08-03_21:18:59-rbd-wip-rbd-update-feature-distro-basic-smithi/
#8 Updated by Casey Bodley over 1 year ago
should the rgw suite be whitelisting these too?
#9 Updated by Patrick Donnelly over 1 year ago
Casey Bodley wrote:
should the rgw suite be whitelisting these too?
I think ignorelisting these errors/warnings is a mistake, as they can catch genuine problems in QA testing. Once this issue is fixed, it's likely we'll forget to remove the ignorelist entry.
As annoying as it is, we should let this keep failing QA to give us the proper motivation to fix it.
#10 Updated by Neha Ojha over 1 year ago
- Assignee set to Neha Ojha
- Priority changed from High to Urgent
#11 Updated by Neha Ojha over 1 year ago
I reproduced this issue with additional logging using the same reproducer https://tracker.ceph.com/issues/51282#note-6. This time 1/40 jobs failed - https://pulpito.ceph.com/nojha-2021-08-20_22:23:09-rados:objectstore-wip-51282-distro-basic-smithi/
PASSED
2021-08-20T22:40:48.504+0000 7f2006311700 1 mon.a@0(leader).osd e9 update_pending_pgs
osd.0 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6804/28252,v1:172.21.15.23:6806/28252] [v2:172.21.15.23:6808/28252,v1:172.21.15.23:6810/28252] exists,up 83dfbd47-d084-43d9-bf49-fdd016340430
osd.1 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6800/28253,v1:172.21.15.23:6801/28253] [v2:172.21.15.23:6802/28253,v1:172.21.15.23:6803/28253] exists,up ebba65bc-0d7b-4a81-98a6-3ac7376ac599
osd.2 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6816/28254,v1:172.21.15.23:6817/28254] [v2:172.21.15.23:6818/28254,v1:172.21.15.23:6819/28254] exists,up 8bbb1841-c1c3-48a5-96eb-b697db0b6d29
2021-08-20T22:40:48.504+0000 7f2006311700 10 mon.a@0(leader).osd e9 update_pending_pgs pg 1.0 just added, up [1,0] p 1 acting [1,0] p 1 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)
vs FAILED
2021-08-20T22:46:59.355+0000 7f6209b49700 1 mon.a@0(leader).osd e9 update_pending_pgs
osd.0 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.149:6800/28223,v1:172.21.15.149:6801/28223] [v2:172.21.15.149:6802/28223,v1:172.21.15.149:6803/28223] exists,up 9b0f044d-f1bf-474d-b87e-0ebe954c3e7c
osd.1 down in weight 1 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) exists,new a91c812b-d9b3-4b78-a493-d1309fe6ca62
osd.2 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.149:6816/28226,v1:172.21.15.149:6817/28226] [v2:172.21.15.149:6818/28226,v1:172.21.15.149:6819/28226] exists,up ecb81170-2d1e-4c5e-acd3-91d38d254299
2021-08-20T22:46:59.355+0000 7f6209b49700 10 mon.a@0(leader).osd e9 update_pending_pgs pg 1.0 just added, up [0] p 0 acting [0] p 0 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)
Looks like when pg 1.0 was added, even though there were 2 candidate OSDs (osd.0 and osd.2) present in the osdmap, pg_to_up_acting_osds chose only 1 OSD, and we then start seeing PG_DEGRADED warnings once a couple of objects are written to the pg. Need to check whether anything has changed in this logic in master.
{ "PG_DEGRADED": { "severity": "HEALTH_WARN", "summary": { "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded", "count": 1 }, "detail": [ { "message": "pg 1.0 is active+undersized+degraded, acting [0]" } ] } }
#12 Updated by Neha Ojha over 1 year ago
Couple of updates:
1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.
2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.
In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?
#13 Updated by Patrick Donnelly over 1 year ago
Neha Ojha wrote:
Couple of updates:
1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.
2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.
Well, in Pacific the .mgr pool does not exist. The device_health_metrics pool should generate the same warning though. There may be a subtle difference?
In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?
https://pulpito.ceph.com/pdonnell-2021-08-27_17:21:13-fs-wip-51282-revert-distro-basic-smithi/
#14 Updated by Neha Ojha over 1 year ago
Patrick Donnelly wrote:
Neha Ojha wrote:
Couple of updates:
1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.
2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.
Well, in Pacific the .mgr pool does not exist. The device_health_metrics pool should generate the same warning though. There may be a subtle difference?
In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?
https://pulpito.ceph.com/pdonnell-2021-08-27_17:21:13-fs-wip-51282-revert-distro-basic-smithi/
Comparing with the earlier run https://pulpito.ceph.com/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/, there are fewer instances (2, to be exact) of PG_DEGRADED, and they occur not during pool creation but during recovery, where these warnings are expected.
2021-08-27T19:36:02.842+0000 7f179e14c700 10 mon.a@0(leader).osd e19 update_pending_pgs pg 1.0 just added, up [7,10] p 7 acting [7,10] p 7 history ec=20/20 lis/c=0/0 les/c/f=0/0/0 sis=20 past_intervals ([0,0] all_participants= intervals=)
#15 Updated by Deepika Upadhyay over 1 year ago
@Neha, hit this one in the RBD suite:
description: rbd/mirror/{base/install clients/{mirror-extra mirror} cluster/{2-node openstack} msgr-failures/few objectstore/filestore-xfs supported-random-distro$/{rhel_8} workloads/rbd-mirror-workunit-min-compat-client-octopus}
duration: 2079.4988112449646
failure_reason: '"2021-11-02T14:27:49.280378+0000 mon.a (mon.0) 114 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log'
/ceph/teuthology-archive/ideepika-2021-11-02_12:33:30-rbd-wip-ssd-cache-testing-distro-basic-smithi/6477554/teuthology.log
#16 Updated by Casey Bodley about 1 year ago
Patrick Donnelly wrote:
Casey Bodley wrote:
should the rgw suite be whitelisting these too?
I think ignorelisting these errors/warnings is a mistake, as they can catch genuine problems in QA testing. Once this issue is fixed, it's likely we'll forget to remove the ignorelist entry.
As annoying as it is, we should let this keep failing QA to give us the proper motivation to fix it.
Waiting hasn't helped yet. I'm adding this to rgw's ignorelist so it doesn't show up in quincy backport testing.
#17 Updated by Sridhar Seshasayee 7 months ago
/a/yuriw-2022-06-29_18:22:37-rados-wip-yuri2-testing-2022-06-29-0820-distro-default-smithi/6906054
#18 Updated by Radoslaw Zarzynski 7 months ago
- Priority changed from Urgent to Normal
Lowering the priority because: 1) the issue is basically just an annoyance, and 2) the reproduction rate is pretty low.
#19 Updated by Kamoltat (Junior) Sirivadhna 6 months ago
/a/yuriw-2022-08-04_11:58:29-rados-wip-yuri3-testing-2022-08-03-0828-pacific-distro-default-smithi/6958108
#20 Updated by Kotresh Hiremath Ravishankar 6 months ago
#21 Updated by Laura Flores 5 months ago
- Duplicated by Bug #55825: cluster [WRN] Health check failed: Degraded data redundancy: 1 pg degraded (PG_DEGRADED)" in cluster added
#22 Updated by Laura Flores 5 months ago
/a/yuriw-2022-08-22_20:21:58-rados-wip-yuri11-testing-2022-08-22-1005-distro-default-smithi/6986324
Description: rados/monthrash/{ceph clusters/9-mons mon_election/connectivity msgr-failures/few msgr/async objectstore/bluestore-comp-zlib rados supported-random-distro$/{ubuntu_latest} thrashers/sync workloads/rados_5925}
2022-08-22T22:20:05.636 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.659 INFO:teuthology.orchestra.run.smithi050.stdout:2022-08-22T22:17:00.832057+0000 mon.d (mon.6) 246 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)
2022-08-22T22:20:05.660 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-22T22:20:05.660 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.722 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.740 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.815 INFO:teuthology.orchestra.run.smithi050.stdout:2022-08-22T22:17:00.832057+0000 mon.d (mon.6) 246 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)
Passed in a rerun.
#23 Updated by Laura Flores 5 months ago
- Tags set to test-failure
#24 Updated by Radoslaw Zarzynski 5 months ago
- Assignee changed from Neha Ojha to Nitzan Mordechai
We would need to judge whether it's expected. If so, we can extend the ignorelist.