Bug #51282
pybind/mgr/mgr_util: .mgr pool may be created too early causing spurious PG_DEGRADED warnings
Description
2021-06-16T22:22:43.040+0000 7f6e8e779700 20 mon.a@0(leader).mgrstat health checks: { "PG_DEGRADED": { "severity": "HEALTH_WARN", "summary": { "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded", "count": 1 }, "detail": [ { "message": "pg 1.0 is active+undersized+degraded, acting [7]" } ] } }
From: /ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175605/remote/smithi120/log/ceph-mon.a.log.gz
and a few other tests:
Failure: "2021-06-16T22:22:43.881363+0000 mon.a (mon.0) 143 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log 7 jobs: ['6175605', '6175619', '6175556', '6175591', '6175671', '6175600', '6175639'] suites intersection: ['conf/{client', 'mds', 'mon', 'osd}', 'overrides/{frag_enable', 'whitelist_health', 'whitelist_wrongly_marked_down}'] suites union: ['clusters/1a11s-mds-1c-client-3node', 'clusters/1a3s-mds-1c-client', 'conf/{client', 'distro/{centos_8}', 'distro/{rhel_8}', 'distro/{ubuntu_latest}', 'fs/snaps/{begin', 'fs/workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/fuse', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-comp', 'objectstore-ec/bluestore-comp-ec-root', 'omap_limit/10', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{distro/testing/{flavor/ubuntu_latest', 'overrides/{frag_enable', 'ranks/1', 'ranks/3', 'ranks/5', 'scrub/no', 'scrub/yes', 'session_timeout', 'standby-replay', 'tasks/workunit/snaps}', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}', 'workunit/fs/test_o_trunc}', 'workunit/suites/ffsb}', 'workunit/suites/fsstress}', 'workunit/suites/iogen}', 'workunit/suites/iozone}', 'wsync/{no}}', 'wsync/{yes}}']
I'm thinking this check is not quite right: https://github.com/ceph/ceph/blob/05d7f883a04d230cf17b40a1c7e8044d402c6a30/src/pybind/mgr/mgr_module.py#L972-L981
(I lifted that code from the devicehealth module.)
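For context, the linked check is essentially a guard that delays creating the .mgr pool until the cluster has enough usable OSDs. Below is a minimal Python sketch of that kind of guard, assuming the osdmap dict a mgr module gets from self.get('osd_map'); the function name, signature, and threshold handling are illustrative, not the actual mgr_module.py code.

```python
# Hedged sketch only: a purely numeric "enough OSDs up and in" guard,
# loosely modeled on the check referenced above. Names and threshold
# are assumptions for illustration, not the real implementation.
from typing import Any, Dict


def have_enough_osds(osd_map: Dict[str, Any], required: int) -> bool:
    """Return True once at least `required` OSDs are both up and in.

    `osd_map` is assumed to look like the dict a mgr module receives
    from self.get('osd_map'): an 'osds' list whose entries carry the
    integer 'up' and 'in' flags.
    """
    up_and_in = sum(1 for osd in osd_map.get('osds', [])
                    if osd.get('up') and osd.get('in'))
    return up_and_in >= required
```

The failures below suggest a purely numeric guard like this is not sufficient: the up/in counts can look fine while CRUSH still cannot place a full acting set, so the freshly created pg comes up undersized and early writes are immediately reported as degraded.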
History
#1 Updated by Brad Hubbard over 1 year ago
- Subject changed from pybind/mgr/mgr_util: .mgr pool may be created to early causing spurious PG_DEGRADED warnings to pybind/mgr/mgr_util: .mgr pool may be created too early causing spurious PG_DEGRADED warnings
#2 Updated by Brad Hubbard over 1 year ago
- Project changed from Ceph to mgr
#3 Updated by Neha Ojha over 1 year ago
2021-06-16T22:22:36.116+0000 7f6e8bf74700 0 log_channel(cluster) log [DBG] : osdmap e18: 12 total, 4 up, 12 in
...
2021-06-16T22:22:37.144+0000 7f6e90f7e700 10 mon.a@0(leader).osd e18 update_pending_pgs pg 1.0 just added, up [7] p 7 acting [7] p 7 history ec=19/19 lis/c=0/0 les/c/f=0/0/0 sis=19 past_intervals ([0,0] all_participants= intervals=)
...
2021-06-16T22:22:43.040+0000 7f6e8e779700 20 mon.a@0(leader).mgrstat health checks: { "PG_DEGRADED": { "severity": "HEALTH_WARN", "summary": { "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded", "count": 1 }, "detail": [ { "message": "pg 1.0 is active+undersized+degraded, acting [7]" } ] } }
The 4 osds that were up and in:
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.1 UP [v2:172.21.15.199:6808/20782,v1:172.21.15.199:6809/20782]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.4 UP [v2:172.21.15.199:6800/20784,v1:172.21.15.199:6801/20784]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.7 UP [v2:172.21.15.199:6816/20783,v1:172.21.15.199:6817/20783]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.10 UP [v2:172.21.15.199:6824/20785,v1:172.21.15.199:6825/20785]
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.1 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.4 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.7 IN
2021-06-16T22:22:36.112+0000 7f6e90f7e700 2 mon.a@0(leader).osd e17 osd.10 IN
2021-06-16T22:10:25.464 INFO:teuthology.task.internal:roles: ubuntu@smithi199.front.sepia.ceph.com - ['mon.b', 'mgr.y', 'mds.b', 'mds.e', 'mds.h', 'mds.k', 'osd.1', 'osd.4', 'osd.7', 'osd.10']
Looks like the have_enough_osds check would have passed because we had [4 up, 12 in] OSDs when the pool was created, but for some reason we created the PG with only 1 OSD. It could be something to do with how CRUSH is picking placement targets, since all 4 of those up OSDs are on the same host.
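If the root cause is really that all of the up OSDs share one host (so a host-level CRUSH failure domain can only choose one of them), a stricter guard would also need to look at how the up OSDs are spread across hosts. A hedged sketch of that idea follows, assuming the same osd_map-style input plus an OSD-to-host mapping supplied by the caller; both are illustrative, not an existing mgr API.

```python
from typing import Any, Dict, Mapping


def hosts_with_up_osds(osd_map: Dict[str, Any],
                       osd_to_host: Mapping[int, str]) -> int:
    """Count distinct hosts that have at least one up+in OSD.

    With a replicated pool of size N and a host failure domain, CRUSH
    needs at least N such hosts to fill the acting set; fewer than that
    and a newly created PG comes up undersized/degraded.
    """
    hosts = set()
    for osd in osd_map.get('osds', []):
        if osd.get('up') and osd.get('in') and osd['osd'] in osd_to_host:
            hosts.add(osd_to_host[osd['osd']])
    return len(hosts)
```

In the log above only smithi199's OSDs (1, 4, 7, 10) were up, so a check along these lines would report a single host and could justify waiting before creating the pool, even though the raw up count satisfied a numeric threshold.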
#4 Updated by Neha Ojha over 1 year ago
/a/sage-2021-06-29_21:27:07-rados-wip-sage3-testing-2021-06-28-1912-distro-basic-smithi/6244042
/a/rfriedma-2021-07-07_11:26:37-rados-wip-ronenf-list-object-distro-basic-smithi/6255195
These were seen in rados/objectstore on a single host but similar symptoms. I will try to see if I can reproduce this reliably.
2021-06-29T23:51:12.764+0000 7f5d530cf700 1 mon.a@0(leader).osd e8 e8: 3 total, 2 up, 3 in
2021-06-29T23:51:14.552+0000 7f5d558d4700 1 -- [v2:172.21.15.60:3300/0,v1:172.21.15.60:6789/0] <== mgr.4100 172.21.15.60:0/13831 63 ==== mon_command({"prefix": "osd pool create", "format": "json", "pool": ".mgr", "pg_num": 1, "pg_num_min": 1} v 0) v1 ==== 135+0+0 (secure 0 0 0) 0x5584574a7b00 con 0x5584574cc400
2021-06-29T23:51:15.536+0000 7f5d580d9700 10 mon.a@0(leader).osd e9 update_pending_pgs pg 1.0 just added, up [1] p 1 acting [1] p 1 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)
#5 Updated by Neha Ojha over 1 year ago
- Duplicated by Bug #51727: cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log added
#6 Updated by Neha Ojha over 1 year ago
Neha Ojha wrote:
/a/sage-2021-06-29_21:27:07-rados-wip-sage3-testing-2021-06-28-1912-distro-basic-smithi/6244042
/a/rfriedma-2021-07-07_11:26:37-rados-wip-ronenf-list-object-distro-basic-smithi/6255195
These were seen in rados/objectstore on a single host but similar symptoms. I will try to see if I can reproduce this reliably.
[...]
Fails 2/40 times https://pulpito.ceph.com/nojha-2021-07-19_17:59:45-rados:objectstore-master-distro-basic-smithi/
#7 Updated by Deepika Upadhyay over 1 year ago
lot of these instances in: https://pulpito.ceph.com/ideepika-2021-08-03_21:18:59-rbd-wip-rbd-update-feature-distro-basic-smithi/
#8 Updated by Casey Bodley over 1 year ago
should the rgw suite be whitelisting these too?
#9 Updated by Patrick Donnelly over 1 year ago
Casey Bodley wrote:
should the rgw suite be whitelisting these too?
I think ignorelisting these errors/warnings is a mistake, as they can catch genuine problems in QA testing. Once this issue is fixed, it's likely we'll forget to remove the ignorelist entry.
As annoying as it is, we should let this keep failing QA to give us the proper motivation to fix it.
#10 Updated by Neha Ojha over 1 year ago
- Assignee set to Neha Ojha
- Priority changed from High to Urgent
#11 Updated by Neha Ojha over 1 year ago
I reproduced this issue with additional logging using the same reproducer https://tracker.ceph.com/issues/51282#note-6. This time 1/40 jobs failed - https://pulpito.ceph.com/nojha-2021-08-20_22:23:09-rados:objectstore-wip-51282-distro-basic-smithi/
PASSED
2021-08-20T22:40:48.504+0000 7f2006311700 1 mon.a@0(leader).osd e9 update_pending_pgs
osd.0 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6804/28252,v1:172.21.15.23:6806/28252] [v2:172.21.15.23:6808/28252,v1:172.21.15.23:6810/28252] exists,up 83dfbd47-d084-43d9-bf49-fdd016340430
osd.1 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6800/28253,v1:172.21.15.23:6801/28253] [v2:172.21.15.23:6802/28253,v1:172.21.15.23:6803/28253] exists,up ebba65bc-0d7b-4a81-98a6-3ac7376ac599
osd.2 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.23:6816/28254,v1:172.21.15.23:6817/28254] [v2:172.21.15.23:6818/28254,v1:172.21.15.23:6819/28254] exists,up 8bbb1841-c1c3-48a5-96eb-b697db0b6d29
2021-08-20T22:40:48.504+0000 7f2006311700 10 mon.a@0(leader).osd e9 update_pending_pgs pg 1.0 just added, up [1,0] p 1 acting [1,0] p 1 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)
vs FAILED
2021-08-20T22:46:59.355+0000 7f6209b49700 1 mon.a@0(leader).osd e9 update_pending_pgs
osd.0 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.149:6800/28223,v1:172.21.15.149:6801/28223] [v2:172.21.15.149:6802/28223,v1:172.21.15.149:6803/28223] exists,up 9b0f044d-f1bf-474d-b87e-0ebe954c3e7c
osd.1 down in weight 1 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) exists,new a91c812b-d9b3-4b78-a493-d1309fe6ca62
osd.2 up in weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval [0,0) [v2:172.21.15.149:6816/28226,v1:172.21.15.149:6817/28226] [v2:172.21.15.149:6818/28226,v1:172.21.15.149:6819/28226] exists,up ecb81170-2d1e-4c5e-acd3-91d38d254299
2021-08-20T22:46:59.355+0000 7f6209b49700 10 mon.a@0(leader).osd e9 update_pending_pgs pg 1.0 just added, up [0] p 0 acting [0] p 0 history ec=10/10 lis/c=0/0 les/c/f=0/0/0 sis=10 past_intervals ([0,0] all_participants= intervals=)
Looks like when pg 1.0 was added, even though there were 2 candidate OSDs (osd.0 and osd.2) present in the osdmap, pg_to_up_acting_osds chose only 1 OSD, and we then start seeing PG_DEGRADED warnings once a couple of objects are written to the pg. Need to check whether anything has changed in this logic in master.
{ "PG_DEGRADED": { "severity": "HEALTH_WARN", "summary": { "message": "Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded", "count": 1 }, "detail": [ { "message": "pg 1.0 is active+undersized+degraded, acting [0]" } ] } }
#12 Updated by Neha Ojha over 1 year ago
Couple of updates:
1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.
2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.
In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?
#13 Updated by Patrick Donnelly over 1 year ago
Neha Ojha wrote:
Couple of updates:
1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.
2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.
Well, in Pacific the .mgr pool does not exist. The device_health_metrics pool should generate the same warning though. There may be a subtle difference?
In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?
https://pulpito.ceph.com/pdonnell-2021-08-27_17:21:13-fs-wip-51282-revert-distro-basic-smithi/
#14 Updated by Neha Ojha over 1 year ago
Patrick Donnelly wrote:
Neha Ojha wrote:
Couple of updates:
1. I am not able to reproduce the issue in pacific with the same reproducer, https://pulpito.ceph.com/nojha-2021-08-25_18:46:59-rados:objectstore-wip-yuri-testing-2021-08-24-0743-pacific-distro-basic-smithi/.
2. I looked at recent changes in the relevant code path and found https://github.com/ceph/ceph/pull/39631, so I decided to revert it and run it through my reproducer, and it passed! https://pulpito.ceph.com/nojha-2021-08-27_03:56:45-rados:objectstore-wip-51282-revert-distro-basic-smithi/
The interesting thing is that the patch was also backported to pacific, but did not cause this issue there.
Well, in Pacific the .mgr pool does not exist. The device_health_metrics pool should generate the same warning though. There may be a subtle difference?
In any case, Patrick/Casey/Deepika, can you please run wip-51282-revert through your corresponding suites and let me know if it fixes the issue for you as well?
https://pulpito.ceph.com/pdonnell-2021-08-27_17:21:13-fs-wip-51282-revert-distro-basic-smithi/
Comparing with the earlier run https://pulpito.ceph.com/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/, there are fewer instances (2, to be exact) of PG_DEGRADED, and they occur not during pool creation but during recovery, where these warnings are expected.
2021-08-27T19:36:02.842+0000 7f179e14c700 10 mon.a@0(leader).osd e19 update_pending_pgs pg 1.0 just added, up [7,10] p 7 acting [7,10] p 7 history ec=20/20 lis/c=0/0 les/c/f=0/0/0 sis=20 past_intervals ([0,0] all_participants= intervals=)
#15 Updated by Deepika Upadhyay over 1 year ago
@Neha, hit this one in the RBD suite:
description: rbd/mirror/{base/install clients/{mirror-extra mirror} cluster/{2-node openstack} msgr-failures/few objectstore/filestore-xfs supported-random-distro$/{rhel_8} workloads/rbd-mirror-workunit-min-compat-client-octopus}
duration: 2079.4988112449646
failure_reason: '"2021-11-02T14:27:49.280378+0000 mon.a (mon.0) 114 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)" in cluster log'
/ceph/teuthology-archive/ideepika-2021-11-02_12:33:30-rbd-wip-ssd-cache-testing-distro-basic-smithi/6477554/teuthology.log
#16 Updated by Casey Bodley about 1 year ago
Patrick Donnelly wrote:
Casey Bodley wrote:
should the rgw suite be whitelisting these too?
I think ignorelisting these errors/warnings is a mistake, as they can catch genuine problems in QA testing. Once this issue is fixed, it's likely we'll forget to remove the ignorelist entry.
As annoying as it is, we should let this keep failing QA to give us the proper motivation to fix it.
Waiting hasn't helped yet. I'm adding this to rgw's ignorelist so it doesn't show up in quincy backport testing.
#17 Updated by Sridhar Seshasayee 7 months ago
/a/yuriw-2022-06-29_18:22:37-rados-wip-yuri2-testing-2022-06-29-0820-distro-default-smithi/6906054
#18 Updated by Radoslaw Zarzynski 7 months ago
- Priority changed from Urgent to Normal
Lowering the priority because: 1) the issue is basically just an annoyance, and 2) the reproduction rate is pretty low.
#19 Updated by Kamoltat (Junior) Sirivadhna 6 months ago
/a/yuriw-2022-08-04_11:58:29-rados-wip-yuri3-testing-2022-08-03-0828-pacific-distro-default-smithi/6958108
#20 Updated by Kotresh Hiremath Ravishankar 6 months ago
#21 Updated by Laura Flores 5 months ago
- Duplicated by Bug #55825: cluster [WRN] Health check failed: Degraded data redundancy: 1 pg degraded (PG_DEGRADED)" in cluster added
#22 Updated by Laura Flores 5 months ago
/a/yuriw-2022-08-22_20:21:58-rados-wip-yuri11-testing-2022-08-22-1005-distro-default-smithi/6986324
Description: rados/monthrash/{ceph clusters/9-mons mon_election/connectivity msgr-failures/few msgr/async objectstore/bluestore-comp-zlib rados supported-random-distro$/{ubuntu_latest} thrashers/sync workloads/rados_5925}
2022-08-22T22:20:05.636 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.659 INFO:teuthology.orchestra.run.smithi050.stdout:2022-08-22T22:17:00.832057+0000 mon.d (mon.6) 246 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)
2022-08-22T22:20:05.660 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-08-22T22:20:05.660 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.722 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.740 DEBUG:teuthology.orchestra.run.smithi050:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v ScrubResult | egrep -v 'scrub mismatch' | egrep -v 'overall HEALTH_' | egrep -v '\(MGR_DOWN\)' | egrep -v '\(PG_AVAILABILITY\)' | egrep -v '\(SLOW_OPS\)' | egrep -v 'slow request' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'overall HEALTH_' | egrep -v '\(MON_DOWN\)' | egrep -v 'overall HEALTH_' | egrep -v '\(POOL_APP_NOT_ENABLED\)' | head -n 1
2022-08-22T22:20:05.815 INFO:teuthology.orchestra.run.smithi050.stdout:2022-08-22T22:17:00.832057+0000 mon.d (mon.6) 246 : cluster [WRN] Health check failed: Degraded data redundancy: 2/4 objects degraded (50.000%), 1 pg degraded (PG_DEGRADED)
Passed in a rerun.
#23 Updated by Laura Flores 5 months ago
- Tags set to test-failure
#24 Updated by Radoslaw Zarzynski 5 months ago
- Assignee changed from Neha Ojha to Nitzan Mordechai
We would need to judge whether it's expected. If so, we can extend the ignorelist.