Cleanup #65521
Add expected warnings in cluster log to ignorelists
Status: Open
Description
Relevant Slack conversation:
Hey all, as I brought up in today's RADOS call, there has been a surge of cluster warnings in the rados and upgrade suites due to the merge of https://github.com/ceph/ceph/pull/54312 to main and squid.
Here are recent main and squid baselines, where a huge percentage of failures are due to cluster warnings:
- rados suite - https://pulpito.ceph.com/teuthology-2024-04-14_20:00:15-rados-main-distro-default-smithi/
- upgrade suite - https://pulpito.ceph.com/teuthology-2024-04-13_03:08:05-upgrade-main-distro-default-smithi/
- rados suite - https://pulpito.ceph.com/teuthology-2024-04-11_21:00:03-rados-squid-distro-default-smithi/
- upgrade suite - https://pulpito.ceph.com/teuthology-2024-04-12_02:08:08-upgrade-squid-distro-default-smithi/
Here are some trackers I've opened recently:
- https://tracker.ceph.com/issues/65422
- https://tracker.ceph.com/issues/64868
- https://tracker.ceph.com/issues/65235
- https://tracker.ceph.com/issues/62776
- https://tracker.ceph.com/issues/64870
Any ideas on how we can effectively divide up the work and fix the suites are welcome. The idea is to go through each failure, identify whether the warning is expected (e.g., OSD_DOWN warnings are expected in thrash tests), and add it to the correct ignorelist in a PR like this: https://github.com/ceph/ceph/pull/56619
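For anyone picking this up: ignorelist entries live in the suite yaml under overrides and are matched as regular expressions against cluster log lines. A minimal sketch of what an entry looks like (the exact file and the warnings to add depend on the suite being fixed):

overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)
      - \(POOL_APP_NOT_ENABLED\)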
The mon_cluster_log_to_file change has not yet been backported to Quincy or Reef, but the same work will need to be done for those branches. I think we should run all suites against these patches and merge them along with the ignorelist changes, rather than merging first and fixing second.
- Reef backport - https://github.com/ceph/ceph/pull/55431
- Quincy backport - https://github.com/ceph/ceph/pull/55430
Updated by Laura Flores 19 days ago
- Related to Bug #65422: upgrade/quincy-x/parallel: "1 pg degraded (PG_DEGRADED)" in cluster log added
Updated by Laura Flores 19 days ago
- Related to Bug #64868: cephadm/osds, cephadm/workunits: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) in cluster log added
Updated by Laura Flores 19 days ago
- Related to Bug #65235: upgrade/reef-x/stress-split: "OSDMAP_FLAGS: noscrub flag(s) set" warning in cluster log added
- Related to Bug #62776: rados: cluster [WRN] overall HEALTH_WARN - do not have an application enabled added
- Related to Bug #64870: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Matan Breizman 17 days ago
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659305
Updated by Laura Flores 13 days ago
Updated by Laura Flores 6 days ago
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664685
"2024-04-20T16:10:00.000158+0000 mon.a (mon.0) 1407 : cluster [WRN] Health detail: HEALTH_WARN nodeep-scrub flag(s) set; Reduced data availability: 1 pg peering" in cluster log
Updated by Laura Flores 6 days ago
In this one, we are intentionally setting OSDs down, so the warning is expected.
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664689
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "config generate-minimal-conf"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "auth get", "entity": "client.admin"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd down", "ids": ["3"]}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: Health check failed: 1 osds down (OSD_DOWN)
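Since the log shows the mgr dispatching "osd safe-to-destroy" and "osd down" for osd.3 right before the health check fires, OSD_DOWN is expected here. A sketch of the kind of entry that would suppress it, assuming this test's yaml is the right place (entries are regexes, so either the health code or the fuller message can be matched):

overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)
      - 'Health check failed: \d+ osds down'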
Updated by Laura Flores 6 days ago
- Related to Bug #64872: rados/cephadm/smoke: Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON) in cluster log added
Updated by Laura Flores 5 days ago
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664686
2024-04-20T16:26:04.659 INFO:teuthology.orchestra.run.smithi144.stdout:2024-04-20T16:10:00.000158+0000 mon.a (mon.0) 1407 : cluster [WRN] Health detail: HEALTH_WARN nodeep-scrub flag(s) set; Reduced data availability: 1 pg peering
Updated by Laura Flores 5 days ago · Edited
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664765
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664810
Updated by Laura Flores 5 days ago · Edited
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664854
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664891
POOL_APP_NOT_ENABLED
Updated by Laura Flores 5 days ago
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664903
2024-04-20T17:46:51.770 INFO:teuthology.orchestra.run.smithi012.stdout:2024-04-20T17:44:38.893501+0000 mon.a (mon.0) 1023 : cluster [WRN] Health check failed: 2 Cephadm Agent(s) are not reporting. Hosts may be offline (CEPHADM_AGENT_DOWN)
Updated by Laura Flores 5 days ago
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664940
OSD_DOWN
Updated by Laura Flores 5 days ago
- Related to Bug #65728: Alertmanager in an unknown state added
Updated by Matan Breizman 5 days ago · Edited
/a/yuriw-2024-04-20_01:10:46-rados-wip-yuri7-testing-2024-04-18-1351-reef-distro-default-smithi/7664127
/a/yuriw-2024-04-20_01:10:46-rados-wip-yuri7-testing-2024-04-18-1351-reef-distro-default-smithi/7664245
Updated by Laura Flores 4 days ago
Partial fix for some of the warnings: https://github.com/ceph/ceph/pull/57218