Cleanup #65521
open
Add expected warnings in cluster log to ignorelists
Added by Laura Flores about 1 month ago.
Updated 6 days ago.
Description
Relevant Slack conversation:
Hey all, as I brought up in today's RADOS call, there has been a surge of cluster warnings in the rados and upgrade suites due to the merge of https://github.com/ceph/ceph/pull/54312 to main and squid.
Here are recent main baselines, where we have a huge percentage of failures due to cluster warnings:
Squid doesn't look nearly as bad, but it still needs some attention, especially in the upgrade suite:
I've been making tracker issues to fix a lot of these warnings, but since there are so many and they are non-deterministic, I think this will need to be a group effort.
Here are some I've opened lately:
Any ideas on how we can effectively divide up the work and fix the suites are welcome. The idea is to go through each failure, identify whether the warning is expected (e.g. OSD_DOWN warnings are expected in thrash tests), and add it to the correct ignorelist in a PR like this: https://github.com/ceph/ceph/pull/56619
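For anyone picking one of these up: an ignorelist entry is just a regular expression matched against cluster log lines, added under the suite's ceph overrides. A minimal sketch of the shape of such a change (the patterns below are illustrative, not copied from the PR above):

overrides:
  ceph:
    log-ignorelist:
      # warnings that are expected side effects of this particular test
      - \(OSD_DOWN\)
      - \(POOL_APP_NOT_ENABLED\)

The important part is scoping: an entry belongs in the yaml fragment for the tests where the warning is genuinely expected, not in a global list that could mask real regressions.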
The mon_cluster_log_to_file change has not yet been backported to Quincy or Reef, but the same work will need to be done for those branches. I think we should run all suites against those backport patches and merge them along with the ignorelist changes, rather than merging first and fixing things afterward.
Related issues
9 (9 open, 0 closed)
- Related to Bug #65422: upgrade/quincy-x/parallel: "1 pg degraded (PG_DEGRADED)" in cluster log added
- Related to Bug #64868: cephadm/osds, cephadm/workunits: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) in cluster log added
- Related to Bug #65235: upgrade/reef-x/stress-split: "OSDMAP_FLAGS: noscrub flag(s) set" warning in cluster log added
- Related to Bug #62776: rados: cluster [WRN] overall HEALTH_WARN - do not have an application enabled added
- Related to Bug #64870: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659305
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664685
"2024-04-20T16:10:00.000158+0000 mon.a (mon.0) 1407 : cluster [WRN] Health detail: HEALTH_WARN nodeep-scrub flag(s) set; Reduced data availability: 1 pg peering" in cluster log
In this one, we are intentionally setting OSDs down, so the warning is expected; a matching ignorelist sketch follows the log excerpt below.
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664689
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "config generate-minimal-conf"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "auth get", "entity": "client.admin"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-04-20T16:04:18.136 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:17 smithi012 ceph-mon[17195]: from='mgr.14152 172.21.15.12:0/2573228170' entity='mgr.a' cmd=[{"prefix": "osd down", "ids": ["3"]}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-04-20T16:04:19.135 INFO:journalctl@ceph.mon.a.smithi012.stdout:Apr 20 16:04:18 smithi012 ceph-mon[17195]: Health check failed: 1 osds down (OSD_DOWN)
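Since the test itself dispatches the "osd down" command here, a hedged fix is an OSD_DOWN entry in the yaml fragment that drives this test (which fragment is the right one is an assumption; pattern shown for illustration):

overrides:
  ceph:
    log-ignorelist:
      # the test deliberately marks an OSD down, so this health check is expected
      - \(OSD_DOWN\)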
- Related to Bug #64872: rados/cephadm/smoke: Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON) in cluster log added
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664686
2024-04-20T16:26:04.659 INFO:teuthology.orchestra.run.smithi144.stdout:2024-04-20T16:10:00.000158+0000 mon.a (mon.0) 1407 : cluster [WRN] Health detail: HEALTH_WARN nodeep-scrub flag(s) set; Reduced data availability: 1 pg peering
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664765
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664810
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664854
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664891
POOL_APP_NOT_ENABLED
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664903
2024-04-20T17:46:51.770 INFO:teuthology.orchestra.run.smithi012.stdout:2024-04-20T17:44:38.893501+0000 mon.a (mon.0) 1023 : cluster [WRN] Health check failed: 2 Cephadm Agent(s) are not reporting. Hosts may be offline (CEPHADM_AGENT_DOWN)
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664940
OSD_DOWN
- Related to Bug #65728: Daemon managed by cephadm in an unknown state (CEPHADM_FAILED_DAEMON) added
/a/yuriw-2024-04-20_01:10:46-rados-wip-yuri7-testing-2024-04-18-1351-reef-distro-default-smithi/7664127
/a/yuriw-2024-04-20_01:10:46-rados-wip-yuri7-testing-2024-04-18-1351-reef-distro-default-smithi/7664245
- Related to Bug #65768: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
- Related to Bug #65824: rados/thrash-old-clients: "cluster [WRN] Health detail: HEALTH_WARN noscrub flag(s) set" in cluster log added (a matching ignorelist sketch follows the run links below)
/a/yuriw-2024-05-04_16:45:43-rados-wip-yuriw-testing-20240503.213524-main-distro-default-smithi/7691265
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652465
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652467
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652474
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652477
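For the #65824 runs above: the thrashers set the noscrub/nodeep-scrub flags on purpose, so these warnings are expected in those suites. A sketch of entries that would cover them (patterns illustrative; suites that pull in qa/tasks/thrashosds-health.yaml may already carry similar ones):

overrides:
  ceph:
    log-ignorelist:
      # scrub flags are set deliberately by the thrasher
      - \(OSDMAP_FLAGS\)
      # brief peering/degraded states while OSDs are thrashed
      - \(PG_AVAILABILITY\)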