Bug #64870 (open)
"Health check failed: 1 osds down (OSD_DOWN)" in cluster log
Description
Description of problem
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587648
Test Description:
rados/dashboard/{0-single-container-host debug/mgr mon_election/connectivity random-objectstore$/{bluestore-comp-lz4} tasks/e2e}
The test 04-osds.e2e-spec.ts deliberately marks OSDs down, which raises the OSD_DOWN health warning. The logs show that the warning clears within a few seconds, but because it appears in the cluster log the run is marked failed even though all the dashboard tests passed. Since the OSD_DOWN event is expected here, the warning should probably be added to the ignorelist.
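For reference, a minimal sketch of what such an ignorelist entry typically looks like in a teuthology suite YAML override (the exact file, placement, and any existing entries for this suite are assumptions):

```yaml
overrides:
  ceph:
    log-ignorelist:
      # OSD_DOWN is expected: the e2e test deliberately marks OSDs down,
      # and the warning clears on its own within a few seconds.
      - \(OSD_DOWN\)
```

Entries are regular expressions matched against cluster log lines, hence the escaped parentheses.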
Actual results
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: pgmap v423: 1 pgs: 1 active+clean; 577 KiB data, 419 MiB used, 313 GiB / 313 GiB avail
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: Health check failed: 1 osds down (OSD_DOWN)
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "osd down", "ids": ["3"]}]': finished
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: osdmap e40: 6 total, 5 up, 6 in
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: Monitor daemon marked osd.3 down, but it is still running
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: map e40 wrongly marked me down at e40
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: osd.3 marked itself dead as of e40
2024-03-10T00:32:14.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: osd.3 now down
2024-03-10T00:32:14.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: Removing daemon osd.3 from smithi186 -- ports []
2024-03-10T00:32:14.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: osdmap e41: 6 total, 5 up, 6 in
2024-03-10T00:32:15.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:15 smithi110 ceph-mon[29592]: pgmap v426: 1 pgs: 1 active+clean; 577 KiB data, 208 MiB used, 313 GiB / 313 GiB avail
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd=[{"prefix": "auth rm", "entity": "osd.3"}]: dispatch
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "auth rm", "entity": "osd.3"}]': finished
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd=[{"prefix": "osd purge-actual", "id": 3, "yes_i_really_mean_it": true}]: dispatch
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: Health check cleared: OSD_DOWN (was: 1 osds down)
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: Cluster is now healthy
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "osd purge-actual", "id": 3, "yes_i_really_mean_it": true}]': finished
2024-03-10T00:32:16.642 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: osdmap e42: 5 total, 5 up, 5 in
Updated by Laura Flores 18 days ago
- Subject changed from mgr/dashboard: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log to "Health check failed: 1 osds down (OSD_DOWN)" in cluster log
Also found in an upgrade test:
description: rados/upgrade/parallel/{0-random-distro$/{ubuntu_22.04} 0-start 1-tasks
mon_election/classic upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig
rbd_import_export test_rbd_api test_rbd_python}}
/a/yuriw-2024-04-10_14:17:51-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7650687
Updated by Laura Flores 18 days ago
And in a cephadm test: /a/yuriw-2024-04-10_14:17:51-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7650670
Updated by Laura Flores 18 days ago
- Related to Cleanup #65521: Add expected warnings in cluster log to ignorelists added