Bug #64865

cephadm: Health check failed: 1 osds down (OSD_DOWN) in cluster log

Added by Sridhar Seshasayee about 2 months ago. Updated 18 days ago.

Status: Resolved
Priority: Normal
Category: orchestrator
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The following tests in the cephadm suite failed with the OSD_DOWN warning in the cluster log:

/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587301
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587410
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587630
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587912
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587938

The tests above bring down OSDs, so the OSD_DOWN events are expected. The logs below
from 7587301 show that the OSD_DOWN warning is raised temporarily and eventually cleared:

2024-03-09T17:13:56.698 INFO:teuthology.orchestra.run.smithi012.stderr:+ ceph orch osd rm status
2024-03-09T17:13:56.699 INFO:teuthology.orchestra.run.smithi012.stderr:+ grep '^1'
...
2024-03-09T17:13:57.039 INFO:teuthology.orchestra.run.smithi012.stdout:1    smithi012  done, waiting for purge    0  False    False  False
2024-03-09T17:13:57.040 INFO:teuthology.orchestra.run.smithi012.stderr:+ sleep 5
...
2024-03-09T17:13:58.008 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: Health check failed: 1 osds down (OSD_DOWN)
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd='[{"prefix": "osd down", "ids": ["1"]}]': finished
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: osdmap e45: 8 total, 7 up, 8 in
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: osd.1 now down
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: Removing daemon osd.1 from smithi012 -- ports []
...
2024-03-09T17:14:02.235 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Removing key for osd.1
2024-03-09T17:14:02.235 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd=[{"prefix": "auth rm", "entity": "osd.1"}]: dispatch
2024-03-09T17:14:02.235 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd='[{"prefix": "auth rm", "entity": "osd.1"}]': finished
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Successfully removed osd.1 on smithi012
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd=[{"prefix": "osd purge-actual", "id": 1, "yes_i_really_mean_it": true}]: dispatch
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Health check cleared: OSD_DOWN (was: 1 osds down)
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Cluster is now healthy

Therefore, these tests should add the OSD_DOWN warning to their log ignorelist, as sketched below.
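For reference, a minimal sketch of the kind of teuthology override such a fix would add, assuming the usual log-ignorelist mechanism in the affected qa suite YAML (the exact files and entries are those changed by PR 56613 and are not reproduced here):

# Hypothetical qa suite override; PR 56613 may touch different files
# and may ignore additional health warnings.
overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)

With an entry like this in place, teuthology no longer fails the run when the cluster log contains the transient "1 osds down (OSD_DOWN)" health warning.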


Related issues 1 (0 open, 1 closed)

Copied to Orchestrator - Backport #65414: squid: cephadm: Health check failed: 1 osds down (OSD_DOWN) in cluster log (status: Resolved, assignee: Nitzan Mordechai)
#1

Updated by Sridhar Seshasayee about 2 months ago

  • Tags set to test-failure
  • Tags deleted (test-failure)
#2

Updated by Aishwarya Mathuria about 1 month ago

/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609832

#3

Updated by Nitzan Mordechai about 1 month ago

/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7620805
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7620914
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7620938
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621014
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621050
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621076
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621103

#4

Updated by Nitzan Mordechai about 1 month ago

  • Assignee set to Nitzan Mordechai
#5

Updated by Nitzan Mordechai about 1 month ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56613
#6

Updated by Adam King 24 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to squid
#7

Updated by Backport Bot 24 days ago

  • Copied to Backport #65414: squid: cephadm: Health check failed: 1 osds down (OSD_DOWN) in cluster log added
#8

Updated by Backport Bot 24 days ago

  • Tags set to backport_processed
#9

Updated by Adam King 18 days ago

  • Status changed from Pending Backport to Resolved
