Bug #64865

cephadm: Health check failed: 1 osds down (OSD_DOWN) in cluster log

Added by Sridhar Seshasayee about 2 months ago. Updated 18 days ago.

Status: Resolved
Priority: Normal
Category: orchestrator
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The following tests in the cephadm suite failed with the OSD_DOWN warning in the cluster log:

/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587301
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587410
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587630
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587912
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587938

The tests above bring down OSDs, so the OSD_DOWN events are expected. The logs below
from 7587301 show that the OSD_DOWN warning is raised temporarily and eventually cleared:

2024-03-09T17:13:56.698 INFO:teuthology.orchestra.run.smithi012.stderr:+ ceph orch osd rm status
2024-03-09T17:13:56.699 INFO:teuthology.orchestra.run.smithi012.stderr:+ grep '^1'
...
2024-03-09T17:13:57.039 INFO:teuthology.orchestra.run.smithi012.stdout:1    smithi012  done, waiting for purge    0  False    False  False
2024-03-09T17:13:57.040 INFO:teuthology.orchestra.run.smithi012.stderr:+ sleep 5
...
2024-03-09T17:13:58.008 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: Health check failed: 1 osds down (OSD_DOWN)
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd='[{"prefix": "osd down", "ids": ["1"]}]': finished
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: osdmap e45: 8 total, 7 up, 8 in
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: osd.1 now down
2024-03-09T17:13:58.009 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:13:57 smithi012 ceph-mon[25263]: Removing daemon osd.1 from smithi012 -- ports []
...
2024-03-09T17:14:02.235 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Removing key for osd.1
2024-03-09T17:14:02.235 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd=[{"prefix": "auth rm", "entity": "osd.1"}]: dispatch
2024-03-09T17:14:02.235 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd='[{"prefix": "auth rm", "entity": "osd.1"}]': finished
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Successfully removed osd.1 on smithi012
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: from='mgr.14215 172.21.15.12:0/2551335845' entity='mgr.smithi012.zfjpsz' cmd=[{"prefix": "osd purge-actual", "id": 1, "yes_i_really_mean_it": true}]: dispatch
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Health check cleared: OSD_DOWN (was: 1 osds down)
2024-03-09T17:14:02.236 INFO:journalctl@ceph.mon.smithi012.smithi012.stdout:Mar 09 17:14:01 smithi012 ceph-mon[25263]: Cluster is now healthy

Therefore, these tests should add the OSD_DOWN warning to their log ignorelist, as sketched below.
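For reference, a minimal sketch of the kind of teuthology override such a fix would add, assuming the usual log-ignorelist mechanism in the affected qa suite YAML (the exact files and entries are those changed by PR 56613 and are not reproduced here):

# Hypothetical qa suite override; PR 56613 may touch different files
# and may ignore additional health warnings.
overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)

With an entry like this in place, teuthology no longer fails the run when the cluster log contains the transient "1 osds down (OSD_DOWN)" health warning.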


Related issues 1 (0 open, 1 closed)

Copied to Orchestrator - Backport #65414: squid: cephadm: Health check failed: 1 osds down (OSD_DOWN) in cluster log (status: Resolved, assignee: Nitzan Mordechai)
#1

Updated by Sridhar Seshasayee about 2 months ago

  • Tags set to test-failure
  • Tags deleted (test-failure)
#2

Updated by Aishwarya Mathuria about 1 month ago

/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609832

#3

Updated by Nitzan Mordechai about 1 month ago

/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7620805
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7620914
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7620938
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621014
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621050
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621076
/a/yuriw-2024-03-25_00:22:23-rados-wip-yuri3-testing-2024-03-24-1519-distro-default-smithi/7621103

#4

Updated by Nitzan Mordechai about 1 month ago

  • Assignee set to Nitzan Mordechai
#5

Updated by Nitzan Mordechai about 1 month ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56613
#6

Updated by Adam King 24 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to squid
#7

Updated by Backport Bot 24 days ago

  • Copied to Backport #65414: squid: cephadm: Health check failed: 1 osds down (OSD_DOWN) in cluster log added
#8

Updated by Backport Bot 24 days ago

  • Tags set to backport_processed
#9

Updated by Adam King 18 days ago

  • Status changed from Pending Backport to Resolved
