Bug #53723

closed

Cephadm agent fails to report and causes a health timeout

Added by Laura Flores over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2021-12-22_22:11:35-rados-wip-yuri3-testing-2021-12-22-1047-distro-default-smithi/6580439

Description: rados/cephadm/workunits/{agent/on mon_election/connectivity task/test_orch_cli}
Failure reason: timeout expired in wait_until_healthy

2021-12-23T06:18:45.300 INFO:teuthology.orchestra.run.smithi068.stdout:
2021-12-23T06:18:45.300 INFO:teuthology.orchestra.run.smithi068.stdout:{"status":"HEALTH_WARN","checks":{"CEPHADM_AGENT_DOWN":{"severity":"HEALTH_WARN","summary":{"message":"1 Cephadm Agent(s) are not reporting. Hosts may be offline","count":1},"muted":false},"CEPHADM_FAILED_DAEMON":{"severity":"HEALTH_WARN","summary":{"message":"1 failed cephadm daemon(s)","count":1},"muted":false}},"mutes":[]}
2021-12-23T06:18:45.323 INFO:journalctl@ceph.mon.a.smithi068.stdout:Dec 23 06:18:44 smithi068 bash[14626]: cluster 2021-12-23T06:18:43.922425+0000 mgr.a (mgr.14150) 357 : cluster [DBG] pgmap v343: 1 pgs: 1 active+clean; 577 KiB data, 17 MiB used, 268 GiB / 268 GiB avail
2021-12-23T06:18:46.323 INFO:journalctl@ceph.mon.a.smithi068.stdout:Dec 23 06:18:46 smithi068 bash[14626]: audit 2021-12-23T06:18:45.297779+0000 mon.a (mon.0) 347 : audit [DBG] from='client.? 172.21.15.68:0/3865627448' entity='client.admin' cmd=[{"prefix": "health", "format": "json"}]: dispatch
2021-12-23T06:18:46.726 INFO:tasks.cephadm:Teardown begin
2021-12-23T06:18:46.727 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_1121b3c9661a85cfbc852d654ea7d22c1d1be751/qa/tasks/cephadm.py", line 1548, in task
    healthy(ctx=ctx, config=config)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_1121b3c9661a85cfbc852d654ea7d22c1d1be751/qa/tasks/ceph.py", line 1469, in healthy
    manager.wait_until_healthy(timeout=300)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_1121b3c9661a85cfbc852d654ea7d22c1d1be751/qa/tasks/ceph_manager.py", line 3146, in wait_until_healthy
    'timeout expired in wait_until_healthy'
AssertionError: timeout expired in wait_until_healthy
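
The health payload in the log above can be inspected programmatically to see which checks are keeping the cluster out of HEALTH_OK. The following is a minimal sketch, assuming the JSON is copied verbatim from the teuthology log line; the helper name `failing_checks` is hypothetical and is not part of teuthology or Ceph.

```python
import json

# Health payload copied from the teuthology log line in the description.
HEALTH_JSON = (
    '{"status":"HEALTH_WARN","checks":{"CEPHADM_AGENT_DOWN":{"severity":"HEALTH_WARN",'
    '"summary":{"message":"1 Cephadm Agent(s) are not reporting. Hosts may be offline",'
    '"count":1},"muted":false},"CEPHADM_FAILED_DAEMON":{"severity":"HEALTH_WARN",'
    '"summary":{"message":"1 failed cephadm daemon(s)","count":1},"muted":false}},"mutes":[]}'
)


def failing_checks(health_json):
    """Return (name, message) pairs for every unmuted health check.

    Hypothetical helper for illustration only.
    """
    health = json.loads(health_json)
    return [
        (name, check["summary"]["message"])
        for name, check in sorted(health.get("checks", {}).items())
        if not check.get("muted", False)
    ]


if __name__ == "__main__":
    for name, message in failing_checks(HEALTH_JSON):
        print(f"{name}: {message}")
```

Both CEPHADM_AGENT_DOWN and CEPHADM_FAILED_DAEMON are raised for what appears to be the same agent failure, which is the double reporting tracked in the related Bug #53448.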

Related issues 1 (0 open, 1 closed)

Related to Orchestrator - Bug #53448: cephadm: agent failures double reported by two health checks (Resolved, Adam King)

Actions #1

Updated by Laura Flores over 2 years ago

/a/yuriw-2021-12-17_22:45:37-rados-wip-yuri10-testing-2021-12-17-1119-distro-default-smithi/6569647

Actions #2

Updated by Laura Flores over 2 years ago

Actions #3

Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #53448: cephadm: agent failures double reported by two health checks added
Actions #4

Updated by Adam King over 2 years ago

  • Assignee set to Adam King

Going by the Sentry event for these failures, this became a common failure right as https://github.com/ceph/ceph/pull/44031 merged. I think the agent-down health check may need to be set whenever agents report, so that the cluster reaches a healthy state fast enough for the tests to pass.
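
The timing problem described in this comment can be illustrated with a polling loop that gives up before a stale health check clears. This is a rough sketch only: `poll_until_healthy` and its `get_status` callable are hypothetical and are not the actual `wait_until_healthy` from qa/tasks/ceph_manager.py. The 300-second default simply mirrors the `timeout=300` seen in the traceback.

```python
import time


def poll_until_healthy(get_status, timeout=300, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns "HEALTH_OK" or the timeout expires.

    Hypothetical illustration; returns the elapsed time on success and
    raises AssertionError on timeout, like the failure in the traceback.
    """
    start = clock()
    while True:
        if get_status() == "HEALTH_OK":
            return clock() - start
        if clock() - start >= timeout:
            raise AssertionError("timeout expired in wait_until_healthy")
        sleep(interval)


if __name__ == "__main__":
    # Simulated cluster: the warning clears only on the third poll.
    statuses = iter(["HEALTH_WARN", "HEALTH_WARN", "HEALTH_OK"])
    elapsed = poll_until_healthy(lambda: next(statuses), timeout=5, interval=0.01)
    print(f"healthy after {elapsed:.2f}s")
```

If the warning is only cleared on a slow periodic refresh rather than when the agent reports, a loop like this can hit its timeout even though the agent has already recovered.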

Actions #5

Updated by Laura Flores over 2 years ago

/a/yuriw-2022-01-04_21:52:15-rados-wip-yuri7-testing-2022-01-04-1159-distro-default-smithi/6595253

Actions #6

Updated by Laura Flores over 2 years ago

  • Related to deleted (Bug #53448: cephadm: agent failures double reported by two health checks)
Actions #7

Updated by Laura Flores over 2 years ago

  • Related to Bug #53448: cephadm: agent failures double reported by two health checks added
Actions #8

Updated by Adam King over 2 years ago

  • Status changed from New to In Progress
  • Pull request ID set to 44489
Actions #9

Updated by Sebastian Wagner over 2 years ago

  • Status changed from In Progress to Pending Backport
Actions #10

Updated by Redouane Kachach Elhichou almost 2 years ago

backported to quincy

Actions #11

Updated by Adam King almost 2 years ago

  • Status changed from Pending Backport to Resolved
  • Backport set to quincy