Bug #47612

open

ERROR: setUpClass (tasks.mgr.dashboard.test_health.HealthTest)

Added by Ernesto Puerta over 3 years ago. Updated 7 days ago.

Status: New
Priority: Normal
Category: Testing - API
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Nautilus:

2020-09-23 10:28:57,731.731 INFO:__main__:----------------------------------------------------------------------
2020-09-23 10:28:57,731.731 INFO:__main__:Traceback (most recent call last):
2020-09-23 10:28:57,732.732 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 117, in setUpClass
2020-09-23 10:28:57,732.732 INFO:__main__:    cls._load_module("dashboard")
2020-09-23 10:28:57,732.732 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/mgr_test_case.py", line 144, in _load_module
2020-09-23 10:28:57,732.732 INFO:__main__:    cls.wait_until_true(has_restarted, timeout=30)
2020-09-23 10:28:57,733.733 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 197, in wait_until_true
2020-09-23 10:28:57,733.733 INFO:__main__:    raise TestTimeoutError("Timed out after {0}s".format(elapsed))
2020-09-23 10:28:57,733.733 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s
2020-09-23 10:28:57,733.733 INFO:__main__:
2020-09-23 10:28:57,734.734 INFO:__main__:----------------------------------------------------------------------
2020-09-23 10:28:57,734.734 INFO:__main__:Ran 49 tests in 559.575s
2020-09-23 10:28:57,734.734 INFO:__main__:
2020-09-23 10:28:57,734.734 INFO:__main__:
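For context, the 30 s timeout in the traceback above comes from a simple polling loop in `ceph_test_case.py`. A minimal sketch of such a helper (a hypothetical reimplementation for illustration only, not the actual teuthology code; the real version's period and bookkeeping may differ):

```python
import time

class TestTimeoutError(Exception):
    """Raised when a condition does not become true within the timeout."""

def wait_until_true(condition, timeout, period=5):
    # Poll `condition` every `period` seconds; raise TestTimeoutError
    # once `timeout` seconds have elapsed without it returning True.
    elapsed = 0
    while not condition():
        if elapsed >= timeout:
            raise TestTimeoutError("Timed out after {0}s".format(elapsed))
        time.sleep(period)
        elapsed += period
    return elapsed
```

In the failing run, `has_restarted` (which checks that the mgr came back after the dashboard module was loaded) never returned True within 30 s, so the loop raised.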

Actions #1

Updated by Ernesto Puerta over 3 years ago

  • Description updated (diff)
Actions #2

Updated by Ernesto Puerta about 3 years ago

  • Project changed from mgr to Dashboard
  • Category changed from API Tests to Testing - API
Actions #4

Updated by Casey Bodley 3 months ago

  • Assignee deleted (Laura Paduano)
Actions #7

Updated by Pedro González Gómez 3 months ago

  • Assignee set to Pedro González Gómez
Actions #8

Updated by Casey Bodley about 1 month ago

https://jenkins.ceph.com/job/ceph-api/71512/

2024-03-29 08:38:15,187.187 INFO:__main__:----------------------------------------------------------------------
2024-03-29 08:38:15,187.187 INFO:__main__:Traceback (most recent call last):
2024-03-29 08:38:15,187.187 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 222, in setUpClass
2024-03-29 08:38:15,187.187 INFO:__main__:    cls.mds_cluster.mds_stop()
2024-03-29 08:38:15,187.187 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 364, in mds_stop
2024-03-29 08:38:15,187.187 INFO:__main__:    self._one_or_all(mds_id, lambda id_: self.mds_daemons[id_].stop())
2024-03-29 08:38:15,188.188 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 338, in _one_or_all
2024-03-29 08:38:15,188.188 INFO:__main__:    with parallel() as p:
2024-03-29 08:38:15,188.188 INFO:__main__:  File "/tmp/tmp.SK1qhyV79P/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 84, in __exit__
2024-03-29 08:38:15,188.188 INFO:__main__:    for result in self:
2024-03-29 08:38:15,188.188 INFO:__main__:  File "/tmp/tmp.SK1qhyV79P/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 98, in __next__
2024-03-29 08:38:15,188.188 INFO:__main__:    resurrect_traceback(result)
2024-03-29 08:38:15,188.188 INFO:__main__:  File "/tmp/tmp.SK1qhyV79P/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 30, in resurrect_traceback
2024-03-29 08:38:15,188.188 INFO:__main__:    raise exc.exc_info[1]
2024-03-29 08:38:15,188.188 INFO:__main__:  File "/tmp/tmp.SK1qhyV79P/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 23, in capture_traceback
2024-03-29 08:38:15,188.188 INFO:__main__:    return func(*args, **kwargs)
2024-03-29 08:38:15,189.189 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 364, in 
2024-03-29 08:38:15,189.189 INFO:__main__:    self._one_or_all(mds_id, lambda id_: self.mds_daemons[id_].stop())
2024-03-29 08:38:15,189.189 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/build/../qa/tasks/vstart_runner.py", line 563, in stop
2024-03-29 08:38:15,189.189 INFO:__main__:    raise MaxWhileTries(
2024-03-29 08:38:15,189.189 INFO:__main__:teuthology.exceptions.MaxWhileTries: Timed out waiting for daemon mds.a
Actions #11

Updated by Casey Bodley 21 days ago

https://jenkins.ceph.com/job/ceph-api/72561/

2024-04-17 17:40:43,405.405 INFO:__main__:----------------------------------------------------------------------
2024-04-17 17:40:43,405.405 INFO:__main__:Traceback (most recent call last):
2024-04-17 17:40:43,405.405 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 222, in setUpClass
2024-04-17 17:40:43,405.405 INFO:__main__:    cls.mds_cluster.mds_stop()
2024-04-17 17:40:43,405.405 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 364, in mds_stop
2024-04-17 17:40:43,405.405 INFO:__main__:    self._one_or_all(mds_id, lambda id_: self.mds_daemons[id_].stop())
2024-04-17 17:40:43,405.405 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 338, in _one_or_all
2024-04-17 17:40:43,405.405 INFO:__main__:    with parallel() as p:
2024-04-17 17:40:43,405.405 INFO:__main__:  File "/tmp/tmp.E54XocQKFg/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 84, in __exit__
2024-04-17 17:40:43,405.405 INFO:__main__:    for result in self:
2024-04-17 17:40:43,406.406 INFO:__main__:  File "/tmp/tmp.E54XocQKFg/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 98, in __next__
2024-04-17 17:40:43,406.406 INFO:__main__:    resurrect_traceback(result)
2024-04-17 17:40:43,406.406 INFO:__main__:  File "/tmp/tmp.E54XocQKFg/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 30, in resurrect_traceback
2024-04-17 17:40:43,406.406 INFO:__main__:    raise exc.exc_info[1]
2024-04-17 17:40:43,406.406 INFO:__main__:  File "/tmp/tmp.E54XocQKFg/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 23, in capture_traceback
2024-04-17 17:40:43,406.406 INFO:__main__:    return func(*args, **kwargs)
2024-04-17 17:40:43,406.406 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 364, in 
2024-04-17 17:40:43,406.406 INFO:__main__:    self._one_or_all(mds_id, lambda id_: self.mds_daemons[id_].stop())
2024-04-17 17:40:43,406.406 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/build/../qa/tasks/vstart_runner.py", line 563, in stop
2024-04-17 17:40:43,406.406 INFO:__main__:    raise MaxWhileTries(
2024-04-17 17:40:43,407.407 INFO:__main__:teuthology.exceptions.MaxWhileTries: Timed out waiting for daemon mds.a
Actions #12

Updated by Casey Bodley 8 days ago

https://jenkins.ceph.com/job/ceph-api/73295/

2024-05-01 14:33:49,801.801 INFO:__main__:----------------------------------------------------------------------
2024-05-01 14:33:49,801.801 INFO:__main__:Traceback (most recent call last):
2024-05-01 14:33:49,801.801 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 222, in setUpClass
2024-05-01 14:33:49,801.801 INFO:__main__:    cls.mds_cluster.mds_stop()
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 370, in mds_stop
2024-05-01 14:33:49,802.802 INFO:__main__:    self._one_or_all(mds_id, lambda id_: self.mds_daemons[id_].stop())
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 344, in _one_or_all
2024-05-01 14:33:49,802.802 INFO:__main__:    with parallel() as p:
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/tmp/tmp.cbe5g4hISM/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 84, in __exit__
2024-05-01 14:33:49,802.802 INFO:__main__:    for result in self:
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/tmp/tmp.cbe5g4hISM/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 98, in __next__
2024-05-01 14:33:49,802.802 INFO:__main__:    resurrect_traceback(result)
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/tmp/tmp.cbe5g4hISM/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 30, in resurrect_traceback
2024-05-01 14:33:49,802.802 INFO:__main__:    raise exc.exc_info[1]
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/tmp/tmp.cbe5g4hISM/venv/lib/python3.10/site-packages/teuthology/parallel.py", line 23, in capture_traceback
2024-05-01 14:33:49,802.802 INFO:__main__:    return func(*args, **kwargs)
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/cephfs/filesystem.py", line 370, in 
2024-05-01 14:33:49,802.802 INFO:__main__:    self._one_or_all(mds_id, lambda id_: self.mds_daemons[id_].stop())
2024-05-01 14:33:49,802.802 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/build/../qa/tasks/vstart_runner.py", line 573, in stop
2024-05-01 14:33:49,802.802 INFO:__main__:    raise MaxWhileTries(
2024-05-01 14:33:49,802.802 INFO:__main__:teuthology.exceptions.MaxWhileTries: Timed out waiting for daemon mds.a

Hi @Pedro González Gómez, have you made any progress with this?

Actions #13

Updated by Leonid Usov 8 days ago

I was about to open an issue and found this one with a search. I suspect this should be handled by the CephFS team, but before we can proceed we should raise the MDS logging level to at least 10/10. Currently, the only logs I see for failed runs like this one (https://jenkins.ceph.com/job/ceph-api/73292/) are:

2024-05-01T13:42:20.605+0000 7f0883e6fbc0  0 ceph version Development (no_version) squid (dev), process ceph-mds, pid 3146654
2024-05-01T13:42:20.605+0000 7f0883e6fbc0  1 main not setting numa affinity
2024-05-01T13:42:20.625+0000 7f087f632640  1 mds.a Updating MDS map to version 1 from mon.0
2024-05-01T13:42:20.629+0000 7f087f632640  1 mds.a Updating MDS map to version 2 from mon.0
2024-05-01T13:42:20.629+0000 7f087f632640  1 mds.a Monitors have assigned me to become a standby
2024-05-01T13:48:53.961+0000 7f16b6bf4bc0 -1 WARNING: all dangerous and experimental features are enabled.
2024-05-01T13:48:53.977+0000 7f16b6bf4bc0 -1 WARNING: all dangerous and experimental features are enabled.
2024-05-01T13:48:53.977+0000 7f16b6bf4bc0  0 ceph version Development (no_version) squid (dev), process ceph-mds, pid 3165902
2024-05-01T13:48:53.977+0000 7f16b6bf4bc0  1 main not setting numa affinity
2024-05-01T13:48:53.981+0000 7f16b6bf4bc0 -1 WARNING: all dangerous and experimental features are enabled.
2024-05-01T13:48:54.017+0000 7f16b23b7640  1 mds.a Updating MDS map to version 7 from mon.1
2024-05-01T13:48:54.317+0000 7f16b23b7640  1 mds.a Updating MDS map to version 8 from mon.1
2024-05-01T13:48:54.317+0000 7f16b23b7640  1 mds.a Monitors have assigned me to become a standby
2024-05-01T13:48:54.373+0000 7f16b23b7640  1 mds.a Updating MDS map to version 9 from mon.1
2024-05-01T13:48:54.377+0000 7f16b23b7640  1 mds.0.9 handle_mds_map I am now mds.0.9
2024-05-01T13:48:54.377+0000 7f16b23b7640  1 mds.0.9 handle_mds_map state change up:standby --> up:creating
2024-05-01T13:48:54.377+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x1
2024-05-01T13:48:54.377+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x100
2024-05-01T13:48:54.377+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x600
2024-05-01T13:48:54.377+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x601
2024-05-01T13:48:54.377+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x602
2024-05-01T13:48:54.377+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x603
2024-05-01T13:48:54.381+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x604
2024-05-01T13:48:54.381+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x605
2024-05-01T13:48:54.381+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x606
2024-05-01T13:48:54.381+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x607
2024-05-01T13:48:54.381+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x608
2024-05-01T13:48:54.381+0000 7f16b23b7640  0 mds.0.cache creating system inode with ino:0x609
2024-05-01T13:48:54.437+0000 7f16abbaa640  1 mds.0.9 creating_done
2024-05-01T13:48:55.373+0000 7f16b23b7640  1 mds.a Updating MDS map to version 10 from mon.1
2024-05-01T13:48:55.373+0000 7f16b23b7640  1 mds.0.9 handle_mds_map I am now mds.0.9
2024-05-01T13:48:55.373+0000 7f16b23b7640  1 mds.0.9 handle_mds_map state change up:creating --> up:active
2024-05-01T13:48:55.373+0000 7f16b23b7640  1 mds.0.9 recovery_done -- successful recovery!
2024-05-01T13:48:55.373+0000 7f16b23b7640  1 mds.0.9 active_start
2024-05-01T13:48:55.937+0000 7f16b53bd640  1 mds.a asok_command: status {prefix=status} (starting...)
2024-05-01T13:48:59.390+0000 7f16afbb2640 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-05-01T13:49:08.470+0000 7f16b53bd640  1 mds.a asok_command: session ls {format=json,prefix=session ls} (starting...)
2024-05-01T13:49:10.970+0000 7f16b53bd640  1 mds.a asok_command: session ls {format=json,prefix=session ls} (starting...)
2024-05-01T13:49:32.330+0000 7f16b53bd640  1 mds.a asok_command: session ls {format=json,prefix=session ls} (starting...)
2024-05-01T13:50:11.027+0000 7f16b23b7640  1 mds.a Updating MDS map to version 11 from mon.1
2024-05-01T13:50:15.179+0000 7f16b23b7640  1 mds.a Updating MDS map to version 12 from mon.1
2024-05-01T13:50:30.572+0000 7f16b23b7640  1 mds.a Updating MDS map to version 13 from mon.1
2024-05-01T13:50:34.648+0000 7f16b23b7640  1 mds.a Updating MDS map to version 14 from mon.1
2024-05-01T13:57:33.073+0000 7f16b3bba640 -1 received  signal: Terminated from python ../qa/tasks/vstart_runner.py --ignore-missing-binaries --no-verbose tasks.mgr.test_dashboard tasks.mgr.dashboard.test_api tasks.mgr.dashboard.test_auth tasks.mgr.dashboard.test_cephfs tasks.mgr.dashboard.test_cluster_configuration tasks.mgr.dashboard.test_cluster tasks.mgr.dashboard.test_crush_rule tasks.mgr.dashboard.test_erasure_code_profile tasks.mgr.dashboard.test_feedback tasks.mgr.dashboard.test_health tasks.mgr.dashboard.test_host tasks.mgr.dashboard.test_logs tasks.mgr.dashboard.test_mgr_module tasks.mgr.dashboard.test_monitor tasks.mgr.dashboard.test_motd tasks.mgr.dashboard.test_orchestrator tasks.mgr.dashboard.test_osd tasks.mgr.dashboard.test_perf_counters tasks.mgr.dashboard.test_pool tasks.mgr.dashboard.test_rbd_mirroring tasks.mgr.dashboard.test_rbd tasks.mgr.dashboard.test_requests tasks.mgr.dashboard.test_rgw tasks.mgr.dashboard.test_role tasks.mgr.dashboard.test_settings tasks.mgr.dashboard.test_summary tasks.mgr.dashboard.test_telemetry tasks.mgr.dashboard.test_user tasks.mgr.test_module_selftest  (PID: 3154024) UID: 1112

These logs are not sufficient to understand why that MDS caused the timeout.
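Raising the MDS debug levels in the test cluster's configuration would look roughly like this (a sketch; `debug mds = 10/10` sets the log/in-memory levels, following the standard ceph.conf convention — the exact place to set this for the ceph-api vstart runs would need to be confirmed):

```ini
[mds]
    debug mds = 10/10
    debug ms = 1
```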

However, we should also review whether this is a valid approach in DashboardTestCase::setUpClass:

            # To avoid any issues with e.g. unlink bugs, we destroy and recreate
            # the filesystem rather than just doing a rm -rf of files
            cls.mds_cluster.mds_stop()
            cls.mds_cluster.mds_fail()
            cls.mds_cluster.delete_all_filesystems()

We may end up with two separate issues to deal with; in that case, we'd open a separate ticket for the stop timeout and dedicate this ticket to a more stable approach to re-creating the filesystem.
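One possible shape for a more defensive teardown (a hypothetical sketch only, not the actual `helper.py` code; `mds_stop`, `mds_fail`, and `delete_all_filesystems` are assumed to behave as in `filesystem.py`, and `MaxWhileTries` is stood in by a local class):

```python
class MaxWhileTries(Exception):
    """Stand-in for teuthology.exceptions.MaxWhileTries."""

def teardown_filesystems(mds_cluster, stop_retries=2):
    """Destroy filesystems even if the first clean MDS stop times out.

    Retries the stop a bounded number of times before giving up,
    then fails the ranks and deletes the filesystems, rather than
    letting a single stop timeout abort setUpClass outright.
    """
    for attempt in range(stop_retries):
        try:
            mds_cluster.mds_stop()
            break
        except MaxWhileTries:
            if attempt == stop_retries - 1:
                raise  # still stuck after the final retry
    mds_cluster.mds_fail()
    mds_cluster.delete_all_filesystems()
```

Whether retrying masks a genuine daemon hang is exactly the question to settle before adopting anything like this.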

Actions #14

Updated by Patrick Donnelly 7 days ago

Suggest updating your tearDown/setUp procedures to mirror what CephFSTestCase is doing.
