Bug #54307

test_cls_rgw.sh: 'index_list_delimited' test times out

Added by Laura Flores about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-02-16_15:53:49-rados-wip-yuri11-testing-2022-02-15-1643-distro-default-smithi/6688784

Last test run before the Traceback:

2022-02-16T19:29:23.645 INFO:tasks.workunit.client.0.smithi070.stdout:[       OK ] cls_rgw.index_list (688 ms)
2022-02-16T19:29:23.645 INFO:tasks.workunit.client.0.smithi070.stdout:[ RUN      ] cls_rgw.index_list_delimited

...

2022-02-16T20:07:19.010 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 347, in kill_osd
    self.ceph_manager.kill_osd(osd)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 2977, in kill_osd
    self.ctx.daemons.get_daemon('osd', osd, self.cluster).stop()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/daemon/state.py", line 139, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds
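
For context, the MaxWhileTries at the bottom of this traceback comes from teuthology's bounded-retry helper in contextutil.py: the thrasher asked the OSD daemon to stop, and the wait loop gave up after exhausting its retry budget. A minimal sketch of that pattern, using a hypothetical wait_until helper rather than teuthology's exact API (the tries/sleep values are chosen to match the "50 tries / 300 seconds" message above):

import time

class MaxWhileTries(Exception):
    """Raised when a polled condition never becomes true in time."""

def wait_until(check, tries=50, sleep=6):
    # Poll `check` until it succeeds; 50 tries spaced 6 seconds apart
    # gives the 300-second budget seen in the log message above.
    for _ in range(tries):
        if check():
            return
        time.sleep(sleep)
    raise MaxWhileTries(
        f"reached maximum tries ({tries}) after waiting "
        f"for {tries * sleep} seconds")

Here the condition that never held was "the killed OSD process has exited", so the stop() call kept polling until the budget ran out.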


Related issues: 1 (0 open, 1 closed)

Has duplicate: RADOS - Bug #54507: workunit test cls/test_cls_rgw: Manager failed: thrashosds (Duplicate)

Actions #1

Updated by Casey Bodley about 2 years ago

  • Assignee set to J. Eric Ivancich
Actions #2

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-04_00:56:58-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6718934

Actions #3

Updated by Laura Flores about 2 years ago

  • Has duplicate Bug #54507: workunit test cls/test_cls_rgw: Manager failed: thrashosds added
Actions #4

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-10_02:41:10-rados-wip-yuri3-testing-2022-03-09-1350-distro-default-smithi/6729279

Actions #5

Updated by Casey Bodley about 2 years ago

Very strange that this is timing out in the rados suite; the same test case is running in the rgw suite without failures.

Actions #6

Updated by Casey Bodley about 2 years ago

@Laura I see there's a duplicate issue https://tracker.ceph.com/issues/54507 that points at problems with ceph-mgr. Is this really an rgw bug?

Actions #7

Updated by Laura Flores about 2 years ago

@Casey the first "timeout" Traceback message appears at the 71% mark in the teuthology log:

2022-02-16T20:07:19.010 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 347, in kill_osd
    self.ceph_manager.kill_osd(osd)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 2977, in kill_osd
    self.ctx.daemons.get_daemon('osd', osd, self.cluster).stop()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/daemon/state.py", line 139, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds

The "Manager failed" Traceback appears later at 99%:

2022-02-16T22:31:46.789 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/thrashosds.py", line 215, in task
    cluster_manager.wait_for_all_osds_up()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 2759, in wait_for_all_osds_up
    while not self.are_all_osds_up():
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 2749, in are_all_osds_up
    x = self.get_osd_dump()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 2522, in get_osd_dump
    return self.get_osd_dump_json()['osds']
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 2514, in get_osd_dump_json
    out = self.raw_cluster_cmd('osd', 'dump', '--format=json')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 1597, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c11e21b2a403e128a89552f2aa1019a3a9f8a012/qa/tasks/ceph_manager.py", line 1588, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_65c38a8ac6d6694ef8ab9ff2325a7faf73afdc22/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi070 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump --format=json'
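
Status 124 is how the coreutils timeout wrapper reports that the command it ran exceeded its deadline, i.e. the ceph CLI itself hung for the full 120 seconds rather than returning an error. A quick illustration of where that exit code comes from (sleep 5 stands in for the hung ceph command; this is not part of the test itself):

import subprocess

# coreutils `timeout` exits with 124 when it kills the wrapped command
# for running past its deadline; `sleep 5` plays the hung ceph CLI here.
proc = subprocess.run(["timeout", "1", "sleep", "5"])
print(proc.returncode)  # prints 124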

This is why I categorized it as an RGW failure: the manager failure seems to happen as a result of the timed-out test. Another detail to note is that the test does not fail deterministically in the rados suite (see this passed run in the most recent master baseline: http://pulpito.front.sepia.ceph.com/yuriw-2022-03-04_21:59:11-rados-master-distro-default-smithi/6721856/), so it's possible the failure simply hasn't shown up yet in the rgw suite. If possible, can you post a link to an example of this test in the rgw suite? I'd like to verify that it's been passing there before refiling this under rados.

Actions #8

Updated by Sridhar Seshasayee over 1 year ago

Seeing this in a Quincy run:
/a/yuriw-2022-08-08_22:19:32-rados-wip-yuri4-testing-2022-08-08-1009-quincy-distro-default-smithi/6962166
