Bug #54521


daemon: Error while waiting for process to exit

Added by Kamoltat (Junior) Sirivadhna about 2 years ago. Updated about 2 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This causes a dead job: hit max job timeout

/a/yuriw-2022-03-04_21:56:41-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6721637

2022-03-05T10:00:47.896 INFO:tasks.ceph.mgr.y:Stopped
2022-03-05T10:00:47.897 DEBUG:tasks.ceph.mgr.x:waiting for process to exit
2022-03-05T10:00:47.898 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.899 INFO:tasks.ceph.mgr.x:Stopped
2022-03-05T10:00:47.899 INFO:teuthology.misc:Shutting down mon daemons...
2022-03-05T10:00:47.900 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2022-03-05T10:00:47.901 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.902 INFO:tasks.ceph.mon.a:Stopped
2022-03-05T10:00:47.902 DEBUG:tasks.ceph.mon.c:waiting for process to exit
2022-03-05T10:00:47.903 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.904 INFO:tasks.ceph.mon.c:Stopped
2022-03-05T10:00:47.905 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2022-03-05T10:00:47.905 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.906 DEBUG:teuthology.orchestra.run:got remote process result: 42
2022-03-05T10:00:47.907 ERROR:teuthology.orchestra.daemon.state:Error while waiting for process to exit
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph.py", line 1895, in task
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/thrashosds.py", line 215, in task
    cluster_manager.wait_for_all_osds_up()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2759, in wait_for_all_osds_up
    while not self.are_all_osds_up():
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2749, in are_all_osds_up
    x = self.get_osd_dump()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2522, in get_osd_dump
    return self.get_osd_dump_json()['osds']
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2514, in get_osd_dump_json
    out = self.raw_cluster_cmd('osd', 'dump', '--format=json')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1597, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1588, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi039 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump --format=json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph.py", line 1438, in run_daemon
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph.py", line 1899, in task
    ctx.managers[config['cluster']].stop_pg_num_changes()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2280, in stop_pg_num_changes
    osd_dump = self.get_osd_dump_json()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2514, in get_osd_dump_json
    out = self.raw_cluster_cmd('osd', 'dump', '--format=json')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1597, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1588, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 183, in _raise_for_status

Actions #1

Updated by Laura Flores about 2 years ago

This looks a lot like a valgrind failure, but unfortunately no OSD logs were collected.

lflores@teuthology:/a/yuriw-2022-03-04_21:56:41-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6721637$ cat teuthology.log | grep "Exit " 
2022-03-05T04:07:51.117 INFO:tasks.ceph.osd.5.smithi125.stderr:==00:00:03:42.941 63268== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:03.265 INFO:tasks.ceph.mon.b.smithi125.stderr:==00:00:20:37.374 37296== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.568 INFO:tasks.ceph.osd.4.smithi125.stderr:==00:00:20:17.139 37933== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.680 INFO:tasks.ceph.osd.2.smithi039.stderr:==00:00:20:17.265 37830== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.778 INFO:tasks.ceph.osd.7.smithi125.stderr:==00:00:20:17.337 37934== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.839 INFO:tasks.ceph.osd.0.smithi039.stderr:==00:00:20:17.432 37825== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.941 INFO:tasks.ceph.osd.1.smithi039.stderr:==00:00:20:17.527 37832== Exit program on first error (--exit-on-first-error=yes)

The first exit was on osd.5 (smithi125); however, no logs were collected.

Sometimes this test fails because of a lagging/failed LibRados test or a failed ceph_assert. However, none of that is to be found in the teuthology log.
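
For reference, those other failure signatures can be ruled out with a scan similar to the grep above; a rough sketch (the patterns are illustrative, not the suite's official checks):

import re

# Illustrative patterns only; the exact strings vary by suite and test.
PATTERNS = [r'ceph_assert', r'LibRados.*FAIL', r'Assertion .* failed']

with open('teuthology.log') as f:
    log = f.read()
for pat in PATTERNS:
    print(pat, len(re.findall(pat, log)))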

Actions #2

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2022-03-29_21:35:01-rados-wip-yuri3-testing-2022-03-29-1133-distro-default-smithi/6767712

Actions #3

Updated by Neha Ojha about 2 years ago

Aishwarya Mathuria wrote:

/a/yuriw-2022-03-29_21:35:01-rados-wip-yuri3-testing-2022-03-29-1133-distro-default-smithi/6767712

/a/yuriw-2022-03-29_21:35:01-rados-wip-yuri3-testing-2022-03-29-1133-distro-default-smithi/6767712/remote/smithi109/log/valgrind/osd.6.log.gz - This is a dup of https://tracker.ceph.com/issues/52124

Actions #4

Updated by Neha Ojha about 2 years ago

  • Status changed from New to Need More Info

2022-03-05T10:00:50.833 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.0.log kind   <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.833 ERROR:tasks.ceph:saw valgrind issue   <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.0.log
2022-03-05T10:00:50.839 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.1.log kind   <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.839 ERROR:tasks.ceph:saw valgrind issue   <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.1.log
2022-03-05T10:00:50.846 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.2.log kind   <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.847 ERROR:tasks.ceph:saw valgrind issue   <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.2.log
2022-03-05T10:00:50.858 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/mon.b.log:  <kind>Leak_IndirectlyLost</kind>
2022-03-05T10:00:50.858 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/osd.4.log:  <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.859 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/osd.5.log:  <kind>InvalidRead</kind>
2022-03-05T10:00:50.859 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/osd.7.log:  <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.859 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/mon.b.log kind   <kind>Leak_IndirectlyLost</kind>
2022-03-05T10:00:50.860 ERROR:tasks.ceph:saw valgrind issue   <kind>Leak_IndirectlyLost</kind> in /var/log/ceph/valgrind/mon.b.log
2022-03-05T10:00:50.865 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.4.log kind   <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.866 ERROR:tasks.ceph:saw valgrind issue   <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.4.log
2022-03-05T10:00:50.872 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.5.log kind   <kind>InvalidRead</kind>
2022-03-05T10:00:50.872 ERROR:tasks.ceph:saw valgrind issue   <kind>InvalidRead</kind> in /var/log/ceph/valgrind/osd.5.log
2022-03-05T10:00:50.879 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.7.log kind   <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.879 ERROR:tasks.ceph:saw valgrind issue   <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.7.log

We still don't have the valgrind logs.
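
For context on where the <kind> lines above come from: the daemons were run under valgrind, whose XML output tags every error with a <kind> element, and the post-run check looks for those tags in each /var/log/ceph/valgrind/*.log. A minimal sketch of that kind of scan (not the actual tasks.ceph code; the allow-list is illustrative):

import glob
import re

# Illustrative allow-list; the real teuthology configuration differs.
IGNORED = {'Leak_StillReachable'}

def scan_valgrind_logs(log_dir='/var/log/ceph/valgrind'):
    issues = {}
    for path in sorted(glob.glob(log_dir + '/*.log')):
        with open(path) as f:
            kinds = set(re.findall(r'<kind>([A-Za-z_]+)</kind>', f.read()))
        bad = kinds - IGNORED
        if bad:
            # e.g. {'/var/log/ceph/valgrind/osd.5.log': {'InvalidRead'}}
            issues[path] = bad
    return issues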
