Bug #54521
opendaemon: Error while waiting for process to exit
Description
This causes a dead job: hit max job timeout
/a/yuriw-2022-03-04_21:56:41-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6721637
2022-03-05T10:00:47.896 INFO:tasks.ceph.mgr.y:Stopped
2022-03-05T10:00:47.897 DEBUG:tasks.ceph.mgr.x:waiting for process to exit
2022-03-05T10:00:47.898 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.899 INFO:tasks.ceph.mgr.x:Stopped
2022-03-05T10:00:47.899 INFO:teuthology.misc:Shutting down mon daemons...
2022-03-05T10:00:47.900 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2022-03-05T10:00:47.901 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.902 INFO:tasks.ceph.mon.a:Stopped
2022-03-05T10:00:47.902 DEBUG:tasks.ceph.mon.c:waiting for process to exit
2022-03-05T10:00:47.903 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.904 INFO:tasks.ceph.mon.c:Stopped
2022-03-05T10:00:47.905 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2022-03-05T10:00:47.905 INFO:teuthology.orchestra.run:waiting for 300
2022-03-05T10:00:47.906 DEBUG:teuthology.orchestra.run:got remote process result: 42
2022-03-05T10:00:47.907 ERROR:teuthology.orchestra.daemon.state:Error while waiting for process to exit
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph.py", line 1895, in task
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/thrashosds.py", line 215, in task
    cluster_manager.wait_for_all_osds_up()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2759, in wait_for_all_osds_up
    while not self.are_all_osds_up():
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2749, in are_all_osds_up
    x = self.get_osd_dump()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2522, in get_osd_dump
    return self.get_osd_dump_json()['osds']
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2514, in get_osd_dump_json
    out = self.raw_cluster_cmd('osd', 'dump', '--format=json')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1597, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1588, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi039 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump --format=json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph.py", line 1438, in run_daemon
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph.py", line 1899, in task
    ctx.managers[config['cluster']].stop_pg_num_changes()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2280, in stop_pg_num_changes
    osd_dump = self.get_osd_dump_json()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 2514, in get_osd_dump_json
    out = self.raw_cluster_cmd('osd', 'dump', '--format=json')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1597, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1588, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eeaeee38264fad8a01fd1ce912d8908676ed64bc/teuthology/orchestra/run.py", line 183, in _raise_for_status
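Note the status 124 in the CommandFailedError above: the `ceph osd dump` command was wrapped in `timeout 120`, and 124 is the exit status GNU coreutils `timeout` returns when it kills a command for exceeding its time limit, so the command hung rather than failing outright. A minimal sketch of that convention:

```shell
# GNU coreutils `timeout` exits with 124 when the wrapped command is
# killed for running past its limit. `sleep 5` stands in for the hung
# `ceph osd dump` here.
timeout 1 sleep 5
echo $?
```

This is why the teuthology wrapper reports "status 124" instead of a failure code from ceph itself.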
Updated by Laura Flores about 2 years ago
This looks a lot like a valgrind failure, but unfortunately no osd logs were collected.
lflores@teuthology:/a/yuriw-2022-03-04_21:56:41-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6721637$ cat teuthology.log | grep "Exit "
2022-03-05T04:07:51.117 INFO:tasks.ceph.osd.5.smithi125.stderr:==00:00:03:42.941 63268== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:03.265 INFO:tasks.ceph.mon.b.smithi125.stderr:==00:00:20:37.374 37296== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.568 INFO:tasks.ceph.osd.4.smithi125.stderr:==00:00:20:17.139 37933== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.680 INFO:tasks.ceph.osd.2.smithi039.stderr:==00:00:20:17.265 37830== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.778 INFO:tasks.ceph.osd.7.smithi125.stderr:==00:00:20:17.337 37934== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.839 INFO:tasks.ceph.osd.0.smithi039.stderr:==00:00:20:17.432 37825== Exit program on first error (--exit-on-first-error=yes)
2022-03-05T04:13:09.941 INFO:tasks.ceph.osd.1.smithi039.stderr:==00:00:20:17.527 37832== Exit program on first error (--exit-on-first-error=yes)
Exited first on osd.5, smithi125. However, no logs were collected.
Sometimes this test fails because of a lagging/failed LibRados test or a failed ceph_assert. However, none of that is to be found in the teuthology log.
Updated by Aishwarya Mathuria about 2 years ago
/a/yuriw-2022-03-29_21:35:01-rados-wip-yuri3-testing-2022-03-29-1133-distro-default-smithi/6767712
Updated by Neha Ojha about 2 years ago
Aishwarya Mathuria wrote:
/a/yuriw-2022-03-29_21:35:01-rados-wip-yuri3-testing-2022-03-29-1133-distro-default-smithi/6767712
/a/yuriw-2022-03-29_21:35:01-rados-wip-yuri3-testing-2022-03-29-1133-distro-default-smithi/6767712/remote/smithi109/log/valgrind/osd.6.log.gz - This is a dup of https://tracker.ceph.com/issues/52124
Updated by Neha Ojha about 2 years ago
- Status changed from New to Need More Info
2022-03-05T10:00:50.833 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.0.log kind <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.833 ERROR:tasks.ceph:saw valgrind issue <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.0.log
2022-03-05T10:00:50.839 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.1.log kind <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.839 ERROR:tasks.ceph:saw valgrind issue <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.1.log
2022-03-05T10:00:50.846 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.2.log kind <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.847 ERROR:tasks.ceph:saw valgrind issue <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.2.log
2022-03-05T10:00:50.858 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/mon.b.log: <kind>Leak_IndirectlyLost</kind>
2022-03-05T10:00:50.858 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/osd.4.log: <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.859 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/osd.5.log: <kind>InvalidRead</kind>
2022-03-05T10:00:50.859 INFO:teuthology.orchestra.run.smithi125.stdout:/var/log/ceph/valgrind/osd.7.log: <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.859 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/mon.b.log kind <kind>Leak_IndirectlyLost</kind>
2022-03-05T10:00:50.860 ERROR:tasks.ceph:saw valgrind issue <kind>Leak_IndirectlyLost</kind> in /var/log/ceph/valgrind/mon.b.log
2022-03-05T10:00:50.865 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.4.log kind <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.866 ERROR:tasks.ceph:saw valgrind issue <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.4.log
2022-03-05T10:00:50.872 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.5.log kind <kind>InvalidRead</kind>
2022-03-05T10:00:50.872 ERROR:tasks.ceph:saw valgrind issue <kind>InvalidRead</kind> in /var/log/ceph/valgrind/osd.5.log
2022-03-05T10:00:50.879 DEBUG:tasks.ceph:file /var/log/ceph/valgrind/osd.7.log kind <kind>Leak_PossiblyLost</kind>
2022-03-05T10:00:50.879 ERROR:tasks.ceph:saw valgrind issue <kind>Leak_PossiblyLost</kind> in /var/log/ceph/valgrind/osd.7.log
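The `<kind>` entries above come from valgrind's XML output, one log per daemon under /var/log/ceph/valgrind/. A hypothetical sketch of tallying those kinds across log files (sample data stands in for the real logs from the run, which were not preserved):

```shell
# Sample valgrind-style log files; the real run's files live under
# /var/log/ceph/valgrind/ on the test node and are gone here.
mkdir -p /tmp/valgrind-demo
printf '<kind>Leak_PossiblyLost</kind>\n' > /tmp/valgrind-demo/osd.0.log
printf '<kind>InvalidRead</kind>\n'       > /tmp/valgrind-demo/osd.5.log
printf '<kind>Leak_PossiblyLost</kind>\n' > /tmp/valgrind-demo/osd.7.log

# -o prints only the matching <kind>...</kind> span, -h suppresses the
# filename prefix; sort | uniq -c tallies each issue kind.
grep -oh '<kind>[^<]*</kind>' /tmp/valgrind-demo/*.log | sort | uniq -c
```

This is how a quick triage of a run's valgrind findings can be done without opening each log.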
We still don't have the valgrind logs.