Bug #53789
CommandFailedError (rados/test_python.sh): "RADOS object not found" causes test_rados.TestWatchNotify.test_aio_notify to fail
Description
Description: rados/basic/{ceph clusters/{fixed-2 openstack} mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{rhel_8} tasks/rados_python}
Failure Reason:
Command failed (workunit test rados/test_python.sh) on smithi016 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_python.sh'
/a/lflores-2022-01-05_19:04:35-rados-wip-lflores-mgr-rocksdb-distro-default-smithi/6596519
2022-01-05T19:57:39.635 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.TestWatchNotify.test_aio_notify ... ERROR
2022-01-05T19:57:39.644 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_rados_init_error ... ok
2022-01-05T19:57:39.704 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_rados_init ... ok
2022-01-05T19:57:39.719 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_ioctx_context_manager ... ok
2022-01-05T19:57:39.729 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_parse_argv ... ok
2022-01-05T19:57:39.734 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_parse_argv_empty_str ... ok
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:======================================================================
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:ERROR: test_rados.TestWatchNotify.test_aio_notify
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:----------------------------------------------------------------------
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:Traceback (most recent call last):
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr: File "/usr/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr: self.test(*self.arg)
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: File "/home/ubuntu/cephtest/clone.client.0/src/test/pybind/test_rados.py", line 1533, in test_aio_notify
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: assert_raises(NotConnected, watch1.check)
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: File "/usr/lib64/python3.6/unittest/case.py", line 750, in assertRaises
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr: return context.handle('assertRaises', args, kwargs)
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr: File "/usr/lib64/python3.6/unittest/case.py", line 195, in handle
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr: callable_obj(*args, **kwargs)
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr: File "rados.pyx", line 2101, in rados.Watch.check
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr:rados.ObjectNotFound: [errno 2] RADOS object not found (check error)
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:----------------------------------------------------------------------
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:Ran 89 tests in 317.759s
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.740 INFO:tasks.workunit.client.0.smithi016.stderr:FAILED (errors=1)
2022-01-05T19:57:39.754 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-01-05T19:57:39.755 INFO:tasks.workunit:Stopping ['rados/test_python.sh'] on client.0...
2022-01-05T19:57:39.756 DEBUG:teuthology.orchestra.run.smithi016:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-01-05T19:57:40.032 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46/qa/tasks/workunit.py", line 135, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test_python.sh) on smithi016 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_python.sh'
Related issues
- Related to Bug #45721: CommandFailedError: Command failed (workunit test rados/test_python.sh) FAIL: test_rados.TestWatchNotify.test
- Related to Bug #47004: watch/notify may not function well after lingerOp failed
History
#1 Updated by Laura Flores almost 2 years ago
- Related to Bug #45721: CommandFailedError: Command failed (workunit test rados/test_python.sh) FAIL: test_rados.TestWatchNotify.test added
#2 Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598917
#3 Updated by Laura Flores over 1 year ago
/a/nojha-2022-04-13_16:47:41-rados-wip-yuri2-testing-2022-04-13-0703-distro-basic-smithi/6790486
#4 Updated by Nitzan Mordechai over 1 year ago
- Status changed from New to In Progress
- Assignee set to Nitzan Mordechai
#5 Updated by Nitzan Mordechai over 1 year ago
I hit another issue when socket failure injection is active while running the tests. I think this is not only a test issue; we also have a bug in the Python API.
The watch callback is called when a reconnect occurs, unlike in the C/C++ librados, which means that all the asserts that count notifications may hit errors.
In both error cases (the 1st in test_aio_notify and the 2nd in TestWatchNotify.test) we got a -2 error ("check error") that could be related to the reconnect, but I couldn't recreate it yet (I always get -107 for a disconnect).
Closing one watch shouldn't affect the other watch (I think), but it looks like we are closing both of them when we close the first one:
def close(self):
    """
    Unregister an interest in an object.

    :raises: :class:`Error`
    """
    if self.id == 0:
        return

    self.ioctx.require_ioctx_open()

    cdef:
        uint64_t _cookie = self.id

    with nogil:
        ret = rados_unwatch2(self.ioctx.io, _cookie)
    if ret < 0 and ret != -errno.ENOENT:
        raise make_ex(ret, "unwatch error")
    self.id = 0

    with nogil:
        cluster = rados_ioctx_get_cluster(self.ioctx.io)
        ret = rados_watch_flush(cluster);
    if ret < 0:
        raise make_ex(ret, "watch_flush error")

    self.ioctx.close()
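To make that concrete, here is a minimal sketch of the watch/notify pattern the test exercises. This is not the test's actual code; the pool name, object name, and ordering are assumptions. Two watches share a counting callback, so a callback fired on reconnect inflates the count, and once the watched object is gone check() is expected to raise NotConnected (the normalized -107), while the failing runs saw ObjectNotFound (-2) instead.

import rados

notifications = []

def notify_cb(notify_id, notifier_id, watch_id, data):
    # Record every delivery. If the Python binding also fires this on
    # reconnect (unlike C/C++ librados), the count below drifts.
    notifications.append((watch_id, data))

cluster = rados.Rados(conffile='')              # assumed local config
cluster.connect()
ioctx = cluster.open_ioctx('rbd')               # pool name is an assumption
ioctx.write_full('watched_obj', b'content')

watch1 = ioctx.watch('watched_obj', notify_cb)
watch2 = ioctx.watch('watched_obj', notify_cb)

ioctx.notify('watched_obj', 'ping')
assert len(notifications) == 2                  # expectation: one delivery per watch

ioctx.remove_object('watched_obj')
try:
    watch1.check()
except rados.NotConnected:
    pass    # the expected, normalized error (-107 / ENOTCONN)
except rados.ObjectNotFound:
    pass    # what the failed runs saw instead (-2 / ENOENT)

watch1.close()   # per the close() pasted above, this also calls self.ioctx.close(),
watch2.close()   # which is why closing one watch can affect the other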
#6 Updated by Radoslaw Zarzynski over 1 year ago
I think the lack of -2 (ENOENT) might be caused by the errno normalization that Objecter has.
#7 Updated by Laura Flores over 1 year ago
/a/yuriw-2022-06-18_00:01:31-rados-quincy-release-distro-default-smithi/6884838
#8 Updated by Laura Flores over 1 year ago
- Backport set to quincy,pacific
/a/yuriw-2022-06-21_16:28:27-rados-wip-yuri4-testing-2022-06-21-0704-pacific-distro-default-smithi/6889549
#9 Updated by Matan Breizman about 1 year ago
/a/yuriw-2022-11-23_21:36:17-rados-wip-yuri11-testing-2022-11-18-1506-distro-default-smithi/7089814/
#10 Updated by Radoslaw Zarzynski about 1 year ago
- Assignee changed from Nitzan Mordechai to Radoslaw Zarzynski
#11 Updated by Laura Flores 12 months ago
/a/yuriw-2022-12-06_15:43:07-rados-wip-yuri8-testing-2022-12-05-1031-pacific-distro-default-smithi/7105473
#12 Updated by Radoslaw Zarzynski 12 months ago
Hypothesis no. 1: the issue is fallout from 65d05fdd579d21dd57b72b1d9148380bc6074269 (PR https://github.com/ceph/ceph/pull/36694, merged in August 2020), where we started overriding last_error potentially multiple times BUT we normalize it only once:
diff --git a/src/osdc/Objecter.cc b/src/osdc/Objecter.cc
index 816d358ad12..8a87ec5892d 100644
--- a/src/osdc/Objecter.cc
+++ b/src/osdc/Objecter.cc
@@ -653,17 +653,17 @@ void Objecter::_linger_reconnect(LingerOp *info, bs::error_code ec)
 {
   ldout(cct, 10) << __func__ << " " << info->linger_id << " = " << ec
                  << " (last_error " << info->last_error << ")" << dendl;
+  std::unique_lock wl(info->watch_lock);
   if (ec) {
-    std::unique_lock wl(info->watch_lock);
     if (!info->last_error) {
       ec = _normalize_watch_error(ec);
-      info->last_error = ec;
       if (info->handle) {
         boost::asio::defer(finish_strand, CB_DoWatchError(this, info, ec));
       }
     }
-    wl.unlock();
   }
+
+  info->last_error = ec;
 }
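If that hypothesis holds, a second reconnect failure can overwrite an already-normalized last_error with the raw, un-normalized code. A toy Python model of the post-change control flow (the function names and the ENOENT-to-ENOTCONN mapping are my reading of _normalize_watch_error, not the actual Objecter code):

import errno

def normalize_watch_error(ec):
    # Reading of Objecter::_normalize_watch_error: a watched object that
    # disappeared (ENOENT) is reported to watchers as a disconnect (ENOTCONN).
    return errno.ENOTCONN if ec == errno.ENOENT else ec

def linger_reconnect(state, ec):
    # Toy model of the control flow after the diff above: only the *first*
    # error is normalized, but every call overwrites last_error.
    if ec and not state["last_error"]:
        ec = normalize_watch_error(ec)
    state["last_error"] = ec

state = {"last_error": 0}
linger_reconnect(state, errno.ENOENT)       # 1st failure: stored as ENOTCONN (-107)
linger_reconnect(state, errno.ENOENT)       # 2nd failure: normalization is skipped
print(state["last_error"] == errno.ENOENT)  # True -> check() surfaces -2 instead of -107

Under this model the -2 seen by watch1.check() would need at least two reconnect failures in a row, which would fit the msgr-failures/many socket failure injection used in this job.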
#13 Updated by Radoslaw Zarzynski 12 months ago
- Related to Bug #47004: watch/notify may not function well after lingerOp failed added
#14 Updated by Radoslaw Zarzynski 12 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 49462
#15 Updated by Aishwarya Mathuria 4 months ago
Seen in a Pacific run: /a/yuriw-2023-08-16_22:40:18-rados-wip-yuri2-testing-2023-08-16-1142-pacific-distro-default-smithi/7370810
#16 Updated by Radoslaw Zarzynski 3 months ago
- Priority changed from Normal to High
Let's keep an eye on the PR.
#17 Updated by Laura Flores 3 months ago
/a/yuriw-2023-09-01_19:14:47-rados-wip-batrick-testing-20230831.124848-pacific-distro-default-smithi/7386290
#18 Updated by Radoslaw Zarzynski 3 months ago
The fix is approved but is waiting for QA. Bumping this up.