Bug #53789

CommandFailedError (rados/test_python.sh): "RADOS object not found" causes test_rados.TestWatchNotify.test_aio_notify to fail

Added by Laura Flores over 2 years ago. Updated 3 months ago.

Status: Pending Backport
Priority: High
Category: -
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: rados/basic/{ceph clusters/{fixed-2 openstack} mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{rhel_8} tasks/rados_python}

Failure Reason:

Command failed (workunit test rados/test_python.sh) on smithi016 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_python.sh'

/a/lflores-2022-01-05_19:04:35-rados-wip-lflores-mgr-rocksdb-distro-default-smithi/6596519

2022-01-05T19:57:39.635 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.TestWatchNotify.test_aio_notify ... ERROR
2022-01-05T19:57:39.644 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_rados_init_error ... ok
2022-01-05T19:57:39.704 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_rados_init ... ok
2022-01-05T19:57:39.719 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_ioctx_context_manager ... ok
2022-01-05T19:57:39.729 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_parse_argv ... ok
2022-01-05T19:57:39.734 INFO:tasks.workunit.client.0.smithi016.stderr:test_rados.test_parse_argv_empty_str ... ok
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:======================================================================
2022-01-05T19:57:39.735 INFO:tasks.workunit.client.0.smithi016.stderr:ERROR: test_rados.TestWatchNotify.test_aio_notify
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:----------------------------------------------------------------------
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:Traceback (most recent call last):
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:  File "/usr/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
2022-01-05T19:57:39.736 INFO:tasks.workunit.client.0.smithi016.stderr:    self.test(*self.arg)
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr:  File "/home/ubuntu/cephtest/clone.client.0/src/test/pybind/test_rados.py", line 1533, in test_aio_notify
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr:    assert_raises(NotConnected, watch1.check)
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr:  File "/usr/lib64/python3.6/unittest/case.py", line 750, in assertRaises
2022-01-05T19:57:39.737 INFO:tasks.workunit.client.0.smithi016.stderr:    return context.handle('assertRaises', args, kwargs)
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr:  File "/usr/lib64/python3.6/unittest/case.py", line 195, in handle
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr:    callable_obj(*args, **kwargs)
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr:  File "rados.pyx", line 2101, in rados.Watch.check
2022-01-05T19:57:39.738 INFO:tasks.workunit.client.0.smithi016.stderr:rados.ObjectNotFound: [errno 2] RADOS object not found (check error)
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:----------------------------------------------------------------------
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:Ran 89 tests in 317.759s
2022-01-05T19:57:39.739 INFO:tasks.workunit.client.0.smithi016.stderr:
2022-01-05T19:57:39.740 INFO:tasks.workunit.client.0.smithi016.stderr:FAILED (errors=1)
2022-01-05T19:57:39.754 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-01-05T19:57:39.755 INFO:tasks.workunit:Stopping ['rados/test_python.sh'] on client.0...
2022-01-05T19:57:39.756 DEBUG:teuthology.orchestra.run.smithi016:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-01-05T19:57:40.032 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46/qa/tasks/workunit.py", line 135, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test_python.sh) on smithi016 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=11ec2f2c963dded7e0cccf5a8c9afdde6d1c0f46 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_python.sh'


Related issues 4 (1 open, 3 closed)

Related to RADOS - Bug #45721: CommandFailedError: Command failed (workunit test rados/test_python.sh) FAIL: test_rados.TestWatchNotify.test (Resolved, Nitzan Mordechai)

Related to RADOS - Bug #47004: watch/notify may not function well after lingerOp failed (Resolved, shun song)

Copied to RADOS - Backport #64157: pacific: CommandFailedError (rados/test_python.sh): "RADOS object not found" causes test_rados.TestWatchNotify.test_aio_notify to fail (Rejected, Radoslaw Zarzynski)

Copied to RADOS - Backport #64158: quincy: CommandFailedError (rados/test_python.sh): "RADOS object not found" causes test_rados.TestWatchNotify.test_aio_notify to fail (New, Radoslaw Zarzynski)
Actions #1

Updated by Laura Flores over 2 years ago

  • Related to Bug #45721: CommandFailedError: Command failed (workunit test rados/test_python.sh) FAIL: test_rados.TestWatchNotify.test added
Actions #2

Updated by Laura Flores over 2 years ago

/a/yuriw-2022-01-06_15:50:38-rados-wip-yuri8-testing-2022-01-05-1411-distro-default-smithi/6598917

Actions #3

Updated by Laura Flores about 2 years ago

/a/nojha-2022-04-13_16:47:41-rados-wip-yuri2-testing-2022-04-13-0703-distro-basic-smithi/6790486

Actions #4

Updated by Nitzan Mordechai almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Nitzan Mordechai
Actions #5

Updated by Nitzan Mordechai almost 2 years ago

I hit another issue when socket failure injection is active while running the tests. I think this is not only a test issue; we also have a bug in the Python API: the watch error callback is called when a reconnect occurs, unlike in the C/C++ librados, which means that all the asserts that count notifications may hit errors.

In both error cases (the first in test_aio_notify, the second in TestWatchNotify.test) we got a -2 error from (check error), which could be related to the reconnect, but I couldn't reproduce it yet (I always get -107 for disconnect).

Closing one watch shouldn't affect the other watch (I think), but it looks like we are closing both of them when we close the first one.

def close(self):
    """
    Unregister an interest in an object.

    :raises: :class:`Error`
    """
    if self.id == 0:
        return

    self.ioctx.require_ioctx_open()

    cdef:
        uint64_t _cookie = self.id

    with nogil:
        ret = rados_unwatch2(self.ioctx.io, _cookie)
    if ret < 0 and ret != -errno.ENOENT:
        raise make_ex(ret, "unwatch error")
    self.id = 0

    with nogil:
        cluster = rados_ioctx_get_cluster(self.ioctx.io)
        ret = rados_watch_flush(cluster)
    if ret < 0:
        raise make_ex(ret, "watch_flush error")

    self.ioctx.close()  # <-- highlighted line: closes the shared ioctx
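To make the shared-resource problem concrete, here is a plain-Python model of the failure mode described above. Ioctx, Watch, and NotConnected are stand-ins written for this sketch, not the real librados bindings; only the "close() tears down the shared ioctx" behavior is modeled:

```python
class NotConnected(Exception):
    pass


class Ioctx:
    """Stand-in for rados.Ioctx: one I/O context shared by several watches."""
    def __init__(self):
        self.is_open = True

    def require_ioctx_open(self):
        if not self.is_open:
            raise NotConnected("ioctx closed")

    def close(self):
        self.is_open = False


class Watch:
    """Stand-in for rados.Watch with the suspect close() behavior quoted above."""
    def __init__(self, ioctx, watch_id):
        self.ioctx = ioctx
        self.id = watch_id

    def check(self):
        # A real check() would go to the OSD; here we only model the
        # "is my ioctx still usable?" part of the behavior.
        self.ioctx.require_ioctx_open()
        return 0

    def close(self):
        if self.id == 0:
            return
        self.ioctx.require_ioctx_open()
        self.id = 0
        self.ioctx.close()  # the suspect line: tears down the SHARED ioctx


ioctx = Ioctx()
watch1 = Watch(ioctx, 1)
watch2 = Watch(ioctx, 2)

watch1.close()           # only watch1 is closed explicitly...
try:
    watch2.check()       # ...but watch2 is broken too
    watch2_ok = True
except NotConnected:
    watch2_ok = False

print(watch2_ok)  # False: closing watch1 also disconnected watch2
```

In this model watch2.check() fails even though watch2 was never closed, mirroring why closing the first watch appears to close both.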
Actions #6

Updated by Radoslaw Zarzynski almost 2 years ago

I think the lack of -2 (ENOENT) might be caused by the errno normalization Objecter has.
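For readers unfamiliar with that normalization: Objecter::_normalize_watch_error (in src/osdc/Objecter.cc) translates ENOENT into ENOTCONN so that an object deletion and a disconnect look the same to the watcher. A minimal Python model of that mapping (a sketch, not the real C++):

```python
import errno


def normalize_watch_error(err):
    """Model of Objecter::_normalize_watch_error: an ENOENT seen during
    watch reconnect is reported as ENOTCONN, so the client sees a
    disconnect rather than "object not found"."""
    if err == -errno.ENOENT:
        return -errno.ENOTCONN
    return err


# -2 (ENOENT) becomes -ENOTCONN (-107 on Linux, the value the test
# usually observes as "disconnect"); other errors pass through unchanged.
print(normalize_watch_error(-errno.ENOENT))
print(normalize_watch_error(-errno.EPERM))
```

If a code path stores a watch error without going through this mapping, a raw -2 can reach the client where the test expects NotConnected, which is exactly the symptom in the traceback above.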

Actions #7

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-18_00:01:31-rados-quincy-release-distro-default-smithi/6884838

Actions #8

Updated by Laura Flores almost 2 years ago

  • Backport set to quincy,pacific

/a/yuriw-2022-06-21_16:28:27-rados-wip-yuri4-testing-2022-06-21-0704-pacific-distro-default-smithi/6889549

Actions #9

Updated by Matan Breizman over 1 year ago

/a/yuriw-2022-11-23_21:36:17-rados-wip-yuri11-testing-2022-11-18-1506-distro-default-smithi/7089814/

Actions #10

Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee changed from Nitzan Mordechai to Radoslaw Zarzynski
Actions #11

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-12-06_15:43:07-rados-wip-yuri8-testing-2022-12-05-1031-pacific-distro-default-smithi/7105473$

Actions #12

Updated by Radoslaw Zarzynski over 1 year ago

Hypothesis no. 1: the issue is fallout from 65d05fdd579d21dd57b72b1d9148380bc6074269 (PR https://github.com/ceph/ceph/pull/36694, merged in August 2020), where we started overriding last_error potentially multiple times BUT we normalize it only once:

diff --git a/src/osdc/Objecter.cc b/src/osdc/Objecter.cc
index 816d358ad12..8a87ec5892d 100644
--- a/src/osdc/Objecter.cc
+++ b/src/osdc/Objecter.cc
@@ -653,17 +653,17 @@ void Objecter::_linger_reconnect(LingerOp *info, bs::error_code ec)
 {
   ldout(cct, 10) << __func__ << " " << info->linger_id << " = " << ec 
                 << " (last_error " << info->last_error << ")" << dendl;
+  std::unique_lock wl(info->watch_lock);
   if (ec) {
-    std::unique_lock wl(info->watch_lock);
     if (!info->last_error) {
       ec = _normalize_watch_error(ec);
-      info->last_error = ec;
       if (info->handle) {
        boost::asio::defer(finish_strand, CB_DoWatchError(this, info, ec));
       }
     }
-    wl.unlock();
   }
+
+  info->last_error = ec;
 }
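A plain-Python model of this hypothesis (the class and method names are invented for the sketch; the real code is the C++ diff above): only the first error is normalized before being stored, but every subsequent reconnect error overwrites last_error raw, so a later ENOENT (-2) can leak out where ENOTCONN (-107) was expected.

```python
import errno


def normalize(err):
    # model of _normalize_watch_error: ENOENT -> ENOTCONN
    return -errno.ENOTCONN if err == -errno.ENOENT else err


class LingerOp:
    """Model of _linger_reconnect after the diff above."""
    def __init__(self):
        self.last_error = 0

    def on_reconnect_error(self, ec):
        # Only the FIRST error goes through normalization...
        if ec and not self.last_error:
            ec = normalize(ec)
        # ...but every error overwrites last_error, because the
        # `info->last_error = ec` assignment moved outside the `if`.
        self.last_error = ec


op = LingerOp()
op.on_reconnect_error(-errno.ENOTCONN)  # first failure: already -107, stored
op.on_reconnect_error(-errno.ENOENT)    # second failure: stored RAW, unnormalized
print(op.last_error)  # -2 leaks through instead of -107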
Actions #13

Updated by Radoslaw Zarzynski over 1 year ago

  • Related to Bug #47004: watch/notify may not function well after lingerOp failed added
Actions #14

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 49462
Actions #15

Updated by Aishwarya Mathuria 8 months ago

Seen in a Pacific run : /a/yuriw-2023-08-16_22:40:18-rados-wip-yuri2-testing-2023-08-16-1142-pacific-distro-default-smithi/7370810

Actions #16

Updated by Radoslaw Zarzynski 8 months ago

  • Priority changed from Normal to High

Let's keep an eye on the PR.

Actions #17

Updated by Laura Flores 8 months ago

/a/yuriw-2023-09-01_19:14:47-rados-wip-batrick-testing-20230831.124848-pacific-distro-default-smithi/7386290

Actions #18

Updated by Radoslaw Zarzynski 7 months ago

The fix is approved but is waiting for QA. Bumping this up.

Actions #19

Updated by Laura Flores 3 months ago

  • Tags set to test-failure
  • Status changed from Fix Under Review to Pending Backport
Actions #20

Updated by Backport Bot 3 months ago

  • Copied to Backport #64157: pacific: CommandFailedError (rados/test_python.sh): "RADOS object not found" causes test_rados.TestWatchNotify.test_aio_notify to fail added
Actions #21

Updated by Backport Bot 3 months ago

  • Copied to Backport #64158: quincy: CommandFailedError (rados/test_python.sh): "RADOS object not found" causes test_rados.TestWatchNotify.test_aio_notify to fail added
Actions #22

Updated by Backport Bot 3 months ago

  • Tags set to backport_processed