Project

General

Profile

Actions

Bug #64438

open

NeoRadosWatchNotify.WatchNotifyTimeout times out along with FAILED ceph_assert(op->session)

Added by Laura Flores 3 months ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/lflores-2024-02-13_05:06:48-rados-wip-yuri5-testing-2024-02-12-1152-distro-default-smithi/7557540

2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotify.WatchNotifyTimeout
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: handle_error cookie 94179584839488 err (107) Transport endpoint is not connected
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: ./src/test/neorados/watch_notify.cc:147: Failure
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Expected equality of these values:
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:   1u
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:     Which is: 1
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:   watchers.size()
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:     Which is: 0
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Trying...
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: ./src/test/neorados/common_tests.h:132: Failure
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Value of: failed
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:   Actual: false
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Expected: true
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Operation did not error at all.
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Timed out.
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: ./src/test/neorados/watch_notify.cc:157: Failure
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Value of: rados().check_watch(handle)
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:   Actual: false
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Expected: true
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Flushing...
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: Flushed...
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: ./src/osdc/Objecter.cc: In function 'void Objecter::tick()' thread 7efcf37fe640 time 2024-02-13T06:59:28.948712+0000
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify: ./src/osdc/Objecter.cc: 2180: FAILED ceph_assert(op->session)
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  ceph version 19.0.0-1442-g3d2b0d7b (3d2b0d7b9d303971fd78b4efe33cf2a06e3a7a4d) squid (dev)
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x121) [0x7efd0f2e8b4d]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  2: /usr/lib/ceph/libceph-common.so.2(+0x180d01) [0x7efd0f2e8d01]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  3: (Objecter::tick()+0xb95) [0x55a7e32cdda5]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  4: (ceph::timer<ceph::coarse_mono_clock>::timer_thread()+0x1cf) [0x55a7e32f58ff]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b3) [0x7efd0f00e2b3]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  6: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7efd0ec97b43]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout:             watch_notify:  7: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7efd0ed29a00]
2024-02-13T06:59:29.013 INFO:tasks.ceph.osd.5.smithi169.stderr:2024-02-13T06:59:29.015+0000 7fe01a578640 -1 received  signal: Hangup from /usr/bin/python3 /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 5  (PID: 25246) UID: 0
2024-02-13T06:59:29.022 INFO:tasks.workunit.client.0.smithi121.stderr:bash: line 1: 38101 Aborted                 (core dumped) ceph_test_neorados_watch_notify 2>&1
2024-02-13T06:59:29.022 INFO:tasks.workunit.client.0.smithi121.stderr:     38102 Done                    | tee ceph_test_neorados_watch_notify.log
2024-02-13T06:59:29.022 INFO:tasks.workunit.client.0.smithi121.stderr:     38103 Done                    | sed "s/^/             watch_notify: /" 

The read balancer was enabled during this job (d-balancer/read), so it's worth seeing if that's somehow related.

Actions #1

Updated by Laura Flores 2 months ago

Continuing to look at this bug.

Actions #2

Updated by Laura Flores about 2 months ago

Fails here in the neorados test:

107 auto create_obj(neorados::RADOS& r, std::string_view oid,
108                 const neorados::IOContext& ioc,
109                 CompletionToken&& token)
110 { 
111   neorados::WriteOp op;
112   op.create(true);
113   return r.execute(oid, ioc, std::move(op),
114                    std::forward<CompletionToken>(token));
115 }
116 
117 /// \brief Expect one of several errors from a coroutine
118 ///
119 /// \param coro Awaitable coroutine
120 /// \param ec Valid errors
121 boost::asio::awaitable<void>
122 expect_error_code(auto&& coro, auto ...ecs) {
123   bool failed = false;
124   try {
125     co_await std::move(coro);
126   } catch (const boost::system::system_error& e) {
127     failed = true;
128     auto h = [c = e.code()](auto t) -> bool { return t == c; };
129     EXPECT_TRUE((h(ecs) || ...))
130       << "Got unexpected error code " << e.code().message() << ".";
131   }
132   EXPECT_TRUE(failed) << "Operation did not error at all.";
133   co_return;
134 }

And asserted here in the Objector code:

2167   // look for laggy requests
2168   auto cutoff = ceph::coarse_mono_clock::now();
2169   cutoff -= ceph::make_timespan(cct->_conf->objecter_timeout);  // timeout
2170 
2171   unsigned laggy_ops = 0;
2172 
2173   for (auto siter = osd_sessions.begin();
2174        siter != osd_sessions.end(); ++siter) {
2175     auto s = siter->second;
2176     scoped_lock l(s->lock);
2177     bool found = false;
2178     for (auto p = s->ops.begin(); p != s->ops.end(); ++p) {
2179       auto op = p->second;
2180       ceph_assert(op->session);
2181       if (op->stamp < cutoff) {
2182         ldout(cct, 2) << " tid " << p->first << " on osd." << op->session->osd
2183                       << " is laggy" << dendl;
2184         found = true;
2185         ++laggy_ops;
2186       }
2187     }

Pretty sure the assert is an effect of the op timing out.

Actions

Also available in: Atom PDF