Actions
Bug #64438
openNeoRadosWatchNotify.WatchNotifyTimeout times out along with FAILED ceph_assert(op->session)
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/a/lflores-2024-02-13_05:06:48-rados-wip-yuri5-testing-2024-02-12-1152-distro-default-smithi/7557540
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: [ RUN ] NeoRadosWatchNotify.WatchNotifyTimeout
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: handle_error cookie 94179584839488 err (107) Transport endpoint is not connected
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: ./src/test/neorados/watch_notify.cc:147: Failure
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Expected equality of these values:
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 1u
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Which is: 1
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: watchers.size()
2024-02-13T06:59:28.976 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Which is: 0
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Trying...
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: ./src/test/neorados/common_tests.h:132: Failure
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Value of: failed
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Actual: false
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Expected: true
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Operation did not error at all.
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Timed out.
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: ./src/test/neorados/watch_notify.cc:157: Failure
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Value of: rados().check_watch(handle)
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Actual: false
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Expected: true
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Flushing...
2024-02-13T06:59:28.977 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: Flushed...
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: ./src/osdc/Objecter.cc: In function 'void Objecter::tick()' thread 7efcf37fe640 time 2024-02-13T06:59:28.948712+0000
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: ./src/osdc/Objecter.cc: 2180: FAILED ceph_assert(op->session)
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: ceph version 19.0.0-1442-g3d2b0d7b (3d2b0d7b9d303971fd78b4efe33cf2a06e3a7a4d) squid (dev)
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x121) [0x7efd0f2e8b4d]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 2: /usr/lib/ceph/libceph-common.so.2(+0x180d01) [0x7efd0f2e8d01]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 3: (Objecter::tick()+0xb95) [0x55a7e32cdda5]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 4: (ceph::timer<ceph::coarse_mono_clock>::timer_thread()+0x1cf) [0x55a7e32f58ff]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b3) [0x7efd0f00e2b3]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 6: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7efd0ec97b43]
2024-02-13T06:59:28.978 INFO:tasks.workunit.client.0.smithi121.stdout: watch_notify: 7: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7efd0ed29a00]
2024-02-13T06:59:29.013 INFO:tasks.ceph.osd.5.smithi169.stderr:2024-02-13T06:59:29.015+0000 7fe01a578640 -1 received signal: Hangup from /usr/bin/python3 /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 5 (PID: 25246) UID: 0
2024-02-13T06:59:29.022 INFO:tasks.workunit.client.0.smithi121.stderr:bash: line 1: 38101 Aborted (core dumped) ceph_test_neorados_watch_notify 2>&1
2024-02-13T06:59:29.022 INFO:tasks.workunit.client.0.smithi121.stderr: 38102 Done | tee ceph_test_neorados_watch_notify.log
2024-02-13T06:59:29.022 INFO:tasks.workunit.client.0.smithi121.stderr: 38103 Done | sed "s/^/ watch_notify: /"
The read balancer was enabled during this job (d-balancer/read), so it's worth seeing if that's somehow related.
Updated by Laura Flores about 2 months ago
Fails here in the neorados test:
107 auto create_obj(neorados::RADOS& r, std::string_view oid,
108 const neorados::IOContext& ioc,
109 CompletionToken&& token)
110 {
111 neorados::WriteOp op;
112 op.create(true);
113 return r.execute(oid, ioc, std::move(op),
114 std::forward<CompletionToken>(token));
115 }
116
117 /// \brief Expect one of several errors from a coroutine
118 ///
119 /// \param coro Awaitable coroutine
120 /// \param ec Valid errors
121 boost::asio::awaitable<void>
122 expect_error_code(auto&& coro, auto ...ecs) {
123 bool failed = false;
124 try {
125 co_await std::move(coro);
126 } catch (const boost::system::system_error& e) {
127 failed = true;
128 auto h = [c = e.code()](auto t) -> bool { return t == c; };
129 EXPECT_TRUE((h(ecs) || ...))
130 << "Got unexpected error code " << e.code().message() << ".";
131 }
132 EXPECT_TRUE(failed) << "Operation did not error at all.";
133 co_return;
134 }
And asserted here in the Objector code:
2167 // look for laggy requests
2168 auto cutoff = ceph::coarse_mono_clock::now();
2169 cutoff -= ceph::make_timespan(cct->_conf->objecter_timeout); // timeout
2170
2171 unsigned laggy_ops = 0;
2172
2173 for (auto siter = osd_sessions.begin();
2174 siter != osd_sessions.end(); ++siter) {
2175 auto s = siter->second;
2176 scoped_lock l(s->lock);
2177 bool found = false;
2178 for (auto p = s->ops.begin(); p != s->ops.end(); ++p) {
2179 auto op = p->second;
2180 ceph_assert(op->session);
2181 if (op->stamp < cutoff) {
2182 ldout(cct, 2) << " tid " << p->first << " on osd." << op->session->osd
2183 << " is laggy" << dendl;
2184 found = true;
2185 ++laggy_ops;
2186 }
2187 }
Pretty sure the assert is an effect of the op timing out.
Actions