Bug #47719
api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails
Description
2020-10-01T09:30:07.507 INFO:tasks.workunit.client.0.smithi168.stdout: api_watch_notify: watch_notify2_test_errcb cookie 94195453383936 err -107
2020-10-01T09:30:07.507 INFO:tasks.workunit.client.0.smithi168.stdout: api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6067-g72c9c667/rpm/el8/BUILD/ceph-16.0.0-6067-g72c9c667/src/test/librados/watch_notify.cc:593: Failure
2020-10-01T09:30:07.507 INFO:tasks.workunit.client.0.smithi168.stdout: api_watch_notify: Expected equality of these values:
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout: api_watch_notify: -107
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout: api_watch_notify: rados_watch_check(ioctx, handle)
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout: api_watch_notify: Which is: -2
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotify.AioWatchDelete2 (1007 ms)
/a/teuthology-2020-10-01_07:01:02-rados-master-distro-basic-smithi/5486109
Related issues
History
#1 Updated by Kefu Chai over 3 years ago
2020-10-30T16:30:08.639 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotify.AioWatchDelete2
2020-10-30T16:30:08.639 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: waiting up to 30 for disconnect notification ...
2020-10-30T16:30:08.639 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: watch_notify2_test_errcb cookie 94393264912880 err -107
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6773-g8f8b93a9/rpm/el8/BUILD/ceph-16.0.0-6773-g8f8b93a9/src/test/librados/watch_notify.cc:592: Failure
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: Expected equality of these values:
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: -107
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: rados_watch_check(ioctx, handle)
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: Which is: -2
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotify.AioWatchDelete2 (1016 ms)
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout: api_watch_notify: [----------] 10 tests from LibRadosWatchNotify (12515 ms total)
/a/kchai-2020-10-30_15:27:54-rados-wip-kefu-testing-2020-10-30-1822-distro-basic-smithi/5575340
#4 Updated by Neha Ojha about 3 years ago
- Related to Bug #48844: api_watch_notify: LibRadosWatchNotify.AioWatchDelete failed added
#5 Updated by Neha Ojha about 3 years ago
- Priority changed from Normal to High
- Backport set to pacific
/a/teuthology-2021-02-17_03:31:03-rados-pacific-distro-basic-smithi/5889235
#6 Updated by Neha Ojha about 3 years ago
- Assignee set to Brad Hubbard
Brad, can you please take a look at this one?
#7 Updated by Brad Hubbard about 3 years ago
Will do.
#8 Updated by Sage Weil about 3 years ago
/a/sage-2021-02-20_16:46:42-rados-wip-sage2-testing-2021-02-20-0942-distro-basic-smithi/5899129
#9 Updated by Brad Hubbard about 3 years ago
First, the relevant test code from src/test/librados/watch_notify.cc.
565 TEST_F(LibRadosWatchNotify, AioWatchDelete2) {
...
583   ASSERT_EQ(0, rados_remove(ioctx, notify_oid));
584   int left = 30;
585   std::cout << "waiting up to " << left << " for disconnect notification ..."
586             << std::endl;
587   while (notify_err == 0 && --left) {
588     sleep(1);
589   }
590   ASSERT_TRUE(left > 0);
591   ASSERT_EQ(-ENOTCONN, notify_err);
592   ASSERT_EQ(-ENOTCONN, rados_watch_check(ioctx, handle));
Comparing the success case to the failure case.
First, the success case.
We delete "foo" and then receive the watch-notify disconnect with error ENOTCONN (107). Just after that, we submit a "[watch unwatch cookie 94169848035968]" to osd.1.
2021-02-22T00:32:33.742+0000 7f22b584d540 15 client.18355.objecter send_linger 25 register
...
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter ms_dispatch 0x55a5a04035e0 osd_op_reply(60 foo [delete] v1347'3 uv3 ondisk = 0) v8
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter in handle_osd_op_reply
2021-02-22T00:32:33.746+0000 7f22a5348700 7 client.18355.objecter handle_osd_op_reply 60 ondisk uv 3 in 23.1c attempt 0
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter op 0 rval 0 len 0
2021-02-22T00:32:33.746+0000 7f22a5348700 15 client.18355.objecter handle_osd_op_reply completed tid 60
2021-02-22T00:32:33.746+0000 7f22a5348700 15 client.18355.objecter _finish_op 60
2021-02-22T00:32:33.746+0000 7f22a5348700 20 client.18355.objecter put_session s=0x55a5a056bee0 osd=1 5
2021-02-22T00:32:33.746+0000 7f22a5348700 15 client.18355.objecter _session_op_remove 1 60
2021-02-22T00:32:33.746+0000 7f22a5348700 5 client.18355.objecter 0 in flight
2021-02-22T00:32:33.746+0000 7f22b584d540 10 librados: Objecter returned from delete r=0
2021-02-22T00:32:33.746+0000 7f22a5348700 1 -- 172.21.15.152:0/3391094266 <== osd.1 v2:172.21.15.152:6801/35235 5 ==== watch-notify(disconnect (3) cookie 94169848035968 notify 0 ret 0) v3 ==== 42+0+0 (crc 0 0 0) 0x7f22a000f460 con 0x55a5a0576400
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter ms_dispatch 0x55a5a04035e0 watch-notify(disconnect (3) cookie 94169848035968 notify 0 ret 0) v3
2021-02-22T00:32:33.746+0000 7f229effd700 10 librados: handle_error cookie 94169848035968 err -107
2021-02-22T00:32:34.367+0000 7f228ffff700 10 monclient: tick
2021-02-22T00:32:34.367+0000 7f228ffff700 20 monclient: _check_auth_rotating not needed by client.admin
2021-02-22T00:32:34.413+0000 7f229d7fa700 1 -- 172.21.15.152:0/3391094266 <== mon.1 v2:172.21.15.152:3300/0 13 ==== osd_map(1348..1348 src has 1..1348) v4 ==== 3004+0+0 (secure 0 0 0) 0x7f2290045390 con 0x55a5a0542f60
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 client.18355.objecter ms_dispatch 0x55a5a04035e0 osd_map(1348..1348 src has 1..1348) v4
2021-02-22T00:32:34.413+0000 7f229d7fa700 3 client.18355.objecter handle_osd_map got epochs [1348,1348] > 1347
2021-02-22T00:32:34.413+0000 7f229d7fa700 3 client.18355.objecter handle_osd_map decoding incremental epoch 1348
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 client.18355.objecter checking linger op 25
2021-02-22T00:32:34.413+0000 7f229d7fa700 20 client.18355.objecter _calc_target epoch 1348 base foo @23;test-rados-api-smithi152-260088-11 precalc_pgid 0 pgid 0.0 is_write
2021-02-22T00:32:34.413+0000 7f229d7fa700 20 client.18355.objecter _calc_target target foo @23;test-rados-api-smithi152-260088-11 -> pgid 23.89b396bc
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 client.18355.objecter _maybe_request_map subscribing (onetime) to next osd map
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 monclient: _renew_subs
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.152:3300/0
2021-02-22T00:32:34.413+0000 7f229d7fa700 1 -- 172.21.15.152:0/3391094266 --> [v2:172.21.15.152:3300/0,v1:172.21.15.152:6789/0] -- mon_subscribe({osdmap=1349}) v3 -- 0x7f228805cb00 con 0x55a5a0542f60
2021-02-22T00:32:34.413+0000 7f229d7fa700 20 client.18355.objecter dump_active .. 0 homeless
2021-02-22T00:32:34.746+0000 7f22b584d540 10 client.18355.objecter linger_check 25 err osd:107 age 1.003987432s
The final line is where we call rados_watch_check -> librados::IoCtxImpl::watch_check -> objecter->linger_check and get the expected error (107), so all is good.
Now the failure case.
We do the same delete-and-wait cycle, but we are hit by an injected socket failure and issue a reconnect. The test of 'notify_err' succeeds, but, because of the reconnect, we send a [watch reconnect cookie 94313265062576 gen 1] to osd.4 and it responds with ENOENT (2), presumably because we have deleted the object. When the call to linger_check completes, it therefore returns ENOENT instead of the ENOTCONN expected by the test.
2021-02-21T02:09:44.696+0000 7f6b133b6540 15 client.16538.objecter send_linger 25 register
...
2021-02-21T02:09:44.698+0000 7f6b026b0700 0 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2021-02-21T02:09:44.698+0000 7f6b026b0700 1 -- 172.21.15.175:0/418337470 <== osd.4 v2:172.21.15.91:6800/35267 10 ==== osd_op_reply(60 foo [delete] v1676'3 uv3 ondisk = 0) v8 ==== 147+0+0 (crc 0 0 0) 0x7f6af8000b80 con 0x55c704a864b0
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter ms_dispatch 0x55c70491c5e0 osd_op_reply(60 foo [delete] v1676'3 uv3 ondisk = 0) v8
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter in handle_osd_op_reply
2021-02-21T02:09:44.699+0000 7f6b026b0700 7 client.16538.objecter handle_osd_op_reply 60 ondisk uv 3 in 18.1e attempt 0
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter op 0 rval 0 len 0
2021-02-21T02:09:44.699+0000 7f6b026b0700 15 client.16538.objecter handle_osd_op_reply completed tid 60
2021-02-21T02:09:44.699+0000 7f6b026b0700 15 client.16538.objecter _finish_op 60
2021-02-21T02:09:44.699+0000 7f6b026b0700 20 client.16538.objecter put_session s=0x55c7049449e0 osd=4 5
2021-02-21T02:09:44.699+0000 7f6b026b0700 15 client.16538.objecter _session_op_remove 4 60
2021-02-21T02:09:44.699+0000 7f6b026b0700 5 client.16538.objecter 0 in flight
2021-02-21T02:09:44.699+0000 7f6b133b6540 10 librados: Objecter returned from delete r=0
2021-02-21T02:09:44.699+0000 7f6b026b0700 1 -- 172.21.15.175:0/418337470 <== osd.4 v2:172.21.15.91:6800/35267 11 ==== watch-notify(disconnect (3) cookie 94313265062576 notify 0 ret 0) v3 ==== 42+0+0 (crc 0 0 0) 0x7f6af802c790 con 0x55c704a864b0
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter ms_dispatch 0x55c70491c5e0 watch-notify(disconnect (3) cookie 94313265062576 notify 0 ret 0) v3
2021-02-21T02:09:44.699+0000 7f6b026b0700 1 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 20
2021-02-21T02:09:44.699+0000 7f6b016ae700 10 librados: handle_error cookie 94313265062576 err -107
2021-02-21T02:09:44.699+0000 7f6b026b0700 1 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-21T02:09:44.699+0000 7f6b026b0700 1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 0x55c704a907f0 crc :-1 s=READY pgs=1025 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-02-21T02:09:44.699+0000 7f6b026b0700 1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 0x55c704a907f0 crc :-1 s=READY pgs=1025 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-02-21T02:09:44.699+0000 7f6af2ffd700 1 client.16538.objecter ms_handle_reset 0x55c704a864b0 session 0x55c7049449e0 osd.4
2021-02-21T02:09:44.699+0000 7f6af2ffd700 10 client.16538.objecter reopen_session osd.4 session, addr now v2:172.21.15.91:6800/35267
2021-02-21T02:09:44.699+0000 7f6af2ffd700 1 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 unknown :-1 s=STATE_CLOSED l=1).mark_down
2021-02-21T02:09:44.699+0000 7f6af2ffd700 1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 0x55c704a907f0 unknown :-1 s=CLOSED pgs=1025 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-02-21T02:09:44.699+0000 7f6af2ffd700 1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x7f6ae80629d0 0x7f6ae806e6f0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2021-02-21T02:09:44.699+0000 7f6af2ffd700 15 client.16538.objecter send_linger 25 reconnect
...
2021-02-21T02:09:44.699+0000 7f6af2ffd700 1 -- 172.21.15.175:0/418337470 --> v2:172.21.15.91:6800/35267 -- osd_op(unknown.0.0:61 18.1e 18:79650060:test-rados-api-smithi175-313817-11::foo:head [watch reconnect cookie 94313265062576 gen 1] snapc 0=[] ondisk+write+known_if_redirected e1676) v8 -- 0x7f6ae806b770 con 0x7f6ae80629d0
...
2021-02-21T02:09:44.700+0000 7f6b026b0700 1 -- 172.21.15.175:0/418337470 <== osd.4 v2:172.21.15.91:6800/35267 1 ==== osd_op_reply(61 foo [watch reconnect cookie 94313265062576 gen 1] v1676'4 uv3 ondisk = -2 ((2) No such file or directory)) v8 ==== 147+0+0 (crc 0 0 0) 0x7f6af8000b80 con 0x7f6ae80629d0
...
2021-02-21T02:09:45.358+0000 7f6af2ffd700 10 client.16538.objecter checking linger op 25
...
2021-02-21T02:09:45.699+0000 7f6b133b6540 10 client.16538.objecter linger_check 25 err osd:2 age 1.002983451s
The next step, of course, is to work out what to do about that. Could it be as simple as checking whether rados_watch_check() returns ENOTCONN or ENOENT, to cover both possibilities?
#10 Updated by Brad Hubbard about 3 years ago
- Backport changed from pacific to pacific, octopus, nautilus
- Pull request ID set to 39601
#11 Updated by Brad Hubbard about 3 years ago
Hey Sage, I think you meant /a/sage-2021-02-20_16:46:42-rados-wip-sage2-testing-2021-02-20-0942-distro-basic-smithi/5899117/ ?
#12 Updated by Brad Hubbard about 3 years ago
If we are happy with https://github.com/ceph/ceph/pull/39601, in theory we may need to extend it to cover the other calls to rados_watch_check(), although the one on line 592 seems to be the only one that experiences this failure, for reasons as yet unexplained.
#13 Updated by Brad Hubbard about 3 years ago
- Status changed from New to In Progress
#14 Updated by Sage Weil about 3 years ago
/a/sage-2021-02-24_21:26:56-rados-wip-sage-testing-2021-02-24-1457-distro-basic-smithi/5912284
#15 Updated by Sage Weil about 3 years ago
/a/sage-2021-02-27_17:50:29-rados-wip-sage2-testing-2021-02-27-0921-pacific-distro-basic-smithi/5919090
#16 Updated by Sage Weil about 3 years ago
/a/sage-2021-02-28_18:35:15-rados-wip-sage-testing-2021-02-28-1217-distro-basic-smithi/5921418
#17 Updated by Brad Hubbard about 3 years ago
- Status changed from In Progress to Pending Backport
#18 Updated by Backport Bot about 3 years ago
- Copied to Backport #49565: pacific: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails added
#19 Updated by Backport Bot about 3 years ago
- Copied to Backport #49566: octopus: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails added
#20 Updated by Backport Bot about 3 years ago
- Copied to Backport #49567: nautilus: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails added
#21 Updated by Yuri Weinstein about 3 years ago
#22 Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
#23 Updated by Radoslaw Zarzynski about 2 years ago
- Related to Bug #52319: LibRadosWatchNotify.WatchNotify2 fails added