Bug #47719: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails

Added by Neha Ojha over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee: Brad Hubbard
Backport: pacific, octopus, nautilus
Regression: No
Severity: 3 - minor
Pull request ID: 39601

Description

2020-10-01T09:30:07.507 INFO:tasks.workunit.client.0.smithi168.stdout:         api_watch_notify: watch_notify2_test_errcb cookie 94195453383936 err -107
2020-10-01T09:30:07.507 INFO:tasks.workunit.client.0.smithi168.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6067-g72c9c667/rpm/el8/BUILD/ceph-16.0.0-6067-g72c9c667/src/test/librados/watch_notify.cc:593: Failure
2020-10-01T09:30:07.507 INFO:tasks.workunit.client.0.smithi168.stdout:         api_watch_notify: Expected equality of these values:
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout:         api_watch_notify:   -107
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout:         api_watch_notify:   rados_watch_check(ioctx, handle)
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout:         api_watch_notify:     Which is: -2
2020-10-01T09:30:07.508 INFO:tasks.workunit.client.0.smithi168.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.AioWatchDelete2 (1007 ms)

/a/teuthology-2020-10-01_07:01:02-rados-master-distro-basic-smithi/5486109


Related issues

Related to RADOS - Bug #48844: api_watch_notify: LibRadosWatchNotify.AioWatchDelete failed (Duplicate)
Related to RADOS - Bug #52319: LibRadosWatchNotify.WatchNotify2 fails (New)
Copied to RADOS - Backport #49565: pacific: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails (Resolved)
Copied to RADOS - Backport #49566: octopus: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails (Resolved)
Copied to RADOS - Backport #49567: nautilus: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails (Resolved)

History

#1 Updated by Kefu Chai over 3 years ago

2020-10-30T16:30:08.639 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.AioWatchDelete2
2020-10-30T16:30:08.639 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify: waiting up to 30 for disconnect notification ...
2020-10-30T16:30:08.639 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify: watch_notify2_test_errcb cookie 94393264912880 err -107
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6773-g8f8b93a9/rpm/el8/BUILD/ceph-16.0.0-6773-g8f8b93a9/src/test/librados/watch_notify.cc:592: Failure
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify: Expected equality of these values:
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify:   -107
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify:   rados_watch_check(ioctx, handle)
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify:     Which is: -2
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.AioWatchDelete2 (1016 ms)
2020-10-30T16:30:08.640 INFO:tasks.workunit.client.0.smithi188.stdout:         api_watch_notify: [----------] 10 tests from LibRadosWatchNotify (12515 ms total)

/a/kchai-2020-10-30_15:27:54-rados-wip-kefu-testing-2020-10-30-1822-distro-basic-smithi/5575340

#4 Updated by Neha Ojha about 3 years ago

  • Related to Bug #48844: api_watch_notify: LibRadosWatchNotify.AioWatchDelete failed added

#5 Updated by Neha Ojha about 3 years ago

  • Priority changed from Normal to High
  • Backport set to pacific

/a/teuthology-2021-02-17_03:31:03-rados-pacific-distro-basic-smithi/5889235

#6 Updated by Neha Ojha about 3 years ago

  • Assignee set to Brad Hubbard

Brad, can you please take a look at this one?

#7 Updated by Brad Hubbard about 3 years ago

Will do.

#8 Updated by Sage Weil about 3 years ago

/a/sage-2021-02-20_16:46:42-rados-wip-sage2-testing-2021-02-20-0942-distro-basic-smithi/5899129

#9 Updated by Brad Hubbard about 3 years ago

First, the relevant test code from src/test/librados/watch_notify.cc.

565 TEST_F(LibRadosWatchNotify, AioWatchDelete2) {
...
583   ASSERT_EQ(0, rados_remove(ioctx, notify_oid));
584   int left = 30;
585   std::cout << "waiting up to " << left << " for disconnect notification ..." 
586       << std::endl;
587   while (notify_err == 0 && --left) {
588     sleep(1);
589   }
590   ASSERT_TRUE(left > 0);
591   ASSERT_EQ(-ENOTCONN, notify_err);
592   ASSERT_EQ(-ENOTCONN, rados_watch_check(ioctx, handle));
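For context, notify_err is set by the watch error callback seen in the logs (watch_notify2_test_errcb), which is registered when the watch is established. A rough sketch of it, reconstructed from the log output rather than quoted verbatim:

static int notify_err = 0;

// Matches the rados_watcherrcb_t signature from librados.h; librados
// invokes it on watch errors such as the -ENOTCONN disconnect above.
static void watch_notify2_test_errcb(void *arg, uint64_t cookie, int err)
{
  std::cout << "watch_notify2_test_errcb cookie " << cookie
            << " err " << err << std::endl;
  notify_err = err;
}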

Let's compare the success case to the failure case.

First, the success case.

We delete "foo" and then receive the watch-notify disconnect with error ENOTCONN (107). Just after that, we submit a "[watch unwatch cookie 94169848035968]" to osd.1.

2021-02-22T00:32:33.742+0000 7f22b584d540 15 client.18355.objecter send_linger 25 register
...
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter ms_dispatch 0x55a5a04035e0 osd_op_reply(60 foo [delete] v1347'3 uv3 ondisk = 0) v8
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter in handle_osd_op_reply
2021-02-22T00:32:33.746+0000 7f22a5348700  7 client.18355.objecter handle_osd_op_reply 60 ondisk uv 3 in 23.1c attempt 0
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter  op 0 rval 0 len 0
2021-02-22T00:32:33.746+0000 7f22a5348700 15 client.18355.objecter handle_osd_op_reply completed tid 60
2021-02-22T00:32:33.746+0000 7f22a5348700 15 client.18355.objecter _finish_op 60
2021-02-22T00:32:33.746+0000 7f22a5348700 20 client.18355.objecter put_session s=0x55a5a056bee0 osd=1 5
2021-02-22T00:32:33.746+0000 7f22a5348700 15 client.18355.objecter _session_op_remove 1 60
2021-02-22T00:32:33.746+0000 7f22a5348700  5 client.18355.objecter 0 in flight
2021-02-22T00:32:33.746+0000 7f22b584d540 10 librados: Objecter returned from delete r=0
2021-02-22T00:32:33.746+0000 7f22a5348700  1 -- 172.21.15.152:0/3391094266 <== osd.1 v2:172.21.15.152:6801/35235 5 ==== watch-notify(disconnect (3) cookie 94169848035968 notify 0 ret 0) v3 ==== 42+0+0 (crc 0 0 0) 0x7f22a000f460 con 0x55a5a0576400
2021-02-22T00:32:33.746+0000 7f22a5348700 10 client.18355.objecter ms_dispatch 0x55a5a04035e0 watch-notify(disconnect (3) cookie 94169848035968 notify 0 ret 0) v3
2021-02-22T00:32:33.746+0000 7f229effd700 10 librados: handle_error cookie 94169848035968 err -107
2021-02-22T00:32:34.367+0000 7f228ffff700 10 monclient: tick
2021-02-22T00:32:34.367+0000 7f228ffff700 20 monclient: _check_auth_rotating not needed by client.admin
2021-02-22T00:32:34.413+0000 7f229d7fa700  1 -- 172.21.15.152:0/3391094266 <== mon.1 v2:172.21.15.152:3300/0 13 ==== osd_map(1348..1348 src has 1..1348) v4 ==== 3004+0+0 (secure 0 0 0) 0x7f2290045390 con 0x55a5a0542f60
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 client.18355.objecter ms_dispatch 0x55a5a04035e0 osd_map(1348..1348 src has 1..1348) v4
2021-02-22T00:32:34.413+0000 7f229d7fa700  3 client.18355.objecter handle_osd_map got epochs [1348,1348] > 1347
2021-02-22T00:32:34.413+0000 7f229d7fa700  3 client.18355.objecter handle_osd_map decoding incremental epoch 1348
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 client.18355.objecter  checking linger op 25
2021-02-22T00:32:34.413+0000 7f229d7fa700 20 client.18355.objecter _calc_target epoch 1348 base foo @23;test-rados-api-smithi152-260088-11 precalc_pgid 0 pgid 0.0 is_write
2021-02-22T00:32:34.413+0000 7f229d7fa700 20 client.18355.objecter _calc_target target foo @23;test-rados-api-smithi152-260088-11 -> pgid 23.89b396bc
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 client.18355.objecter _maybe_request_map subscribing (onetime) to next osd map
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 monclient: _renew_subs
2021-02-22T00:32:34.413+0000 7f229d7fa700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.152:3300/0
2021-02-22T00:32:34.413+0000 7f229d7fa700  1 -- 172.21.15.152:0/3391094266 --> [v2:172.21.15.152:3300/0,v1:172.21.15.152:6789/0] -- mon_subscribe({osdmap=1349}) v3 -- 0x7f228805cb00 con 0x55a5a0542f60
2021-02-22T00:32:34.413+0000 7f229d7fa700 20 client.18355.objecter dump_active .. 0 homeless
2021-02-22T00:32:34.746+0000 7f22b584d540 10 client.18355.objecter linger_check 25 err osd:107 age 1.003987432s

The final line is where we call rados_watch_check -> librados::IoCtxImpl::watch_check -> objecter->linger_check and get the expected error (107), so all is good.

Now, the failure case.

We do the same delete-and-wait cycle, but we are hit by an injected socket failure and issue a reconnect. The test of 'notify_err' succeeds but, because of the reconnect, we send a [watch reconnect cookie 94313265062576 gen 1] to osd.4, which responds with ENOENT (2), presumably because we have deleted the object. When the call to linger_check completes, it therefore returns the ENOENT error instead of the ENOTCONN expected by the test.

2021-02-21T02:09:44.696+0000 7f6b133b6540 15 client.16538.objecter send_linger 25 register
...
2021-02-21T02:09:44.698+0000 7f6b026b0700  0 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
2021-02-21T02:09:44.698+0000 7f6b026b0700  1 -- 172.21.15.175:0/418337470 <== osd.4 v2:172.21.15.91:6800/35267 10 ==== osd_op_reply(60 foo [delete] v1676'3 uv3 ondisk = 0) v8 ==== 147+0+0 (crc 0 0 0) 0x7f6af8000b80 con 0x55c704a864b0
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter ms_dispatch 0x55c70491c5e0 osd_op_reply(60 foo [delete] v1676'3 uv3 ondisk = 0) v8
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter in handle_osd_op_reply
2021-02-21T02:09:44.699+0000 7f6b026b0700  7 client.16538.objecter handle_osd_op_reply 60 ondisk uv 3 in 18.1e attempt 0
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter  op 0 rval 0 len 0
2021-02-21T02:09:44.699+0000 7f6b026b0700 15 client.16538.objecter handle_osd_op_reply completed tid 60
2021-02-21T02:09:44.699+0000 7f6b026b0700 15 client.16538.objecter _finish_op 60
2021-02-21T02:09:44.699+0000 7f6b026b0700 20 client.16538.objecter put_session s=0x55c7049449e0 osd=4 5
2021-02-21T02:09:44.699+0000 7f6b026b0700 15 client.16538.objecter _session_op_remove 4 60
2021-02-21T02:09:44.699+0000 7f6b026b0700  5 client.16538.objecter 0 in flight
2021-02-21T02:09:44.699+0000 7f6b133b6540 10 librados: Objecter returned from delete r=0
2021-02-21T02:09:44.699+0000 7f6b026b0700  1 -- 172.21.15.175:0/418337470 <== osd.4 v2:172.21.15.91:6800/35267 11 ==== watch-notify(disconnect (3) cookie 94313265062576 notify 0 ret 0) v3 ==== 42+0+0 (crc 0 0 0) 0x7f6af802c790 con 0x55c704a864b0
2021-02-21T02:09:44.699+0000 7f6b026b0700 10 client.16538.objecter ms_dispatch 0x55c70491c5e0 watch-notify(disconnect (3) cookie 94313265062576 notify 0 ret 0) v3
2021-02-21T02:09:44.699+0000 7f6b026b0700  1 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 20
2021-02-21T02:09:44.699+0000 7f6b016ae700 10 librados: handle_error cookie 94313265062576 err -107
2021-02-21T02:09:44.699+0000 7f6b026b0700  1 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-21T02:09:44.699+0000 7f6b026b0700  1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 0x55c704a907f0 crc :-1 s=READY pgs=1025 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2021-02-21T02:09:44.699+0000 7f6b026b0700  1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 0x55c704a907f0 crc :-1 s=READY pgs=1025 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-02-21T02:09:44.699+0000 7f6af2ffd700  1 client.16538.objecter ms_handle_reset 0x55c704a864b0 session 0x55c7049449e0 osd.4
2021-02-21T02:09:44.699+0000 7f6af2ffd700 10 client.16538.objecter reopen_session osd.4 session, addr now v2:172.21.15.91:6800/35267
2021-02-21T02:09:44.699+0000 7f6af2ffd700  1 -- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 msgr2=0x55c704a907f0 unknown :-1 s=STATE_CLOSED l=1).mark_down
2021-02-21T02:09:44.699+0000 7f6af2ffd700  1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x55c704a864b0 0x55c704a907f0 unknown :-1 s=CLOSED pgs=1025 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-02-21T02:09:44.699+0000 7f6af2ffd700  1 --2- 172.21.15.175:0/418337470 >> v2:172.21.15.91:6800/35267 conn(0x7f6ae80629d0 0x7f6ae806e6f0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2021-02-21T02:09:44.699+0000 7f6af2ffd700 15 client.16538.objecter send_linger 25 reconnect
...
2021-02-21T02:09:44.699+0000 7f6af2ffd700  1 -- 172.21.15.175:0/418337470 --> v2:172.21.15.91:6800/35267 -- osd_op(unknown.0.0:61 18.1e 18:79650060:test-rados-api-smithi175-313817-11::foo:head [watch reconnect cookie 94313265062576 gen 1] snapc 0=[] ondisk+write+known_if_redirected e1676) v8 -- 0x7f6ae806b770 con 0x7f6ae80629d0
...
2021-02-21T02:09:44.700+0000 7f6b026b0700  1 -- 172.21.15.175:0/418337470 <== osd.4 v2:172.21.15.91:6800/35267 1 ==== osd_op_reply(61 foo [watch reconnect cookie 94313265062576 gen 1] v1676'4 uv3 ondisk = -2 ((2) No such file or directory)) v8 ==== 147+0+0 (crc 0 0 0) 0x7f6af8000b80 con 0x7f6ae80629d0
...
2021-02-21T02:09:45.358+0000 7f6af2ffd700 10 client.16538.objecter  checking linger op 25
...
2021-02-21T02:09:45.699+0000 7f6b133b6540 10 client.16538.objecter linger_check 25 err osd:2 age 1.002983451s

The next step, of course, is to work out what to do about that. Could it be as simple as accepting either ENOTCONN or ENOENT from rados_watch_check() to cover both possibilities?
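A minimal sketch of that idea (the tolerant assertion below is my own illustration, not necessarily the exact change made in the PR): relax the final assertion so either error is accepted.

// Sketch only: an injected socket failure can turn the expected
// -ENOTCONN into -ENOENT, so accept either disconnect outcome.
int err = rados_watch_check(ioctx, handle);
ASSERT_TRUE(err == -ENOTCONN || err == -ENOENT)
    << "unexpected rados_watch_check result: " << err;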

#10 Updated by Brad Hubbard about 3 years ago

  • Backport changed from pacific to pacific, octopus, nautilus
  • Pull request ID set to 39601

#11 Updated by Brad Hubbard about 3 years ago

Hey Sage, I think you meant /a/sage-2021-02-20_16:46:42-rados-wip-sage2-testing-2021-02-20-0942-distro-basic-smithi/5899117/ ?

#12 Updated by Brad Hubbard about 3 years ago

If we are happy with https://github.com/ceph/ceph/pull/39601 in theory, perhaps we need to extend it to cover the other calls to rados_watch_check(), although the one on line 592 seems to be the only one that experiences this failure, for reasons as yet unexplained. One possible shape for that is sketched below.
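Purely illustrative (the helper name and the gtest predicate formulation are mine, not taken from the PR): a shared predicate used at every call site.

// Illustrative helper: treat either error as "the watch is gone".
static testing::AssertionResult watch_is_gone(rados_ioctx_t ioctx,
                                              uint64_t handle)
{
  int err = rados_watch_check(ioctx, handle);
  if (err == -ENOTCONN || err == -ENOENT)
    return testing::AssertionSuccess();
  return testing::AssertionFailure() << "rados_watch_check returned " << err;
}

Each call site then becomes ASSERT_TRUE(watch_is_gone(ioctx, handle)), which keeps the intent readable while tolerating the race.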

#13 Updated by Brad Hubbard about 3 years ago

  • Status changed from New to In Progress

#14 Updated by Sage Weil about 3 years ago

/a/sage-2021-02-24_21:26:56-rados-wip-sage-testing-2021-02-24-1457-distro-basic-smithi/5912284

#15 Updated by Sage Weil about 3 years ago

/a/sage-2021-02-27_17:50:29-rados-wip-sage2-testing-2021-02-27-0921-pacific-distro-basic-smithi/5919090

#16 Updated by Sage Weil about 3 years ago

/a/sage-2021-02-28_18:35:15-rados-wip-sage-testing-2021-02-28-1217-distro-basic-smithi/5921418

#17 Updated by Brad Hubbard about 3 years ago

  • Status changed from In Progress to Pending Backport

#18 Updated by Backport Bot about 3 years ago

  • Copied to Backport #49565: pacific: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails added

#19 Updated by Backport Bot about 3 years ago

  • Copied to Backport #49566: octopus: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails added

#20 Updated by Backport Bot about 3 years ago

  • Copied to Backport #49567: nautilus: api_watch_notify: LibRadosWatchNotify.AioWatchDelete2 fails added

#22 Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#23 Updated by Radoslaw Zarzynski about 2 years ago

  • Related to Bug #52319: LibRadosWatchNotify.WatchNotify2 fails added
