Bug #44062
LibRadosWatchNotify.WatchNotify failure
Description
2020-02-09T22:40:12.646 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotify.WatchNotify
2020-02-09T22:40:12.646 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: watch_notify_test_cb
2020-02-09T22:40:12.646 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-441-g39dbd5e/rpm/el8/BUILD/ceph-15.1.0-441-g39dbd5e/src/test/librados/watch_notify.cc:95: Failure
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: Expected equality of these values:
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: 0
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: rados_notify(ioctx, "foo", 0, __null, 0)
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: Which is: -110
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotify.WatchNotify (61030 ms)
/a/sage-2020-02-09_21:18:03-rados-wip-sage2-testing-2020-02-09-1152-distro-basic-smithi/4749220
History
#1 Updated by Sage Weil over 3 years ago
- Status changed from New to Triaged
- Priority changed from Normal to Urgent
/a/sage-2020-02-11_20:49:48-rados-wip-sage-testing-2020-02-11-1121-distro-basic-smithi/4755080
#2 Updated by Neha Ojha over 3 years ago
- Assignee set to Brad Hubbard
Brad, can you please take a look at this?
#3 Updated by Brad Hubbard over 3 years ago
Sure Neha
#4 Updated by Brad Hubbard over 3 years ago
While trying to reproduce this I am seeing a lot of these warnings, which may or may not be related.
2020-02-13 02:01:11,261.261 INFO:tasks.ceph.mon.a.smithi103.stderr:==00:00:45:14.681 19436== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,302.302 INFO:tasks.ceph.mon.b.smithi109.stderr:==00:00:45:14.717 19197== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,302.302 INFO:tasks.ceph.mon.a.smithi103.stderr:==00:00:45:14.722 19436== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,306.306 INFO:tasks.ceph.mon.b.smithi109.stderr:==00:00:45:14.721 19197== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,317.317 INFO:tasks.ceph.mon.b.smithi109.stderr:==00:00:45:14.732 19197== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,319.319 INFO:tasks.ceph.mon.a.smithi103.stderr:==00:00:45:14.739 19436== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,319.319 INFO:tasks.ceph.mon.c.smithi103.stderr:==00:00:45:14.739 19439== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,338.338 INFO:tasks.ceph.mon.c.smithi103.stderr:==00:00:45:14.758 19439== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,353.353 INFO:tasks.ceph.mon.c.smithi103.stderr:==00:00:45:14.773 19439== Warning: unimplemented fcntl command: 1036
#5 Updated by Brad Hubbard over 3 years ago
Ah, that's right; from memory these warnings come from valgrind. Valgrind is also notorious for slowing things down substantially, and given that the error -110 is -ETIMEDOUT, is it likely that valgrind is slowing things to the point of causing timeouts?
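For reference, librados returns negative errno values, so -110 corresponds to errno 110. A quick sanity check of that mapping on Linux (an illustration only, not part of the test suite):

```python
import errno
import os

# On Linux, errno 110 is ETIMEDOUT, so a librados return of -110 means
# the operation timed out.
print(errno.ETIMEDOUT)               # 110
print(os.strerror(errno.ETIMEDOUT))  # e.g. "Connection timed out"
```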
#6 Updated by Brad Hubbard over 3 years ago
I can't reproduce this so far. If anyone can reproduce it reliably, maybe we could try increasing the notify timeout in the test, temporarily or permanently, to see whether it passes. This can be done by calling 'set_notify_timeout' on the IoCtxImpl or by increasing 'client_notify_timeout' in the context of the test.
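As a sketch of the second approach, the client-side timeout could be raised for the test run via a ceph.conf override (the option name is real; the value of 60 is an arbitrary illustration, not a recommendation):

```ini
# Hypothetical ceph.conf fragment for the test client.
# client_notify_timeout defaults to 10 seconds; 60 is an illustrative value.
[client]
    client_notify_timeout = 60
```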
#7 Updated by Brad Hubbard over 3 years ago
- Related to Bug #43861: ceph_test_rados_watch_notify hang added
#8 Updated by Brad Hubbard over 3 years ago
- Related to Bug #43808: segv in MonClient::handle_auth_done added
#9 Updated by Brad Hubbard over 3 years ago
Reproduced, and I saw #43808 while doing so, so I'm going to treat them as related for now at least.
I think we can also be pretty darn confident this is the same issue as #43861, as I see the same log entry.
In the teuthology log:
2020-02-13 01:23:03,685.685 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotify.WatchNotify
2020-02-13 01:23:03,685.685 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: watch_notify_test_cb
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-441-g39dbd5e/rpm/el8/BUILD/ceph-15.1.0-441-g39dbd5e/src/test/librados/watch_notify.cc:95: Failure
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: Expected equality of these values:
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: 0
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: rados_notify(ioctx, "foo", 0, __null, 0)
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: Which is: -110
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotify.WatchNotify (108985 ms)
In ceph_test_rados_watch_notify.log:
process_11_[23277]: starting.
process_11_[23277]: notifying object 0.obj
process_11_[23277]: file /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-441-g39dbd5e/rpm/el8/BUILD/ceph-15.1.0-441-g39dbd5e/src/test/system/st_rados_notify.cc, line 69: expected 0, got -110
process_11_[23277]: shutting down.
In /var/log/ceph/ceph-client.admin.23277.log:
2020-02-13T01:21:23.507+0000 7f1405107700 1 -- 172.21.15.103:0/4124318556 <== osd.1 v2:172.21.15.103:6800/20065 3 ==== watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3 ==== 42+0+24 (crc 0 0 0) 0x7f13fc0088a0 con 0x55be702ef360
2020-02-13T01:21:23.507+0000 7f1405107700 1 RefCountedObject::get 0x7f13fc0088a0 1 -> 2
2020-02-13T01:21:23.507+0000 7f1405107700 1 RefCountedObject::put 0x7f13fc0088a0 2 -> 1
2020-02-13T01:21:23.507+0000 7f1405107700 1 RefCountedObject::get 0x7f13fc0088a0 1 -> 2
2020-02-13T01:21:23.507+0000 7f1405107700 10 client.4560.objecter ms_dispatch 0x55be701a2560 watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3
2020-02-13T01:21:23.507+0000 7f1405107700 10 librados: finish completed notify (linger op 0x55be701bd5f0), r = -110
2020-02-13T01:21:23.507+0000 7f1405107700 1 RefCountedObject::put 0x7f13fc0088a0 2 -> 1
2020-02-13T01:21:23.507+0000 7f1405107700 1 RefCountedObject::put 0x7f13fc0088a0 1 -> 0
2020-02-13T01:21:23.507+0000 7f1405107700 1 RefCountedObject::put 0x55be702ef360 8 -> 7
2020-02-13T01:21:23.507+0000 7f1412a2e640 20 client.4560.objecter _linger_cancel linger_id=1
2020-02-13T01:21:23.507+0000 7f1412a2e640 20 client.4560.objecter put_session s=0x55be701c75f0 osd=1 3
2020-02-13T01:21:23.507+0000 7f1412a2e640 15 client.4560.objecter _session_linger_op_remove 1 1
Note that the notify_complete message with the return code of -110 is coming from osd.1.
Further down in the teuthology log:
2020-02-13 01:23:06,281.281 INFO:tasks.ceph.osd.1.smithi103.stderr:*** Caught signal (Segmentation fault) **
2020-02-13 01:23:06,282.282 INFO:tasks.ceph.osd.1.smithi103.stderr: in thread 127f7700 thread_name:msgr-worker-2
2020-02-13 01:23:06,402.402 INFO:tasks.ceph.osd.1.smithi103.stderr: ceph version 15.1.0-441-g39dbd5e (39dbd5e9e63508f2c4b59180b41f5b8bd27d749f) octopus (rc)
2020-02-13 01:23:06,402.402 INFO:tasks.ceph.osd.1.smithi103.stderr: 1: (()+0x12dc0) [0xc23fdc0]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 2: (MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x78) [0xf159b8]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 3: (ProtocolV2::handle_auth_done(ceph::buffer::v14_2_0::list&)+0x350) [0x10a7d40]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 4: (ProtocolV2::handle_frame_payload()+0x1e3) [0x10b6013]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 5: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x10b62c0]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 6: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x37d) [0x10b679d]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 7: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x109cdfc]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 8: (AsyncConnection::process()+0x8a9) [0x1073499]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0xec5d37]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 10: (()+0xdc331c) [0xecb31c]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 11: (()+0xc2b23) [0xcb67b23]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 12: (()+0x82de) [0xc2352de]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 13: (clone()+0x43) [0xd4d0133]
So osd.1 segfaults shortly after returning -ETIMEDOUT, which appears to be the cause of the test failures.
Also from /var/log/ceph/ceph-osd.1.log:
$ sudo grep 31cca700 /var/log/ceph/ceph-osd.1.log | grep "94276413019632.*ret \-110" -B50
2020-02-13T01:20:11.817+0000 31cca700 10 Notify(94894880428624,412316860416 watchers=1) timeout
2020-02-13T01:20:11.818+0000 31cca700 10 Notify(94894880428624,412316860416 watchers=1) maybe_complete_notify -- 1 in progress watchers
2020-02-13T01:20:11.821+0000 31cca700 1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/3272494552 -- watch-notify(notify_complete (2) cookie 94894880428624 notify 412316860416 ret -110) v3 -- 0xf995780 con 0x3cf550c0
2020-02-13T01:20:11.821+0000 31cca700 1 RefCountedObject::get 0x3cf550c0 14 -> 15
2020-02-13T01:20:11.822+0000 31cca700 5 PG::get 0x3d2b4c10 tag intptr 4 -> 5
2020-02-13T01:20:11.823+0000 31cca700 10 osd.1 pg_epoch: 100 pg[28.6s0( v 60'2 (0'0,60'2] local-lis/les=24/29 n=1 ec=24/24 lis/c=24/24 les/c/f=29/40/0 sis=24) [1,5,6]p1(0) r=0 lpr=24 crt=60'2 lcod 40'1 mlcod 40'1 active+clean] -- Watch(94894880601584,client.4431) cancel_notify 412316860416
2020-02-13T01:20:11.825+0000 31cca700 5 PG::put 0x3d2b4c10 tag intptr 5 -> 4
2020-02-13T01:20:11.827+0000 31cca700 1 RefCountedObject::put 0x3cf550c0 15 -> 14
2020-02-13T01:21:21.080+0000 31cca700 10 Notify(94276413019632,687194767361 watchers=1) timeout
2020-02-13T01:21:21.173+0000 31cca700 10 Notify(94276413019632,687194767361 watchers=1) maybe_complete_notify -- 1 in progress watchers
2020-02-13T01:21:21.173+0000 31cca700 1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/4124318556 -- watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3 -- 0xfa626e0 con 0x3d4aae70
Is there a problem with time here, or is everything going exceptionally slow?
osd.1 reports sending the notify_complete at 2020-02-13T01:21:21.173+0000, but the client only reports seeing it at 2020-02-13T01:21:23.507+0000, more than two seconds later.
We can see the notify come in at 2020-02-13T01:21:11.069+0000 with a timeout of 10s, and the timeout fire at 2020-02-13T01:21:21.080+0000.
$ grep 687194767361 /var/log/ceph/ceph-osd.1.log
2020-02-13T01:21:11.069+0000 38cd8700 10 osd.1 pg_epoch: 160 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean] do_osd_op_effects, notify notify(cookie 94276413019632 notify687194767361 10s)
2020-02-13T01:21:11.069+0000 38cd8700 10 osd.1 pg_epoch: 160 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean] -- Watch(94276413019632,client.4636) start_notify 687194767361
2020-02-13T01:21:11.070+0000 38cd8700 10 Notify(94276413019632,687194767361 watchers=0) start_watcher
2020-02-13T01:21:11.070+0000 38cd8700 1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/1036684621 -- watch-notify(notify (1) cookie 94276413019632 notify 687194767361 ret 0) v3 -- 0xfa626e0 con 0x2741b040
2020-02-13T01:21:11.083+0000 38cd8700 10 Notify(94276413019632,687194767361 watchers=1) maybe_complete_notify -- 1 in progress watchers
2020-02-13T01:21:21.080+0000 31cca700 10 Notify(94276413019632,687194767361 watchers=1) timeout
2020-02-13T01:21:21.173+0000 31cca700 10 Notify(94276413019632,687194767361 watchers=1) maybe_complete_notify -- 1 in progress watchers
2020-02-13T01:21:21.173+0000 31cca700 1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/4124318556 -- watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3 -- 0xfa626e0 con 0x3d4aae70
2020-02-13T01:21:21.185+0000 31cca700 10 osd.1 pg_epoch: 170 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean] -- Watch(94276413019632,client.4636) cancel_notify 687194767361
2020-02-13T01:21:21.685+0000 34cd0700 10 osd.1 pg_epoch: 171 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean] notify_ack 94276413019632,687194767361
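The gaps described above can be checked directly from the timestamps. A small sketch (the timestamps are copied from the logs quoted in this comment):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

def gap(a: str, b: str) -> float:
    """Seconds elapsed between two ISO-style log timestamps."""
    return (datetime.strptime(b, FMT) - datetime.strptime(a, FMT)).total_seconds()

# notify arrives on osd.1 -> timeout fires: close to the 10s client timeout
print(gap("2020-02-13T01:21:11.069+0000", "2020-02-13T01:21:21.080+0000"))  # 10.011

# osd.1 sends notify_complete -> client sees it: over two seconds later
print(gap("2020-02-13T01:21:21.173+0000", "2020-02-13T01:21:23.507+0000"))  # 2.334
```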
#10 Updated by Sage Weil over 3 years ago
/a/sage-2020-02-15_04:59:38-rados-wip-sage3-testing-2020-02-14-1951-distro-basic-smithi/4765960
#11 Updated by Sage Weil over 3 years ago
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: [ RUN ] LibRadosWatchNotifyECPP.WatchNotify
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: notify
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-789-g374fb53/rpm/el8/BUILD/ceph-15.1.0-789-g374fb53/src/test/librados/watch_notify_cxx.cc:118: Failure
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: Expected equality of these values:
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: 0
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: ioctx.notify("foo", 0, bl2)
2020-02-18T17:00:48.230 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: Which is: -110
2020-02-18T17:00:48.230 INFO:tasks.workunit.client.0.smithi123.stdout: api_watch_notify_pp: [ FAILED ] LibRadosWatchNotifyECPP.WatchNotify (102262 ms)
EC this time
/a/sage-2020-02-18_14:47:43-rados-wip-sage2-testing-2020-02-17-2124-distro-basic-smithi/4777572
#12 Updated by Brad Hubbard over 3 years ago
So the timeout, as previously mentioned, was 10 seconds, although osd_default_notify_timeout is 30 seconds by default. Josh helped me realise that the 10-second timeout is being set on the client by the client_notify_timeout default. I think that when we are running under valgrind, with peering etc. taking place, 10 seconds is just not long enough.
#13 Updated by Sage Weil over 3 years ago
/a/sage-2020-02-21_21:08:33-rados-wip-sage3-testing-2020-02-21-1218-distro-basic-smithi/4788714
2020-02-21T23:23:54.162 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: [ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: notify
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1064-ge762a09/rpm/el8/BUILD/ceph-15.1.0-1064-ge762a09/src/test/librados/watch_notify_cxx.cc:97: Failure
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: Expected equality of these values:
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: 0
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: ioctx.notify("foo", 0, bl2)
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: Which is: -110
2020-02-21T23:23:54.164 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: [ FAILED ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0, where GetParam() = "" (194783 ms)
#14 Updated by Sage Weil over 3 years ago
2020-02-25T17:05:19.155 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotify.WatchNotify
2020-02-25T17:05:19.155 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: watch_notify_test_cb
2020-02-25T17:05:19.155 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: watch_notify_test_cb
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1203-gc0c0508/rpm/el8/BUILD/ceph-15.1.0-1203-gc0c0508/src/test/librados/watch_notify.cc:95: Failure
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: Expected equality of these values:
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: 0
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: rados_notify(ioctx, "foo", 0, __null, 0)
2020-02-25T17:05:19.157 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: Which is: -110
2020-02-25T17:05:19.157 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotify.WatchNotify (109621 ms)
/a/sage-2020-02-25_15:51:04-rados-wip-sage2-testing-2020-02-25-0704-distro-basic-smithi/4801723
#15 Updated by Sage Weil over 3 years ago
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: [ RUN ] LibRadosWatchNotifyECPP.WatchNotify
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: notify
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1253-g6b7a962/rpm/el8/BUILD/ceph-15.1.0-1253-g6b7a962/src/test/librados/watch_notify_cxx.cc:118: Failure
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: Expected equality of these values:
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: 0
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: ioctx.notify("foo", 0, bl2)
2020-02-26T16:04:13.462 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: Which is: -110
2020-02-26T16:04:13.462 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: [ FAILED ] LibRadosWatchNotifyECPP.WatchNotify (173574 ms)
/a/sage-2020-02-26_08:10:43-rados-wip-sage2-testing-2020-02-25-2110-distro-basic-smithi/4804166
#16 Updated by Sage Weil over 3 years ago
/a/sage-2020-02-27_05:12:04-rados-wip-sage2-testing-2020-02-26-1925-distro-basic-smithi/4806157
#17 Updated by Sage Weil over 3 years ago
/a/sage-2020-03-10_16:51:17-rados-wip-sage3-testing-2020-03-10-1037-distro-basic-smithi/4844127
#18 Updated by Sage Weil over 3 years ago
/a/sage-2020-03-10_16:51:17-rados-wip-sage3-testing-2020-03-10-1037-distro-basic-smithi/4844006
I think this is failing partly because we have the OSD debug levels so high that the cluster can barely stay up. I was watching a cluster while the test was running and the osds were basically thrashing under the load.
Also note the completion times:
2020-02-26T16:04:13.462 INFO:tasks.workunit.client.0.smithi017.stdout: api_watch_notify_pp: [ FAILED ] LibRadosWatchNotifyECPP.WatchNotify (173574 ms)
2020-02-25T17:05:19.157 INFO:tasks.workunit.client.0.smithi175.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotify.WatchNotify (109621 ms)
2020-02-21T23:23:54.164 INFO:tasks.workunit.client.0.smithi050.stdout: api_watch_notify_pp: [ FAILED ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0, where GetParam() = "" (194783 ms)
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotify.WatchNotify (108985 ms)
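For comparison, every failed run above took far longer than the 10-second client notify timeout discussed earlier. A quick check of the durations reported by gtest:

```python
# Durations (ms) reported by gtest for the four failures quoted above.
durations_ms = [173574, 109621, 194783, 108985]

client_notify_timeout_ms = 10 * 1000  # the client-side default discussed above

# Even the fastest failing run exceeded the notify timeout by roughly 10x.
print(min(durations_ms) / client_notify_timeout_ms)
```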
#19 Updated by Sage Weil over 3 years ago
I think https://github.com/ceph/ceph/pull/33871 may help...
#20 Updated by Brad Hubbard over 3 years ago
- Status changed from Triaged to In Progress
- Pull request ID set to 33871
#21 Updated by Brad Hubbard over 3 years ago
- Status changed from In Progress to Resolved
#22 Updated by Brad Hubbard over 3 years ago
#23 Updated by Brad Hubbard over 3 years ago
- Status changed from Resolved to In Progress
- Pull request ID changed from 33871 to 34011
#24 Updated by Sage Weil over 3 years ago
- Status changed from In Progress to Resolved
#25 Updated by Brad Hubbard over 3 years ago
Seeing this in Nautilus so setting backport.
#26 Updated by Brad Hubbard over 3 years ago
- Status changed from Resolved to Pending Backport
- Backport set to nautilus
#27 Updated by Nathan Cutler over 3 years ago
- Copied to Backport #45224: nautilus: LibRadosWatchNotify.WatchNotify failure added
#28 Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".