Project

General

Profile

Actions

Bug #44062

closed

LibRadosWatchNotify.WatchNotify failure

Added by Sage Weil over 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-02-09T22:40:12.646 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.WatchNotify
2020-02-09T22:40:12.646 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify: watch_notify_test_cb
2020-02-09T22:40:12.646 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-441-g39dbd5e/rpm/el8/BUILD/ceph-15.1.0-441-g39dbd5e/src/test/
librados/watch_notify.cc:95: Failure
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify: Expected equality of these values:
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify:   0
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify:   rados_notify(ioctx, "foo", 0, __null, 0)
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify:     Which is: -110
2020-02-09T22:40:12.647 INFO:tasks.workunit.client.0.smithi109.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify (61030 ms)

/a/sage-2020-02-09_21:18:03-rados-wip-sage2-testing-2020-02-09-1152-distro-basic-smithi/4749220

Related issues 3 (0 open3 closed)

Related to RADOS - Bug #43861: ceph_test_rados_watch_notify hangResolved

Actions
Related to Messengers - Bug #43808: segv in MonClient::handle_auth_doneResolvedRadoslaw Zarzynski

Actions
Copied to RADOS - Backport #45224: nautilus: LibRadosWatchNotify.WatchNotify failureResolvedNathan CutlerActions
Actions #1

Updated by Sage Weil over 4 years ago

  • Status changed from New to Triaged
  • Priority changed from Normal to Urgent

/a/sage-2020-02-11_20:49:48-rados-wip-sage-testing-2020-02-11-1121-distro-basic-smithi/4755080

Actions #2

Updated by Neha Ojha over 4 years ago

  • Assignee set to Brad Hubbard

Brad, can you please take a look at this?

Actions #3

Updated by Brad Hubbard over 4 years ago

Sure Neha

Actions #4

Updated by Brad Hubbard over 4 years ago

When trying to reproduce I am seeing a lot of these which may, or may not, be related.

2020-02-13 02:01:11,261.261 INFO:tasks.ceph.mon.a.smithi103.stderr:==00:00:45:14.681 19436== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,302.302 INFO:tasks.ceph.mon.b.smithi109.stderr:==00:00:45:14.717 19197== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,302.302 INFO:tasks.ceph.mon.a.smithi103.stderr:==00:00:45:14.722 19436== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,306.306 INFO:tasks.ceph.mon.b.smithi109.stderr:==00:00:45:14.721 19197== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,317.317 INFO:tasks.ceph.mon.b.smithi109.stderr:==00:00:45:14.732 19197== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,319.319 INFO:tasks.ceph.mon.a.smithi103.stderr:==00:00:45:14.739 19436== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,319.319 INFO:tasks.ceph.mon.c.smithi103.stderr:==00:00:45:14.739 19439== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,338.338 INFO:tasks.ceph.mon.c.smithi103.stderr:==00:00:45:14.758 19439== Warning: unimplemented fcntl command: 1036
2020-02-13 02:01:11,353.353 INFO:tasks.ceph.mon.c.smithi103.stderr:==00:00:45:14.773 19439== Warning: unimplemented fcntl command: 1036
Actions #5

Updated by Brad Hubbard over 4 years ago

Ah, that's right, from memory these Warnings are related to valgrind. Valgrind is also notorious for slowing things down substantially and, given that the error -110 is -ETIMEDOUT is it likely that Valgrind is slowing things to the point of causing timeouts?

Actions #6

Updated by Brad Hubbard over 4 years ago

I can't reproduce this so far. If anyone can reproduce it reliably maybe we could try increasing the notify timeout in the test temporarily (or permanently) to see if it passes? This can be done by calling 'set_notify_timeout' on the IoCtxImpl or increasing 'client_notify_timeout' in the context of the test.

Actions #7

Updated by Brad Hubbard over 4 years ago

  • Related to Bug #43861: ceph_test_rados_watch_notify hang added
Actions #8

Updated by Brad Hubbard over 4 years ago

  • Related to Bug #43808: segv in MonClient::handle_auth_done added
Actions #9

Updated by Brad Hubbard over 4 years ago

Reproduced and I see #43808 while doing so so I'm going to treat them as related for now at least.

I think we can also be pretty darn confident this is the same issue as #43861 as I see the same log entry.

In the teuthology log.

2020-02-13 01:23:03,685.685 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.WatchNotify
2020-02-13 01:23:03,685.685 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify: watch_notify_test_cb
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZ
E/huge/release/15.1.0-441-g39dbd5e/rpm/el8/BUILD/ceph-15.1.0-441-g39dbd5e/src/test/librados/watch_notify.cc:95: Failure
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify: Expected equality of these values:
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify:   0
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify:   rados_notify(ioctx, "foo", 0, __null, 0)
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify:     Which is: -110
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify (108985 ms)

In ceph_test_rados_watch_notify.log

process_11_[23277]: starting.
process_11_[23277]: notifying object 0.obj
process_11_[23277]: file /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-441-g39dbd5e/rpm/el8/BUILD/ceph-15.1.0-441-g39dbd5e/sr
c/test/system/st_rados_notify.cc, line 69: expected 0, got -110
process_11_[23277]: shutting down.

In /var/log/ceph/ceph-client.admin.23277.log

2020-02-13T01:21:23.507+0000 7f1405107700  1 -- 172.21.15.103:0/4124318556 <== osd.1 v2:172.21.15.103:6800/20065 3 ==== watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3 ==== 42+0+24 (crc 0 0 0) 0x7f13fc0088a0 con 0x55be702ef360
2020-02-13T01:21:23.507+0000 7f1405107700  1 RefCountedObject::get 0x7f13fc0088a0 1 -> 2
2020-02-13T01:21:23.507+0000 7f1405107700  1 RefCountedObject::put 0x7f13fc0088a0 2 -> 1
2020-02-13T01:21:23.507+0000 7f1405107700  1 RefCountedObject::get 0x7f13fc0088a0 1 -> 2
2020-02-13T01:21:23.507+0000 7f1405107700 10 client.4560.objecter ms_dispatch 0x55be701a2560 watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3
2020-02-13T01:21:23.507+0000 7f1405107700 10 librados: finish completed notify (linger op 0x55be701bd5f0), r = -110
2020-02-13T01:21:23.507+0000 7f1405107700  1 RefCountedObject::put 0x7f13fc0088a0 2 -> 1
2020-02-13T01:21:23.507+0000 7f1405107700  1 RefCountedObject::put 0x7f13fc0088a0 1 -> 0
2020-02-13T01:21:23.507+0000 7f1405107700  1 RefCountedObject::put 0x55be702ef360 8 -> 7
2020-02-13T01:21:23.507+0000 7f1412a2e640 20 client.4560.objecter _linger_cancel linger_id=1
2020-02-13T01:21:23.507+0000 7f1412a2e640 20 client.4560.objecter put_session s=0x55be701c75f0 osd=1 3
2020-02-13T01:21:23.507+0000 7f1412a2e640 15 client.4560.objecter _session_linger_op_remove 1 1

Note that the notify_complete message with the return code of -110 is coming from osd.1.

Further down in the teuthology log.

2020-02-13 01:23:06,281.281 INFO:tasks.ceph.osd.1.smithi103.stderr:*** Caught signal (Segmentation fault) **
2020-02-13 01:23:06,282.282 INFO:tasks.ceph.osd.1.smithi103.stderr: in thread 127f7700 thread_name:msgr-worker-2
2020-02-13 01:23:06,402.402 INFO:tasks.ceph.osd.1.smithi103.stderr: ceph version 15.1.0-441-g39dbd5e (39dbd5e9e63508f2c4b59180b41f5b8bd27d749f) octopus (rc)
2020-02-13 01:23:06,402.402 INFO:tasks.ceph.osd.1.smithi103.stderr: 1: (()+0x12dc0) [0xc23fdc0]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 2: (MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v14_2_0::list const&, CryptoKey*, std::__cxx11::basic_string
<char, std::char_traits<char>, std::allocator<char> >*)+0x78) [0xf159b8]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 3: (ProtocolV2::handle_auth_done(ceph::buffer::v14_2_0::list&)+0x350) [0x10a7d40]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 4: (ProtocolV2::handle_frame_payload()+0x1e3) [0x10b6013]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 5: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x10b62c0]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 6: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x37d) [0x10b679d]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 7: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x109cdfc]
2020-02-13 01:23:06,403.403 INFO:tasks.ceph.osd.1.smithi103.stderr: 8: (AsyncConnection::process()+0x8a9) [0x1073499]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0xec5d37]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 10: (()+0xdc331c) [0xecb31c]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 11: (()+0xc2b23) [0xcb67b23]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 12: (()+0x82de) [0xc2352de]
2020-02-13 01:23:06,404.404 INFO:tasks.ceph.osd.1.smithi103.stderr: 13: (clone()+0x43) [0xd4d0133]

So osd.1 segfaults shortly after returning -ETIMEDOUT which appears to be the cause of the test failures.

Also from /var/log/ceph/ceph-osd.1.log

$ sudo grep 31cca700 /var/log/ceph/ceph-osd.1.log | grep "94276413019632.*ret \-110" -B50
2020-02-13T01:20:11.817+0000 31cca700 10 Notify(94894880428624,412316860416  watchers=1) timeout
2020-02-13T01:20:11.818+0000 31cca700 10 Notify(94894880428624,412316860416  watchers=1) maybe_complete_notify -- 1 in progress watchers 
2020-02-13T01:20:11.821+0000 31cca700  1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/3272494552 -- watch-notify(notify_complete (2) cookie 94894880428624 notify 412316860416 ret -110) v3 -- 0xf995780 con 0x3cf550c0
2020-02-13T01:20:11.821+0000 31cca700  1 RefCountedObject::get 0x3cf550c0 14 -> 15
2020-02-13T01:20:11.822+0000 31cca700  5 PG::get 0x3d2b4c10 tag intptr 4 -> 5
2020-02-13T01:20:11.823+0000 31cca700 10 osd.1 pg_epoch: 100 pg[28.6s0( v 60'2 (0'0,60'2] local-lis/les=24/29 n=1 ec=24/24 lis/c=24/24 les/c/f=29/40/0 sis=24) [1,5,6]p1(0) r=0 lpr=24 crt=60'2 lcod 40'1 mlcod 40'1 active+clean]  -- Watch(94894880601584,client.4431) cancel_notify 412316860416
2020-02-13T01:20:11.825+0000 31cca700  5 PG::put 0x3d2b4c10 tag intptr 5 -> 4
2020-02-13T01:20:11.827+0000 31cca700  1 RefCountedObject::put 0x3cf550c0 15 -> 14
2020-02-13T01:21:21.080+0000 31cca700 10 Notify(94276413019632,687194767361  watchers=1) timeout
2020-02-13T01:21:21.173+0000 31cca700 10 Notify(94276413019632,687194767361  watchers=1) maybe_complete_notify -- 1 in progress watchers 
2020-02-13T01:21:21.173+0000 31cca700  1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/4124318556 -- watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3 -- 0xfa626e0 con 0x3d4aae70

Is there a problem with time here, or is everything going exceptionally slow?

osd.1 reports sending the the notify_complete at 2020-02-13T01:21:21.173+0000 but the client only reports seeing it at 2020-02-13T01:21:23.507+0000 more than two seconds later.

We can see the notify come in at 2020-02-13T01:21:11.069+0000 with a timeout of 10s and a timeout at 2020-02-13T01:21:21.080+0000.

$ grep 687194767361 /var/log/ceph/ceph-osd.1.log
2020-02-13T01:21:11.069+0000 38cd8700 10 osd.1 pg_epoch: 160 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean] do_osd_op_effects, notify notify(cookie 94276413019632 notify687194767361 10s)
2020-02-13T01:21:11.069+0000 38cd8700 10 osd.1 pg_epoch: 160 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean]  -- Watch(94276413019632,client.4636) start_notify 687194767361
2020-02-13T01:21:11.070+0000 38cd8700 10 Notify(94276413019632,687194767361  watchers=0) start_watcher
2020-02-13T01:21:11.070+0000 38cd8700  1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/1036684621 -- watch-notify(notify (1) cookie 94276413019632 notify 687194767361 ret 0) v3 -- 0xfa626e0 con 0x2741b040
2020-02-13T01:21:11.083+0000 38cd8700 10 Notify(94276413019632,687194767361  watchers=1) maybe_complete_notify -- 1 in progress watchers 
2020-02-13T01:21:21.080+0000 31cca700 10 Notify(94276413019632,687194767361  watchers=1) timeout
2020-02-13T01:21:21.173+0000 31cca700 10 Notify(94276413019632,687194767361  watchers=1) maybe_complete_notify -- 1 in progress watchers 
2020-02-13T01:21:21.173+0000 31cca700  1 -- [v2:172.21.15.103:6800/20065,v1:172.21.15.103:6801/20065] --> 172.21.15.103:0/4124318556 -- watch-notify(notify_complete (2) cookie 94276413019632 notify 687194767361 ret -110) v3 -- 0xfa626e0 con 0x3d4aae70
2020-02-13T01:21:21.185+0000 31cca700 10 osd.1 pg_epoch: 170 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean]  -- Watch(94276413019632,client.4636) cancel_notify 687194767361
2020-02-13T01:21:21.685+0000 34cd0700 10 osd.1 pg_epoch: 171 pg[44.1a( v 145'2 (0'0,145'2] local-lis/les=88/106 n=1 ec=88/88 lis/c=88/88 les/c/f=106/122/0 sis=88) [1,7] r=0 lpr=88 crt=145'2 lcod 124'1 mlcod 124'1 active+clean] notify_ack 94276413019632,687194767361
Actions #10

Updated by Sage Weil over 4 years ago

/a/sage-2020-02-15_04:59:38-rados-wip-sage3-testing-2020-02-14-1951-distro-basic-smithi/4765960

Actions #11

Updated by Sage Weil about 4 years ago

2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyECPP.WatchNotify
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp: notify
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-789-g374fb53/rpm/el8/BUILD/ceph-15.1.0-789-g374fb53/src/test/
librados/watch_notify_cxx.cc:118: Failure
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp: Expected equality of these values:
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp:   0
2020-02-18T17:00:48.229 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp:   ioctx.notify("foo", 0, bl2)
2020-02-18T17:00:48.230 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp:     Which is: -110
2020-02-18T17:00:48.230 INFO:tasks.workunit.client.0.smithi123.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyECPP.WatchNotify (102262 ms)

EC this time
/a/sage-2020-02-18_14:47:43-rados-wip-sage2-testing-2020-02-17-2124-distro-basic-smithi/4777572
Actions #12

Updated by Brad Hubbard about 4 years ago

So the timeout, as previously mentioned, was 10 seconds although osd_default_notify_timeout is 30 seconds by default. Josh helped me realise that a timeout of 10 seconds is being set on the client by the client_notify_timeout default. I think that when we are running under valgrind with peering, etc. taking place that 10 seconds is just not long enough.

Actions #13

Updated by Sage Weil about 4 years ago

/a/sage-2020-02-21_21:08:33-rados-wip-sage3-testing-2020-02-21-1218-distro-basic-smithi/4788714

2020-02-21T23:23:54.162 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp: notify
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1064-ge762a09/rpm/el8/BUILD/ceph-15.1.0-1064-ge762a09/src/test/librados/watch_notify_cxx.cc:97: Failure
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp: Expected equality of these values:
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp:   0
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp:   ioctx.notify("foo", 0, bl2)
2020-02-21T23:23:54.163 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp:     Which is: -110
2020-02-21T23:23:54.164 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0, where GetParam() = "" (194783 ms)

Actions #14

Updated by Sage Weil about 4 years ago

2020-02-25T17:05:19.155 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.WatchNotify
2020-02-25T17:05:19.155 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify: watch_notify_test_cb
2020-02-25T17:05:19.155 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify: watch_notify_test_cb
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1203-gc0c0508/rpm/el8/BUILD/ceph-15.1.0-120
3-gc0c0508/src/test/librados/watch_notify.cc:95: Failure
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify: Expected equality of these values:
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify:   0
2020-02-25T17:05:19.156 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify:   rados_notify(ioctx, "foo", 0, __null, 0)
2020-02-25T17:05:19.157 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify:     Which is: -110
2020-02-25T17:05:19.157 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify (109621 ms)

/a/sage-2020-02-25_15:51:04-rados-wip-sage2-testing-2020-02-25-0704-distro-basic-smithi/4801723
Actions #15

Updated by Sage Weil about 4 years ago

2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyECPP.WatchNotify
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp: notify
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1253-g6b7a962/rpm/el8/BUILD/ceph-15.1.0-1253-g6b7a962/src/tes
t/librados/watch_notify_cxx.cc:118: Failure
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp: Expected equality of these values:
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp:   0
2020-02-26T16:04:13.461 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp:   ioctx.notify("foo", 0, bl2)
2020-02-26T16:04:13.462 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp:     Which is: -110
2020-02-26T16:04:13.462 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyECPP.WatchNotify (173574 ms)

/a/sage-2020-02-26_08:10:43-rados-wip-sage2-testing-2020-02-25-2110-distro-basic-smithi/4804166
Actions #16

Updated by Sage Weil about 4 years ago

/a/sage-2020-02-27_05:12:04-rados-wip-sage2-testing-2020-02-26-1925-distro-basic-smithi/4806157

Actions #17

Updated by Sage Weil about 4 years ago

/a/sage-2020-03-10_16:51:17-rados-wip-sage3-testing-2020-03-10-1037-distro-basic-smithi/4844127

Actions #18

Updated by Sage Weil about 4 years ago

/a/sage-2020-03-10_16:51:17-rados-wip-sage3-testing-2020-03-10-1037-distro-basic-smithi/4844006

I think this is failing partly because we have the OSD debug levels so high that the cluster can barely stay up. I was watching a cluster while the test was running and the osds were basically thrashing under the load.

Also note the completion times

2020-02-26T16:04:13.462 INFO:tasks.workunit.client.0.smithi017.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyECPP.WatchNotify (173574 ms)
2020-02-25T17:05:19.157 INFO:tasks.workunit.client.0.smithi175.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify (109621 ms)
2020-02-21T23:23:54.164 INFO:tasks.workunit.client.0.smithi050.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/0, where GetParam() = "" (194783 ms)
2020-02-13 01:23:03,686.686 INFO:tasks.workunit.client.0.smithi103.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify (108985 ms)

Actions #20

Updated by Brad Hubbard about 4 years ago

  • Status changed from Triaged to In Progress
  • Pull request ID set to 33871
Actions #21

Updated by Brad Hubbard about 4 years ago

  • Status changed from In Progress to Resolved
Actions #23

Updated by Brad Hubbard about 4 years ago

  • Status changed from Resolved to In Progress
  • Pull request ID changed from 33871 to 34011
Actions #24

Updated by Sage Weil about 4 years ago

  • Status changed from In Progress to Resolved
Actions #26

Updated by Brad Hubbard about 4 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to nautilus
Actions #27

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #45224: nautilus: LibRadosWatchNotify.WatchNotify failure added
Actions #28

Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Also available in: Atom PDF