Project

General

Profile

Actions

Bug #64866

open

rados/test.sh: LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/1 failed

Added by Sridhar Seshasayee about 2 months ago. Updated about 1 month ago.

Status:
Fix Under Review
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587349

There was a similar failure reported in https://tracker.ceph.com/issues/47025. The PR associated
with it is merged and therefore this new tracker is raised for the new failure.

2024-03-09T20:29:24.634 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/1
2024-03-09T20:29:24.634 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: List watches
2024-03-09T20:29:24.634 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: notify2
2024-03-09T20:29:24.634 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: handle_notify cookie 94576533816384 notify_id 25035364368385 notifier_gid 45581
2024-03-09T20:29:24.634 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: handle_error cookie 94576533816384 err -107
2024-03-09T20:29:24.634 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: notify2 done
2024-03-09T20:29:24.634 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-1875-g831bc188/rpm/el9/BUILD/ceph-19.0.0-1875-g831bc188/src/test/librados/watch_notify_cxx.cc:400: Failure
2024-03-09T20:29:24.635 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: Expected equality of these values:
2024-03-09T20:29:24.635 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp:   1u
2024-03-09T20:29:24.635 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp:     Which is: 1
2024-03-09T20:29:24.635 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp:   reply_map.size()
2024-03-09T20:29:24.635 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp:     Which is: 0
2024-03-09T20:29:24.635 INFO:tasks.workunit.client.0.smithi005.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/1, where GetParam() = "cache" (16749 ms)
Actions #1

Updated by Radoslaw Zarzynski about 2 months ago

  • Assignee set to Nitzan Mordechai

Hi Nitzan! Would you mind taking a look?

Actions #2

Updated by Nitzan Mordechai about 2 months ago

  • Status changed from New to In Progress

the client log shows cookie 94576533816384

2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target epoch 5837 base foo @148;test-rados-api-smithi005-48191-19 precalc_pgid 0 pgid 0.0 is_write
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target target foo @149;test-rados-api-smithi005-48191-19 -> pgid 149.f0bcca4d
2024-03-09T20:29:21.122+0000 7f435f7fe640 10 client.45581.objecter  checking linger op 26
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target epoch 5837 base foo @148;test-rados-api-smithi005-48191-19 precalc_pgid 0 pgid 0.0 is_read
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target target foo @149;test-rados-api-smithi005-48191-19 -> pgid 149.f0bcca4d
2024-03-09T20:29:21.122+0000 7f435f7fe640 10 client.45581.objecter  checking op 78
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target epoch 5837 base foo @148;test-rados-api-smithi005-48191-19 precalc_pgid 0 pgid 0.0 is_read
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target target foo @149;test-rados-api-smithi005-48191-19 -> pgid 149.f0bcca4d
2024-03-09T20:29:21.122+0000 7f435f7fe640 10 client.45581.objecter  checking op 79
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target epoch 5837 base foo @148;test-rados-api-smithi005-48191-19 precalc_pgid 0 pgid 0.0 is_write
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target target foo @149;test-rados-api-smithi005-48191-19 -> pgid 149.f0bcca4d
2024-03-09T20:29:21.122+0000 7f435f7fe640 10 client.45581.objecter  checking op 80
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target epoch 5837 base foo @148;test-rados-api-smithi005-48191-19 precalc_pgid 0 pgid 0.0 is_write
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target target foo @149;test-rados-api-smithi005-48191-19 -> pgid 149.f0bcca4d
2024-03-09T20:29:21.122+0000 7f435f7fe640 10 client.45581.objecter  checking op 81
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target epoch 5837 base foo @148;test-rados-api-smithi005-48191-19 precalc_pgid 0 pgid 0.0 is_write
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter _calc_target target foo @149;test-rados-api-smithi005-48191-19 -> pgid 149.f0bcca4d
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter dump_active .. 0 homeless
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter 78   149.f0bcca4d    osd.6   foo     [notify-ack in=25b]
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter 79   149.f0bcca4d    osd.6   foo     [watch ping cookie 94576533816384]
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter 80   149.f0bcca4d    osd.6   foo     [watch ping cookie 94576533816384]
2024-03-09T20:29:21.122+0000 7f435f7fe640 20 client.45581.objecter 81   149.f0bcca4d    osd.6   foo     [watch ping cookie 94576533816384]
2024-03-09T20:29:21.268+0000 7f4370296640  1 -- 172.21.15.5:0/1247768562 <== osd.6 v2:172.21.15.191:6824/1548572424 17 ==== watch-notify(disconnect (3) cookie 94576533816384 notify 0 ret 0) v3 ==== 42+0+0 (crc 0 0 0) 0x7f4344005530 con 0x5604509ea950
2024-03-09T20:29:21.268+0000 7f4370296640 10 client.45581.objecter ms_dispatch 0x5604509cba00 watch-notify(disconnect (3) cookie 94576533816384 notify 0 ret 0) v3
2024-03-09T20:29:21.268+0000 7f4370296640  1 -- 172.21.15.5:0/1247768562 <== osd.6 v2:172.21.15.191:6824/1548572424 18 ==== watch-notify(notify_complete (2) cookie 94576533836096 notify 25035364368385 ret 0) v3 ==== 42+0+8 (crc 0 0 0) 0x7f4344005530 con 0x5604509ea950
2024-03-09T20:29:21.268+0000 7f436d030640 10 librados: handle_error cookie 94576533816384 err -107

checking osd.6 shows the reason of -107 - Timeout of the watch:

2024-03-09T20:29:18.473+0000 7fc83955c640 10 osd.6 pg_epoch: 5836 pg[149.1(unlocked)]  -- Watch(94576533816384,client.45581) HandleWatchTimeout

recreate it easily using notify_sleep in the unit-tes, we are not getting any error that say the watch was timeout. i wonder if that's the correct behavior.

Actions #3

Updated by Nitzan Mordechai about 1 month ago

after checking deeply, only watch_check will give us the timeout return code.

the client log also shows that we waited more than 13 seconds which caused the watch to time out and the following failure happened, the only fix we can do here is to increase the timeout.

Actions #4

Updated by Nitzan Mordechai about 1 month ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 56334
Actions

Also available in: Atom PDF