Bug #50042

rados/test.sh: api_watch_notify failures

Added by Neha Ojha almost 3 years ago. Updated 3 months ago.

Status: Resolved
Priority: High
Category: -
Target version: -
% Done: 100%
Source:
Tags: backport_processed
Backport: pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-28T23:40:45.706 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsECPP.ManifestPromoteRead (3018 ms)
2021-03-28T23:40:45.706 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsECPP.TrySetDedupTier
2021-03-28T23:40:45.706 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (3017 ms)
2021-03-28T23:40:45.706 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (231078 ms total)
2021-03-28T23:40:45.707 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp:
2021-03-28T23:40:45.707 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp: [----------] Global test environment tear-down
2021-03-28T23:40:45.707 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp: [==========] 71 tests from 4 test suites ran. (779076 ms total)
2021-03-28T23:40:45.707 INFO:tasks.workunit.client.0.smithi079.stdout:              api_tier_pp: [  PASSED  ] 71 tests.
2021-03-28T23:40:45.708 INFO:tasks.workunit.client.0.smithi079.stderr:+ exit 1
2021-03-28T23:40:45.708 INFO:tasks.workunit.client.0.smithi079.stderr:+ cleanup
2021-03-28T23:40:45.708 INFO:tasks.workunit.client.0.smithi079.stderr:+ pkill -P 38317
2021-03-28T23:40:45.738 DEBUG:teuthology.orchestra.run:got remote process result: 1
2021-03-28T23:40:45.739 INFO:tasks.workunit.client.0.smithi079.stderr:+ true
2021-03-28T23:40:45.739 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2021-03-28T23:40:45.740 DEBUG:teuthology.orchestra.run.smithi079:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2021-03-28T23:40:45.995 ERROR:teuthology.run_tasks:Saw exception from tasks.

rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/none mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-comp-zstd rados tasks/rados_api_tests validater/lockdep}

/a/sage-2021-03-28_19:04:26-rados-wip-sage2-testing-2021-03-28-0933-pacific-distro-basic-smithi/6007274

The other kind of failure ends in a timeout.

2021-03-27T19:40:24.700 INFO:tasks.workunit.client.1.smithi201.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsECPP.TrySetDedupTier
2021-03-27T19:40:24.700 INFO:tasks.workunit.client.1.smithi201.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (18106 ms)
2021-03-27T19:40:24.700 INFO:tasks.workunit.client.1.smithi201.stdout:              api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (370961 ms total)
2021-03-27T19:40:24.700 INFO:tasks.workunit.client.1.smithi201.stdout:              api_tier_pp:
2021-03-27T19:40:24.700 INFO:tasks.workunit.client.1.smithi201.stdout:              api_tier_pp: [----------] Global test environment tear-down
2021-03-27T19:40:24.700 INFO:tasks.workunit.client.1.smithi201.stdout:              api_tier_pp: [==========] 71 tests from 4 test suites ran. (1172543 ms total)
2021-03-27T19:40:24.700 INFO:tasks.workunit.client.1.smithi201.stdout:              api_tier_pp: [  PASSED  ] 71 tests.
...
2021-03-27T22:20:52.033 INFO:tasks.workunit.client.1.smithi201.stderr:++ cleanup
2021-03-27T22:20:52.034 INFO:tasks.workunit.client.1.smithi201.stderr:++ pkill -P 1701132
2021-03-27T22:20:52.041 INFO:tasks.mon_thrash.mon_thrasher:waiting for 1.0 secs before continuing thrashing
2021-03-27T22:20:52.109 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-03-27T22:20:52.109 INFO:tasks.workunit.client.1.smithi201.stderr:/home/ubuntu/cephtest/clone.client.1/qa/workunits/rados/test.sh: line 10: 1701585 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2021-03-27T22:20:52.109 INFO:tasks.workunit.client.1.smithi201.stderr:++ true
2021-03-27T22:20:52.110 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.1...
2021-03-27T22:20:52.110 DEBUG:teuthology.orchestra.run.smithi201:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.1 /home/ubuntu/cephtest/clone.client.1
2021-03-27T22:20:52.310 ERROR:teuthology.run_tasks:Saw exception from tasks.

upgrade:nautilus-x/parallel/{0-cluster/{openstack start} 1-ceph-install/nautilus 1.1-pg-log-overrides/short_pg_log 2-workload/{blogbench ec-rados-default rados_api rados_loadgenbig rgw_ragweed_prepare test_rbd_api test_rbd_python} 3-upgrade-sequence/upgrade-mon-osd-mds 4-octopus 5-final-workload/{blogbench rados-snaps-few-objects rados_loadgenmix rados_mon_thrash rbd_cls rbd_import_export rgw rgw_ragweed_check} mon_election/classic objectstore/filestore-xfs ubuntu_18.04}

/a/teuthology-2021-03-27_14:59:29-upgrade:nautilus-x-pacific-distro-basic-smithi/6004568


Related issues

Related to RADOS - Bug #50371: Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp New
Related to RADOS - Bug #50595: upgrade:nautilus-x-pacific: LibRadosService.StatusFormat failure Resolved
Related to RADOS - Bug #40119: api_tier_pp hung causing a dead job New
Related to RADOS - Bug #47025: rados/test.sh: api_watch_notify_pp LibRadosWatchNotifyECPP.WatchNotify failed Fix Under Review
Related to RADOS - Bug #51307: LibRadosWatchNotify.Watch2Delete fails Resolved
Related to RADOS - Bug #55001: rados/test.sh: Early exit right after LibRados global tests complete Resolved
Related to RADOS - Bug #45868: rados_api_tests: LibRadosWatchNotify.AioWatchNotify2 fails Resolved
Related to RADOS - Bug #45615: api_watch_notify_pp: LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1 fails Resolved
Duplicated by RADOS - Bug #49428: ceph_test_rados_api_snapshots fails with "rados_mon_command osd pool create failed with error -22" Duplicate
Copied to RADOS - Backport #58868: pacific: rados/test.sh: api_watch_notify failures Resolved
Copied to RADOS - Backport #58869: quincy: rados/test.sh: api_watch_notify failures Resolved

History

#1 Updated by Neha Ojha almost 3 years ago

  • Assignee set to Brad Hubbard

Brad, as per our discussion, I am assigning this to you for now.

#2 Updated by Brad Hubbard almost 3 years ago

Looking at /a/sage-2021-03-28_19:04:26-rados-wip-sage2-testing-2021-03-28-0933-pacific-distro-basic-smithi/6007274

2021-03-28T23:38:16.715 INFO:tasks.workunit.client.0.smithi079.stdout:error in api_io (38338)
2021-03-28T23:38:16.717 INFO:tasks.workunit.client.0.smithi079.stderr:+ echo 'error in api_io (38338)'

Which leads us to the following:

2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-1331-gdfb04146/rpm/el8/BUILD/ceph-16.1.0-1331-gdfb04146/src/test/librados/TestCase.cc:161: Failure
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io: Expected equality of these values:
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io:   "" 
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io:   create_one_ec_pool(pool_name, &s_cluster)
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io:     Which is: "rados_mon_command osd pool create failed with error -22" 
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io: [ RUN      ] LibRadosIoEC.SimpleWrite
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.1.0-1331-gdfb04146/rpm/el8/BUILD/ceph-16.1.0-1331-gdfb04146/src/test/librados/TestCase.cc:172: Failure
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io: Expected equality of these values:
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io:   0
2021-03-28T23:27:50.813 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io:   rados_ioctx_create(cluster, pool_name.c_str(), &ioctx)
2021-03-28T23:27:50.814 INFO:tasks.workunit.client.0.smithi079.stdout:                   api_io:     Which is: -107
2021-03-28T23:27:50.837 INFO:tasks.workunit.client.0.smithi079.stderr:bash: line 1: 38343 Segmentation fault      (core dumped) ceph_test_rados_api_io 2>&1
2021-03-28T23:27:50.837 INFO:tasks.workunit.client.0.smithi079.stderr:     38344 Done                    | tee ceph_test_rados_api_io.log
2021-03-28T23:27:50.837 INFO:tasks.workunit.client.0.smithi079.stderr:     38345 Done                    | sed "s/^/                   api_io: /" 
$ zgrep "creation canceled" ./remote/smithi079/log/ceph-mon.a.log.gz
2021-03-28T23:27:49.929+0000 7f21065fe700 20 mon.a@0(leader).osd e21 erasure code profile rm testprofile-test-rados-api-smithi079-38343-15: creation canceled

Based on that, this looks like it is probably https://tracker.ceph.com/issues/49428.
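
For reference, here is a rough sketch of what the failing helper (create_one_ec_pool) appears to do via librados mon commands, and where the "creation canceled" race would bite. This is my own approximation, not the test source: the pool/profile names, k/m values, and JSON field names are assumptions based on the mon command schema and the error strings quoted above.

#include <rados/librados.hpp>
#include <iostream>
#include <string>

int main() {
  librados::Rados cluster;
  if (cluster.init("admin") < 0 || cluster.conf_read_file(nullptr) < 0 ||
      cluster.connect() < 0) {
    std::cerr << "failed to connect" << std::endl;
    return 1;
  }

  // Names are illustrative; the real helper derives them from host and pid.
  const std::string pool = "test-rados-api-example";
  const std::string profile = "testprofile-" + pool;
  librados::bufferlist inbl, outbl;
  std::string outs;

  // 1. Create a per-test erasure-code profile.
  cluster.mon_command(
      "{\"prefix\": \"osd erasure-code-profile set\", \"name\": \"" + profile +
      "\", \"profile\": [\"k=2\", \"m=1\", \"crush-failure-domain=osd\"]}",
      inbl, &outbl, &outs);

  // 2. Create an EC pool that references the profile. If the profile's pending
  //    creation has been canceled in the meantime (the "creation canceled" line
  //    in the mon log above), this call can fail with -22 (EINVAL), which is
  //    the error string the test reports.
  int r = cluster.mon_command(
      "{\"prefix\": \"osd pool create\", \"pool\": \"" + pool +
      "\", \"pool_type\": \"erasure\", \"erasure_code_profile\": \"" + profile + "\"}",
      inbl, &outbl, &outs);
  std::cerr << "osd pool create returned " << r << " (" << outs << ")" << std::endl;

  cluster.shutdown();
  return 0;
}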

#3 Updated by Brad Hubbard almost 3 years ago

Looking at /a/teuthology-2021-03-27_14:59:29-upgrade:nautilus-x-pacific-distro-basic-smithi/6004568

2021-03-27T22:20:52.109 INFO:tasks.workunit.client.1.smithi201.stderr:/home/ubuntu/cephtest/clone.client.1/qa/workunits/rados/test.sh: line 10: 1701585 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 

Looking backwards for 1701585.

2021-03-27T19:20:52.216 INFO:tasks.workunit.client.1.smithi201.stderr:+ pid=1701585
2021-03-27T19:20:52.216 INFO:tasks.workunit.client.1.smithi201.stdout:test api_service on pid 1701585
2021-03-27T19:20:52.217 INFO:tasks.workunit.client.1.smithi201.stderr:+ echo 'test api_service on pid 1701585' 

This test adds 16 'portal' services via rados_service_register and then checks status command output for the following.

"16 portals active (1 hosts, 3 zones)"

It retries 5 times and then asserts.

It seems like it enters the retry loop and then just stops.
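
For context, a minimal sketch of that register-then-poll pattern using the C++ librados API. This is my approximation, not the actual ceph_test_rados_api_service code: the daemon name, metadata, and poll interval are made up, and the real test registers 16 daemons from separate cluster handles, retries 5 times, and then asserts on the status output.

#include <rados/librados.hpp>
#include <chrono>
#include <iostream>
#include <map>
#include <string>
#include <thread>

int main() {
  librados::Rados cluster;
  if (cluster.init("admin") < 0 || cluster.conf_read_file(nullptr) < 0 ||
      cluster.connect() < 0) {
    std::cerr << "failed to connect" << std::endl;
    return 1;
  }

  // Register this process as one daemon of service "portal" (name illustrative).
  std::map<std::string, std::string> metadata = {{"foo", "bar"}};
  cluster.service_daemon_register("portal", "portal.0", metadata);

  // Poll cluster status a few times; the service map is propagated
  // asynchronously by the mgr, so the new daemon may take a while to appear.
  for (int attempt = 0; attempt < 5; ++attempt) {
    librados::bufferlist inbl, outbl;
    std::string outs;
    cluster.mon_command("{\"prefix\": \"status\"}", inbl, &outbl, &outs);
    std::string status(outbl.c_str(), outbl.length());
    if (status.find("portals active") != std::string::npos) {
      std::cout << "service visible in status output" << std::endl;
      break;
    }
    std::this_thread::sleep_for(std::chrono::seconds(5));
  }

  cluster.shutdown();
  return 0;
}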

The following is just to make note of the apparent service map update timeline.

2021-03-27T19:20:57.662+0000 7f453023b700  1 -- [v2:172.21.15.3:6800/17228,v1:172.21.15.3:6801/17228] <== client.314839 172.21.15.201:0/392142730 1 ==== mgropen(foo.rbd/image0 daemon) v3 ==== 68756+0+0 (secure 0 0 0) 0x5597b3da1180 con 0x5597b3694800
2021-03-27T19:20:57.662+0000 7f453023b700 10 mgr.server handle_open from foo.rbd/image0 172.21.15.201:0/392142730
2021-03-27T19:20:57.662+0000 7f453023b700  4 mgr.server handle_open constructing new DaemonState for foo.rbd/image0
2021-03-27T19:20:57.662+0000 7f453023b700 10 mgr.server handle_open registering foo.rbd/image0 in pending_service_map
2021-03-27T19:20:57.662+0000 7f453023b700  1 -- [v2:172.21.15.3:6800/17228,v1:172.21.15.3:6801/17228] <== client.314839 172.21.15.201:0/392142730 2 ==== mgrreport(foo.rbd/image0 +4-0 packed 38 status=0 task_status=0) v9 ==== 325+0+0 (secure 0 0 0) 0x5597b60e2d80 con 0x5597b3694800
2021-03-27T19:20:57.662+0000 7f453023b700 10 mgr.server handle_report from 0x5597b3694800 foo.rbd/image0
2021-03-27T19:20:57.662+0000 7f453023b700 20 mgr.server handle_report updating existing DaemonState for foo.rbd/image0
2021-03-27T19:20:57.662+0000 7f453023b700 10 mgr.server update_task_status got task status from foo.rbd/image0
2021-03-27T19:20:59.538+0000 7f452f239700 10 mgr.server operator() sending service_map e159
2021-03-27T19:21:02.662+0000 7f453023b700  1 -- [v2:172.21.15.3:6800/17228,v1:172.21.15.3:6801/17228] <== client.314839 172.21.15.201:0/392142730 3 ==== mgrreport(foo.rbd/image0 +0-0 packed 38) v9 ==== 86+0+0 (secure 0 0 0) 0x5597b5280340 con 0x5597b3694800
2021-03-27T19:21:02.662+0000 7f453023b700 10 mgr.server handle_report from 0x5597b3694800 foo.rbd/image0
2021-03-27T19:21:02.662+0000 7f453023b700 20 mgr.server handle_report updating existing DaemonState for foo.rbd/image0
2021-03-27T19:21:07.662+0000 7f453023b700  1 -- [v2:172.21.15.3:6800/17228,v1:172.21.15.3:6801/17228] <== client.314839 172.21.15.201:0/392142730 4 ==== mgrreport(foo.rbd/image0 +0-0 packed 38) v9 ==== 86+0+0 (secure 0 0 0) 0x5597b400cd00 con 0x5597b3694800
2021-03-27T19:21:07.662+0000 7f453023b700 10 mgr.server handle_report from 0x5597b3694800 foo.rbd/image0
2021-03-27T19:21:07.662+0000 7f453023b700 20 mgr.server handle_report updating existing DaemonState for foo.rbd/image0
2021-03-27T19:21:11.050+0000 7f454a0b0700  1 -- 172.21.15.3:0/17228 <== mon.1 v2:172.21.15.64:3300/0 105 ==== service_map(e157 3 svc) v1 ==== 2398+0+0 (secure 0 0 0) 0x5597b5e60400 con 0x5597b3dbf800
2021-03-27T19:21:11.050+0000 7f454a0b0700 10 mgr ms_dispatch2 active service_map(e157 3 svc) v1
2021-03-27T19:21:11.050+0000 7f454a0b0700 10 mgr ms_dispatch2 service_map(e157 3 svc) v1
2021-03-27T19:21:11.050+0000 7f454a0b0700 10 mgr notify_all notify_all: notify_all service_map
2021-03-27T19:21:11.174+0000 7f454a0b0700  1 -- 172.21.15.3:0/17228 <== mon.1 v2:172.21.15.64:3300/0 106 ==== service_map(e157 3 svc) v1 ==== 2398+0+0 (secure 0 0 0) 0x5597b5e60580 con 0x5597b3dbf800
2021-03-27T19:21:11.174+0000 7f454a0b0700 10 mgr ms_dispatch2 active service_map(e157 3 svc) v1
2021-03-27T19:21:11.174+0000 7f454a0b0700 10 mgr ms_dispatch2 service_map(e157 3 svc) v1
2021-03-27T19:21:11.174+0000 7f454a0b0700 10 mgr notify_all notify_all: notify_all service_map
2021-03-27T19:21:12.054+0000 7f454a0b0700  1 -- 172.21.15.3:0/17228 <== mon.1 v2:172.21.15.64:3300/0 109 ==== service_map(e157 3 svc) v1 ==== 2398+0+0 (secure 0 0 0) 0x5597b5e60880 con 0x5597b3dbf800
2021-03-27T19:21:12.054+0000 7f454a0b0700 10 mgr ms_dispatch2 active service_map(e157 3 svc) v1
2021-03-27T19:21:12.054+0000 7f454a0b0700 10 mgr ms_dispatch2 service_map(e157 3 svc) v1
2021-03-27T19:21:12.054+0000 7f454a0b0700 10 mgr notify_all notify_all: notify_all service_map
2021-03-27T19:21:12.170+0000 7f454a0b0700  1 -- 172.21.15.3:0/17228 <== mon.1 v2:172.21.15.64:3300/0 111 ==== service_map(e157 3 svc) v1 ==== 2398+0+0 (secure 0 0 0) 0x5597b5e60a00 con 0x5597b3dbf800
2021-03-27T19:21:12.170+0000 7f454a0b0700 10 mgr ms_dispatch2 active service_map(e157 3 svc) v1
2021-03-27T19:21:12.170+0000 7f454a0b0700 10 mgr ms_dispatch2 service_map(e157 3 svc) v1
2021-03-27T19:21:12.170+0000 7f454a0b0700 10 mgr notify_all notify_all: notify_all service_map
2021-03-27T19:21:12.174+0000 7f454a0b0700  1 -- 172.21.15.3:0/17228 <== mon.1 v2:172.21.15.64:3300/0 120 ==== service_map(e159 5 svc) v1 ==== 12907+0+0 (secure 0 0 0) 0x5597b5e61780 con 0x5597b3dbf800
2021-03-27T19:21:12.174+0000 7f454a0b0700 10 mgr ms_dispatch2 active service_map(e159 5 svc) v1
2021-03-27T19:21:12.174+0000 7f454a0b0700 10 mgr ms_dispatch2 service_map(e159 5 svc) v1
2021-03-27T19:21:12.174+0000 7f454a0b0700 10 mgr notify_all notify_all: notify_all service_map
2021-03-27T19:21:12.662+0000 7f453023b700  1 -- [v2:172.21.15.3:6800/17228,v1:172.21.15.3:6801/17228] <== client.314839 172.21.15.201:0/392142730 5 ==== mgrreport(foo.rbd/image0 +0-0 packed 38) v9 ==== 86+0+0 (secure 0 0 0) 0x5597b4012a40 con 0x5597b3694800
2021-03-27T19:21:12.662+0000 7f453023b700 10 mgr.server handle_report from 0x5597b3694800 foo.rbd/image0
2021-03-27T19:21:12.662+0000 7f453023b700 20 mgr.server handle_report updating existing DaemonState for foo.rbd/image0

I'm going to try to reproduce this to see if I can get more information, as there just doesn't seem to be enough here.

#4 Updated by Neha Ojha almost 3 years ago

Saw a failure in ceph_test_rados_api_watch_notify_pp; could be related to https://tracker.ceph.com/issues/45615.

2021-04-14T01:40:16.439 INFO:tasks.workunit.client.0.smithi079.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr:bash: line 1: 36189 Segmentation fault      (core dumped) ceph_test_rados_api_watch_notify_pp 2>&1
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr:     36190 Done                    | tee ceph_test_rados_api_watch_notify_pp.log
2021-04-14T01:40:16.477 INFO:tasks.workunit.client.0.smithi079.stderr:     36192 Done                    | sed "s/^/      api_watch_notify_pp: /" 

/a/nojha-2021-04-14_00:54:53-rados-master-distro-basic-smithi/6044164

#5 Updated by Brad Hubbard almost 3 years ago

Created https://tracker.ceph.com/issues/50371 for the segfault analysis from /a/nojha-2021-04-14_00:54:53-rados-master-distro-basic-smithi/6044164. We can always mark them as duplicates/related if that proves to be the case once we know more.

#6 Updated by Neha Ojha almost 3 years ago

Looks similar to https://tracker.ceph.com/issues/50042#note-2, feel free to create a separate tracker, Brad.

2021-04-15T22:58:10.474 INFO:tasks.workunit.client.0.smithi167.stdout:              api_lock_pp: [----------] 8 tests from LibRadosLockECPP
2021-04-15T22:58:10.474 INFO:tasks.workunit.client.0.smithi167.stdout:              api_lock_pp: /build/ceph-17.0.0-3233-g6d727743/src/test/librados/testcase_cxx.cc:363: Failure
2021-04-15T22:58:10.475 INFO:tasks.workunit.client.0.smithi167.stdout:              api_lock_pp: Expected equality of these values:
2021-04-15T22:58:10.475 INFO:tasks.workunit.client.0.smithi167.stdout:              api_lock_pp:   "" 
2021-04-15T22:58:10.475 INFO:tasks.workunit.client.0.smithi167.stdout:              api_lock_pp:   create_one_ec_pool_pp(pool_name, s_cluster)
2021-04-15T22:58:10.475 INFO:tasks.workunit.client.0.smithi167.stdout:              api_lock_pp:     Which is: "mon_command osd pool create pool:test-rados-api-smithi167-18396-10 pool_type:erasure failed with error -22" 
2021-04-15T22:58:10.476 INFO:tasks.workunit.client.0.smithi167.stdout:              api_lock_pp: [ RUN      ] LibRadosLockECPP.LockExclusivePP
2021-04-15T22:58:10.487 INFO:tasks.workunit.client.0.smithi167.stderr:bash: line 1: 18396 Segmentation fault      (core dumped) ceph_test_rados_api_lock_pp 2>&1
2021-04-15T22:58:10.487 INFO:tasks.workunit.client.0.smithi167.stderr:     18398 Done                    | tee ceph_test_rados_api_lock_pp.log
2021-04-15T22:58:10.488 INFO:tasks.workunit.client.0.smithi167.stderr:     18399 Done                    | sed "s/^/              api_lock_pp: /" 

/a/nojha-2021-04-15_20:05:27-rados-wip-50217-distro-basic-smithi/6049636

#7 Updated by Brad Hubbard almost 3 years ago

Looking at the latest issue (ignoring the segfault, which is being tracked in https://tracker.ceph.com/issues/50371), the Transport endpoint error is seen in the client's log here.

2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _op_submit oid foo '@50;test-rados-api-smithi079-36189-12' '@53;test-rados-api-smithi079-36189-12' [watch unwatch cookie 94290589651712]    ⤷ tid 39 osd.4
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1c5740 osd=4 4
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 4 39 
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _send_op 39 to 53.1 on osd.4
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:39 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch         ⤷ unwatch cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x55c1bdc5a670 con 0x55c1bd1ace70
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 5
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter tick 
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter  pinging osd that serves lingering tid 11 (osd.4)
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter _send_linger_ping 11 now 1842.638147s
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1c5740 osd=4 4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 4 40 
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _send_op 40 to 53.1 on osd.4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:40 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch ping    ⤷ cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x7f9bc0003ee0 con 0x55c1bd1ace70

On osd.4 the log looks like this.

2021-04-14T01:40:16.365+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] execute_ctx 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch ping cookie 94290589651712] ov 83'8
2021-04-14T01:40:16.366+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] do_osd_op 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch ping cookie 94290589651712]
2021-04-14T01:40:16.366+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] do_osd_op  watch ping cookie 94290589651712
2021-04-14T01:40:16.366+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] watch ping: ctx->obc=0x1960f150 cookie=94290589651712 oi.version=8 ctx->at_version=0'0
2021-04-14T01:40:16.366+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] watch: oi.user_version=6
2021-04-14T01:40:16.366+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] watch: peer_addr=v1:172.21.15.79:0/3175972404
2021-04-14T01:40:16.367+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] do_osd_ops error: (107) Transport endpoint is not connected
2021-04-14T01:40:16.367+0000 323f6700 20 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] execute_ctx alloc reply 0x10288ea0 result -107
2021-04-14T01:40:16.367+0000 323f6700  1 -- v1:172.21.15.103:6805/31037 --> v1:172.21.15.79:0/3175972404 -- osd_op_reply(40 foo [watch ping cookie 94290589651712] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8 -- 0x10288ea0 con 0x95e65a0
2021-04-14T01:40:16.367+0000 323f6700 10 osd.4 83 dequeue_op 0x10495d20 finish
2021-04-14T01:40:16.368+0000 323f6700 20 osd.4 op_wq(1) _process empty q, waiting
2021-04-14T01:40:16.371+0000 d706700  1 -- v1:172.21.15.103:6805/31037 >> v1:172.21.15.79:0/2056787744 conn(0x94d0180 legacy=0x10205780 unknown :6805 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 65
2021-04-14T01:40:16.372+0000 d706700  1 -- v1:172.21.15.103:6805/31037 >> v1:172.21.15.79:0/2056787744 conn(0x94d0180 legacy=0x10205780 unknown :6805 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-04-14T01:40:16.372+0000 d706700  1 --1- v1:172.21.15.103:6805/31037 >> v1:172.21.15.79:0/2056787744 conn(0x94d0180 0x10205780 :6805 s=OPENED pgs=11 cs=1 l=1).handle_message read tag failed
2021-04-14T01:40:16.373+0000 d706700  1 --1- v1:172.21.15.103:6805/31037 >> v1:172.21.15.79:0/2056787744 conn(0x94d0180 0x10205780 :6805 s=OPENED pgs=11 cs=1 l=1).fault on lossy channel, failing

Just above that is this sequence.

2021-04-14T01:40:16.354+0000 323f6700 15 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] log_op_stats osd_op(client.4816.0:39 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch unwatch cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 inb 0 outb 0 lat 0.317980
2021-04-14T01:40:16.354+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean]  sending reply on osd_op(client.4816.0:39 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch unwatch cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 0x95c5100
2021-04-14T01:40:16.354+0000 323f6700  1 -- v1:172.21.15.103:6805/31037 --> v1:172.21.15.79:0/3175972404 -- osd_op_reply(39 foo [watch unwatch cookie 94290589651712] v83'8 uv6 ondisk = 0) v8 -- 0x95c5100 con 0x95e65a0
2021-04-14T01:40:16.352+0000 30bf3700 20 osd.4 pg_epoch: 83 pg[60.e( empty local-lis/les=81/83 n=0 ec=81/81 lis/c=0/0 les/c/f=0/0/0 sis=81) [4,7] r=0 lpr=81 crt=0'0 mlcod 0'0 creating+activating mbc={}] update_calc_stats misplaced 0
2021-04-14T01:40:16.356+0000 30bf3700 20 osd.4 pg_epoch: 83 pg[60.e( empty local-lis/les=81/83 n=0 ec=81/81 lis/c=0/0 les/c/f=0/0/0 sis=81) [4,7] r=0 lpr=81 crt=0'0 mlcod 0'0 creating+activating mbc={}] prepare_stats_for_publish reporting purged_snaps []
2021-04-14T01:40:16.356+0000 30bf3700 15 osd.4 pg_epoch: 83 pg[60.e( empty local-lis/les=81/83 n=0 ec=81/81 lis/c=0/0 les/c/f=0/0/0 sis=81) [4,7] r=0 lpr=81 crt=0'0 mlcod 0'0 creating+activating mbc={}] publish_stats_to_osd 83:4
2021-04-14T01:40:16.356+0000 30bf3700 10 osd.4 pg_epoch: 83 pg[60.e( empty local-lis/les=81/83 n=0 ec=81/81 lis/c=0/0 les/c/f=0/0/0 sis=81) [4,7] r=0 lpr=81 crt=0'0 mlcod 0'0 creating+activating mbc={}] state<Started/Primary/Active>: all_activated_and_committed
2021-04-14T01:40:16.356+0000 30bf3700 20 osd.4 pg_epoch: 83 pg[60.e( empty local-lis/les=81/83 n=0 ec=81/81 lis/c=0/0 les/c/f=0/0/0 sis=81) [4,7] r=0 lpr=81 crt=0'0 mlcod 0'0 creating+activating mbc={}] recalc_readable_until peer osd.7 ruub 229.032409668s
2021-04-14T01:40:16.357+0000 30bf3700 20 osd.4 pg_epoch: 83 pg[60.e( empty local-lis/les=81/83 n=0 ec=81/81 lis/c=0/0 les/c/f=0/0/0 sis=81) [4,7] r=0 lpr=81 crt=0'0 mlcod 0'0 creating+activating mbc={}] recalc_readable_until readable_until[_ub] 229.032409668s (sent 229.610900879s)
2021-04-14T01:40:16.357+0000 30bf3700  1 -- v1:172.21.15.103:6806/31037 --> v1:172.21.15.103:6814/31040 -- pg_lease(60.e pg_lease(ru 229.032409668s ub 229.610900879s int 64.000000000s) e83/83) v1 -- 0x400bead0 con 0xa15d5a0
2021-04-14T01:40:16.358+0000 323f6700 20 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] prepare_stats_for_publish reporting purged_snaps []
2021-04-14T01:40:16.358+0000 323f6700 15 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] publish_stats_to_osd 83:27
2021-04-14T01:40:16.358+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean]  removing repgather(0xa464240 83'8 rep_tid=113 committed?=1 r=0)
2021-04-14T01:40:16.358+0000 323f6700 20 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean]    q front is repgather(0xa464240 83'8 rep_tid=113 committed?=1 r=0)
2021-04-14T01:40:16.358+0000 323f6700 15 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] do_osd_op_effects client.4816 con 0x95e65a0
2021-04-14T01:40:16.359+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] do_osd_op_effects disconnect watcher 94290589651712,client.4816
2021-04-14T01:40:16.359+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean]  -- Watch(94290589651712,client.4816) remove
2021-04-14T01:40:16.359+0000 323f6700 15 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean]  -- Watch(94290589651712,client.4816) unregister_cb
2021-04-14T01:40:16.359+0000 323f6700 15 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean]  -- Watch(94290589651712,client.4816) actually registered, cancelling
2021-04-14T01:40:16.360+0000 323f6700 10 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean]  -- Watch(94290589651712,client.4816) ~Watch
2021-04-14T01:40:16.360+0000 323f6700 20 osd.4 pg_epoch: 83 pg[53.1( v 83'8 (0'0,83'8] local-lis/les=74/76 n=2 ec=74/74 lis/c=74/74 les/c/f=76/76/0 sis=74) [4,7] r=0 lpr=74 crt=83'8 lcod 82'7 mlcod 82'7 active+clean] remove_repop repgather(0xa464240 83'8 rep_tid=113 committed?=1 r=0)

The error is getting returned here.

src/osd/PrimaryLogPG.cc:
        } else if (op.watch.op == CEPH_OSD_WATCH_OP_PING) {
          /* Note: WATCH with PING doesn't cause may_write() to return true,
           * so if there is nothing else in the transaction, this is going
           * to run do_osd_op_effects, but not write out a log entry */
          if (!oi.watchers.count(make_pair(cookie, entity))) {
            result = -ENOTCONN;
            break;
          }

So it looks like the client tried to ping a watcher that no longer exists, which probably makes sense given we have just called unwatch. This looks like some sort of client-side race.

#8 Updated by Brad Hubbard almost 3 years ago

  • Related to Bug #50371: Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp added

#9 Updated by Brad Hubbard almost 3 years ago

I believe the log snippet below shows a race. We have just called and sent the unwatch command to osd.4, but just after that the objecter ticks in another thread, sends a watch ping, and osd.4 responds with ENOTCONN.

2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 10 subsystem: 14 client.4816.objecter _op_submit oid foo '@50;test-rados-api-smithi079-36189-12' '@53;test-rados-api-smithi079-36189-12' [watch unwatch cookie 94290589651712]    ⤷ tid 39 osd.4
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1c5740 osd=4 4
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 4 39
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 15 subsystem: 14 client.4816.objecter _send_op 39 to 53.1 on osd.4
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:39 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch         ⤷ unwatch cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x55c1bdc5a670 con 0x55c1bd1ace70
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 5
2021-04-14 01:40:16.034176 thread: 0x7f9bf3d11540 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter tick
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter  pinging osd that serves lingering tid 11 (osd.4)
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 10 subsystem: 14 client.4816.objecter _send_linger_ping 11 now 1842.638147s
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 20 subsystem: 14 client.4816.objecter get_session s=0x55c1bd1c5740 osd=4 4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _session_op_assign 4 40
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 15 subsystem: 14 client.4816.objecter _send_op 40 to 53.1 on osd.4
2021-04-14 01:40:16.162173 thread: 0x7f9be99a1700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 --> v1:172.21.15.103:6805/31037 -- osd_op(unknown.0.0:40 53.1 53:ac8ad6fd:test-rados-api-smithi079-36189-12::foo:head [watch ping    ⤷ cookie 94290589651712] snapc 0=[] ondisk+write+known_if_redirected e83) v8 -- 0x7f9bc0003ee0 con 0x55c1bd1ace70

Then we get back confirmation of the successful unwatch and the failed ping.

2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 <== osd.4 v1:172.21.15.103:6805/31037 12 ==== osd_op_reply(39 foo [watch unwatch cookie 94290589651712] v83'8 uv6 ondisk = 0) v8     ⤷ ==== 147+0+0 (unknown 2284314154 0 0) 0x7f9be0001060 con 0x55c1bd1ace70
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter ms_dispatch 0x55c1bd1f1c60 osd_op_reply(39 foo [watch unwatch cookie 94290589651712] v83'8 uv6 ondisk = 0) v8
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter in handle_osd_op_reply
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 7 subsystem: 14 client.4816.objecter handle_osd_op_reply 39 ondisk uv 6 in 53.1 attempt 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 10 subsystem: 14 client.4816.objecter  op 0 rval 0 len 0
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter handle_osd_op_reply completed tid 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _finish_op 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 20 subsystem: 14 client.4816.objecter put_session s=0x55c1bd1c5740 osd=4 6
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 15 subsystem: 14 client.4816.objecter _session_op_remove 4 39
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 5 subsystem: 14 client.4816.objecter 1 in flight
2021-04-14 01:40:16.376168 thread: 0x7f9bea9a3700 priority: 1 subsystem: 29 -- 172.21.15.79:0/3175972404 <== osd.4 v1:172.21.15.103:6805/31037 13 ==== osd_op_reply(40 foo [watch ping cookie 94290589651712] v0'0 uv0 ondisk = -107 ((107)   ⤷ Transport endpoint is not connected)) v8 ==== 147+0+0 (unknown 1918824261 0 0) 0x7f9be0001060 con 0x55c1bd1ace70

So it seems to be a race between the 'unwatch' command and the objecter 'tick' function.
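
To make that window concrete, here is an illustrative client-side sketch (my own, not the failing test): a WatchCtx2 whose handle_error() is where a raced ping's -ENOTCONN would surface, a watch2() that registers the linger op the objecter pings in the background, and the unwatch2() that can race with one of those pings. The pool name and object name are placeholders.

#include <rados/librados.hpp>
#include <iostream>
#include <string>

struct LoggingWatchCtx : public librados::WatchCtx2 {
  void handle_notify(uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, librados::bufferlist& bl) override {
    // Acknowledge the notify so the notifier does not wait for a timeout.
    ioctx->notify_ack(oid, notify_id, cookie, bl);
  }
  void handle_error(uint64_t cookie, int err) override {
    // A watch ping that raced with unwatch may be reported here as err=-107.
    std::cerr << "handle_error cookie " << cookie << " err " << err << std::endl;
  }
  librados::IoCtx* ioctx = nullptr;
  std::string oid;
};

int main() {
  librados::Rados cluster;
  if (cluster.init("admin") < 0 || cluster.conf_read_file(nullptr) < 0 ||
      cluster.connect() < 0)
    return 1;

  librados::IoCtx ioctx;
  cluster.ioctx_create("rbd", ioctx);   // pool name is illustrative

  const std::string oid = "foo";
  librados::bufferlist bl;
  bl.append("x", 1);
  ioctx.write_full(oid, bl);            // make sure the object exists

  LoggingWatchCtx ctx;
  ctx.ioctx = &ioctx;
  ctx.oid = oid;

  uint64_t handle = 0;
  ioctx.watch2(oid, &handle, &ctx);     // registers a linger op; the objecter
                                        // pings it periodically from its tick
  // ... do work; a background watch ping may be sent at any point here ...
  ioctx.unwatch2(handle);               // if a ping was already in flight, the
                                        // OSD may answer it with -ENOTCONN
  cluster.shutdown();
  return 0;
}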

#10 Updated by Brad Hubbard almost 3 years ago

Neha Ojha wrote:

Looks similar to https://tracker.ceph.com/issues/50042#note-2, feel free to create a separate tracker, Brad.
[...]

/a/nojha-2021-04-15_20:05:27-rados-wip-50217-distro-basic-smithi/6049636

Missed this one. Will take a look at this tomorrow.

#11 Updated by Brad Hubbard almost 3 years ago

Core from /a/nojha-2021-04-15_20:05:27-rados-wip-50217-distro-basic-smithi/6049636 is the issue seen in https://tracker.ceph.com/issues/50371

Oops, was looking at the original coredump.

I've created a new tracker for /a/nojha-2021-04-15_20:05:27-rados-wip-50217-distro-basic-smithi/6049636 at https://tracker.ceph.com/issues/50473

#12 Updated by Brad Hubbard almost 3 years ago

The actual failure that caused the segfault for /a/nojha-2021-04-15_20:05:27-rados-wip-50217-distro-basic-smithi/6049636 was https://tracker.ceph.com/issues/49428

$ find . -name ceph-mon.\*.log.gz -print -exec zgrep "creation canceled" {} \;
./remote/smithi167/log/ceph-mon.g.log.gz
./remote/smithi167/log/ceph-mon.f.log.gz
./remote/smithi167/log/ceph-mon.h.log.gz
./remote/smithi167/log/ceph-mon.i.log.gz
2021-04-15T22:58:09.541+0000 7f42d60dd700 20 mon.i@7(leader).osd e21 erasure code profile rm testprofile-test-rados-api-smithi167-18396-10: creation canceled
./remote/smithi039/log/ceph-mon.d.log.gz
./remote/smithi039/log/ceph-mon.a.log.gz
./remote/smithi039/log/ceph-mon.b.log.gz
./remote/smithi039/log/ceph-mon.e.log.gz
./remote/smithi039/log/ceph-mon.c.log.gz

#13 Updated by Neha Ojha almost 3 years ago

2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/0
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp: handle_error cookie 94477346232336 err -107
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.1-270-gfbbbd25a/rpm/el8/BUILD/ceph-16.2.1-270-gfbbbd25a/src/test/librados/watch_notify_cxx.cc:234: Failure
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp: Expected equality of these values:
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp:   watches.size()
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp:     Which is: 0
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp:   1u
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp:     Which is: 1
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/0, where GetParam() = "" (103416 ms)

/a/yuriw-2021-04-30_12:58:14-rados-wip-yuri2-testing-2021-04-29-1501-pacific-distro-basic-smithi/6086155

#14 Updated by Neha Ojha almost 3 years ago

  • Related to Bug #50595: upgrade:nautilus-x-pacific: LibRadosService.StatusFormat failure added

#15 Updated by Brad Hubbard almost 3 years ago

  • Related to Bug #40119: api_tier_pp hung causing a dead job added

#16 Updated by Deepika Upadhyay almost 3 years ago

2021-05-17T08:56:00.127 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.3-90-g02a5217e/rpm/el8/BUILD/ceph-16.2.3-90-g02a5217e/src/test/librados/watch_notify_cxx.cc:345: Failure
2021-05-17T08:56:00.127 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp: Expected equality of these values:
2021-05-17T08:56:00.127 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp:   watches.size()
2021-05-17T08:56:00.127 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp:     Which is: 0
2021-05-17T08:56:00.127 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp:   1u
2021-05-17T08:56:00.128 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp:     Which is: 1
2021-05-17T08:56:00.128 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0, where GetParam() = "" (197634 ms)
2021-05-17T08:56:00.128 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/1

/ceph/teuthology-archive/ideepika-2021-05-17_08:09:09-rados-wip-yuri2-testing-2021-05-11-1032-pacific-distro-basic-smithi/6118155/teuthology.log

#17 Updated by Brad Hubbard almost 3 years ago

Deepika Upadhyay wrote:

[...]

/ceph/teuthology-archive/ideepika-2021-05-17_08:09:09-rados-wip-yuri2-testing-2021-05-11-1032-pacific-distro-basic-smithi/6118155/teuthology.log

Looking at the above.

Just before what you posted, the teuthology log shows the following.

2021-05-17T08:56:00.126 INFO:tasks.workunit.client.0.smithi109.stdout:      api_watch_notify_pp: handle_error cookie 93878928180144 err -107

I think we see this because of the following error message on osd.6.

2021-05-17T08:55:59.187+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(78 foo [watch ping cookie 93878928180144] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8 -- 0x1011efc0 con 0x215ae090

This looks like yet another issue. My early theory is that this sequence may be delaying the addition of the most recent watch call.

2021-05-17T08:55:29.343+0000 3a111700 10 osd.6 pg_epoch: 313 pg[57.1( v 312'7 (0'0,312'7] local-lis/les=182/217 n=2 ec=155/155 lis/c=182/155 les/c/f=217/158/0 sis=182) [6,1] r=0 lpr=182 pi=[155,182)/2 crt=312'7 lcod 300'6 mlcod 162'3 active+recovery_wait+degraded mbc={255={(1+1)=1}}]  sending reply on osd_op(client.5074.0:74 57.1 57:911f189c:test-rados-api-smithi109-43439-18::foo:head [watch watch cookie 93878928180144] snapc 0=[] ondisk+write+known_if_redirected e312) v8 0x211062b0
2021-05-17T08:55:29.343+0000 3a111700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(74 foo [watch watch cookie 93878928180144] v312'7 uv6 ondisk = 0) v8 -- 0x211062b0 con 0x215ae090

Notice that the pg (57.1) is in 'recovery_wait', and I'm speculating that is an issue because of the following sequence.

$ zgrep -i -E '(93878928180144|list-watchers)' ./remote/smithi190/log/ceph-osd.6.log.gz|grep osd_op_reply
2021-05-17T08:52:10.388+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(32 foo [watch watch cookie 93878928180144] v162'3 uv2 ondisk = 0) v8 -- 0x16020a90 con 0x215ae090
2021-05-17T08:52:10.418+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(33 foo [list-watchers out=72b] v0'0 uv2 ondisk = 0) v8 -- 0x48332ed0 con 0x215ae090
2021-05-17T08:52:10.511+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(36 foo [watch unwatch cookie 93878928180144] v162'4 uv2 ondisk = 0) v8 -- 0x485e5d80 con 0x215ae090
2021-05-17T08:55:29.343+0000 3a111700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(74 foo [watch watch cookie 93878928180144] v312'7 uv6 ondisk = 0) v8 -- 0x211062b0 con 0x215ae090
2021-05-17T08:55:48.623+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(75 foo [list-watchers out=10b] v0'0 uv6 ondisk = 0) v8 -- 0xf249510 con 0x215ae090
2021-05-17T08:55:48.628+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(76 foo [watch ping cookie 93878928180144] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8 -- 0x21d55ee0 con 0x215ae090
2021-05-17T08:55:54.147+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(77 foo [watch ping cookie 93878928180144] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8 -- 0xff93f80 con 0x215ae090
2021-05-17T08:55:59.187+0000 36109700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- osd_op_reply(78 foo [watch ping cookie 93878928180144] v0'0 uv0 ondisk = -107 ((107) Transport endpoint is not connected)) v8 -- 0x1011efc0 con 0x215ae090

The 'watch' call we are interested in is just before the 'list-watchers' call (which presumably returns zero and causes the test to fail because it expects 1 watcher to be active). The watch pings on that cookie subsequently start to return -107, which is why we see the 'handle_error cookie 93878928180144 err -107' message in the teuthology.log just before the test failure.

Ah, digging a little deeper into the intervening period, we see this.

2021-05-17T08:55:29.351+0000 3a111700 15 osd.6 pg_epoch: 313 pg[57.1( v 312'7 (0'0,312'7] local-lis/les=182/217 n=2 ec=155/155 lis/c=182/155 les/c/f=217/158/0 sis=182) [6,1] r=0 lpr=182 pi=[155,182)/2 crt=312'7 lcod 300'6 mlcod 162'3 active+recovery_wait+degraded mbc={255={(1+1)=1}}]  -- Watch(93878928180144,client.5074) registering callback, timeout: 12

Twelve seconds later.

2021-05-17T08:55:41.368+0000 33904700 10 osd.6 pg_epoch: 321 pg[57.1(unlocked)]  -- Watch(93878928180144,client.5074) HandleWatchTimeout

Then this sequence.

2021-05-17T08:55:43.123+0000 3a111700 10 osd.6 pg_epoch: 322 pg[57.1( v 321'8 (0'0,321'8] local-lis/les=182/217 n=2 ec=155/155 lis/c=182/182 les/c/f=217/320/0 sis=182) [6,1] r=0 lpr=182 crt=321'8 lcod 312'7 mlcod 312'7 active+clean] do_osd_op_effects disconnect watcher 93878928180144,client.5074
2021-05-17T08:55:43.125+0000 3a111700 10 osd.6 pg_epoch: 322 pg[57.1( v 321'8 (0'0,321'8] local-lis/les=182/217 n=2 ec=155/155 lis/c=182/182 les/c/f=217/320/0 sis=182) [6,1] r=0 lpr=182 crt=321'8 lcod 312'7 mlcod 312'7 active+clean]  -- Watch(93878928180144,client.5074) remove
2021-05-17T08:55:43.126+0000 3a111700  1 -- v2:172.21.15.190:6809/34096 --> 172.21.15.109:0/335756937 -- watch-notify(disconnect (3) cookie 93878928180144 notify 0 ret 0) v3 -- 0x160055b0 con 0x215ae090

So we called watch and it timed out after 12 seconds, therefore failing. The 'list-watchers' call subsequently returns 0 where 1 is expected by the code in src/test/librados/watch_notify_cxx.cc:

329 TEST_P(LibRadosWatchNotifyPP, WatchNotify3) {
330   notify_oid = "foo";
331   notify_ioctx = &ioctx;
332   notify_cookies.clear();
333   uint32_t timeout = 12; // configured timeout
334   char buf[128];
335   memset(buf, 0xcc, sizeof(buf));
336   bufferlist bl1;
337   bl1.append(buf, sizeof(buf));
338   ASSERT_EQ(0, ioctx.write(notify_oid, bl1, sizeof(buf), 0));
339   uint64_t handle;
340   WatchNotifyTestCtx2 ctx(this);
341   ASSERT_EQ(0, ioctx.watch3(notify_oid, &handle, &ctx, timeout));
342   ASSERT_GT(ioctx.watch_check(handle), 0);
343   std::list<obj_watch_t> watches;
344   ASSERT_EQ(0, ioctx.list_watchers(notify_oid, &watches));
345   ASSERT_EQ(watches.size(), 1u);   <------------------------- This assert fails

So that explains why the test failed, but I'm not sure this is an issue with the test. It's certainly a new issue, so it needs a tracker of its own. I'll follow up tomorrow after I've had a chance to discuss how we want to proceed.

#18 Updated by Brad Hubbard almost 3 years ago

/a/yuriw-2021-04-30_12:58:14-rados-wip-yuri2-testing-2021-04-29-1501-pacific-distro-basic-smithi/6086155 is the same as above, except in a different test, and the timeout was 30 seconds (the default).

2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp: handle_error cookie 94477346232336 err -107 
2021-04-30T14:10:36.956 INFO:tasks.workunit.client.0.smithi018.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.1-270-gfbbbd25a/rpm/el8/BUILD/ceph-16.2.1-270-gfbbbd25a/src/test/librados/watch_notify_cxx.cc:234: Failure
2021-04-30T13:43:48.425+0000 32d02700 15 osd.2 pg_epoch: 228 pg[53.0( v 228'11 (0'0,228'11] local-lis/les=152/155 n=3 ec=152/152 lis/c=152/0 les/c/f=155/0/0 sis=152) [2] r=0 lpr=152 crt=228'11 lcod 217'10 mlcod 217'10 active+undersized+degraded mbc={}]  -- Watch(94477346232336,client.5200) registering callback, timeout: 30
2021-04-30T13:44:18.434+0000 30cfe700 10 osd.2 pg_epoch: 248 pg[53.0(unlocked)]  -- Watch(94477346232336,client.5200) HandleWatchTimeout

We are going to tackle these timeout failures (3 different ones known so far) in the existing tracker https://tracker.ceph.com/issues/47025.

#19 Updated by Brad Hubbard almost 3 years ago

  • Related to Bug #47025: rados/test.sh: api_watch_notify_pp LibRadosWatchNotifyECPP.WatchNotify failed added

#20 Updated by Neha Ojha almost 3 years ago

/a/yuriw-2021-05-25_19:21:19-rados-wip-yuri2-testing-2021-05-25-0940-pacific-distro-basic-smithi/6134471

#21 Updated by Deepika Upadhyay almost 3 years ago

2021-06-15T01:47:53.406 INFO:tasks.workunit.client.0.smithi032.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotifyEC.WatchNotify
2021-06-15T01:47:53.406 INFO:tasks.workunit.client.0.smithi032.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13-38-g9b6d90a4/rpm/el8/BUILD/ceph-15.2.13-38-g9b6d90a4/src/test/librados/TestCase.cc:172: Failure
2021-06-15T01:47:53.407 INFO:tasks.workunit.client.0.smithi032.stdout:         api_watch_notify: Expected equality of these values:
2021-06-15T01:47:53.407 INFO:tasks.workunit.client.0.smithi032.stdout:         api_watch_notify:   0
2021-06-15T01:47:53.407 INFO:tasks.workunit.client.0.smithi032.stdout:         api_watch_notify:   rados_ioctx_create(cluster, pool_name.c_str(), &ioctx)
2021-06-15T01:47:53.407 INFO:tasks.workunit.client.0.smithi032.stdout:         api_watch_notify:     Which is: -107
2021-06-15T01:47:53.482 INFO:tasks.ceph.osd.2.smithi032.stderr:2021-06-15T01:47:53.483+0000 7fb7fac56700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2  (PID: 32159) UID: 0
2021-06-15T01:47:53.499 INFO:tasks.workunit.client.0.smithi032.stderr:bash: line 1: 39105 Segmentation fault      (core dumped) ceph_test_rados_api_watch_notify 2>&1
2021-06-15T01:47:53.499 INFO:tasks.workunit.client.0.smithi032.stderr:     39106 Done                    | tee ceph_test_rados_api_watch_notify.log
2021-06-15T01:47:53.499 INFO:tasks.workunit.client.0.smithi032.stderr:     39107 Done                    | sed "s/^/         api_watch_notify: /" 
2021-06-15T01:47:53.505 INFO:teuthology.orchestra.run.smithi032.stderr:2021-06-15T01:47:53.506+0000 7f851d8de700 10 monclient: start_mon_command cm

/ceph/teuthology-archive/yuriw-2021-06-14_19:20:57-rados-wip-yuri6-testing-2021-06-14-1106-octopus-distro-basic-smithi/6172399/teuthology.log

#22 Updated by Neha Ojha over 2 years ago

  • Related to Bug #51307: LibRadosWatchNotify.Watch2Delete fails added

#23 Updated by Neha Ojha over 2 years ago

2021-12-14T02:23:13.918 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1
2021-12-14T02:23:13.918 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-9602-g24c69e1e/rpm/el8/BUILD/ceph-17.0.0-9602-g24c69e1e/src/test/librados/watch_notify_cxx.cc:95: Failure
2021-12-14T02:23:13.918 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp: Expected equality of these values:
2021-12-14T02:23:13.919 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp:   1u
2021-12-14T02:23:13.919 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp:     Which is: 1
2021-12-14T02:23:13.919 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp:   watches.size()
2021-12-14T02:23:13.919 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp:     Which is: 0
2021-12-14T02:23:13.920 INFO:tasks.workunit.client.0.smithi003.stdout:      api_watch_notify_pp: [  FAILED  ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1, where GetParam() = "cache" (289053 ms)

/a/yuriw-2021-12-13_23:18:05-rados-wip-yuri-testing-2021-12-13-1210-distro-default-smithi/6561405

#24 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-02-24_22:04:22-rados-wip-yuri7-testing-2022-02-17-0852-pacific-distro-default-smithi/6704772

2022-02-24T23:32:45.093 INFO:tasks.workunit.client.0.smithi085.stdout:              api_tier_pp: [----------] Global test environment tear-down
2022-02-24T23:32:45.093 INFO:tasks.workunit.client.0.smithi085.stdout:              api_tier_pp: [==========] 71 tests from 4 test suites ran. (1618276 ms total)
2022-02-24T23:32:45.093 INFO:tasks.workunit.client.0.smithi085.stdout:              api_tier_pp: [  PASSED  ] 71 tests.
2022-02-24T23:32:45.093 INFO:tasks.workunit.client.0.smithi085.stderr:+ exit 1
2022-02-24T23:32:45.094 INFO:tasks.workunit.client.0.smithi085.stderr:+ cleanup
2022-02-24T23:32:45.094 INFO:tasks.workunit.client.0.smithi085.stderr:+ pkill -P 47335
2022-02-24T23:32:45.095 INFO:teuthology.orchestra.run.smithi201.stdout:{
2022-02-24T23:32:45.095 INFO:teuthology.orchestra.run.smithi201.stdout:    "ops": [],
2022-02-24T23:32:45.095 INFO:teuthology.orchestra.run.smithi201.stdout:    "complaint_time": 30,
2022-02-24T23:32:45.096 INFO:teuthology.orchestra.run.smithi201.stdout:    "num_blocked_ops": 0
2022-02-24T23:32:45.096 INFO:teuthology.orchestra.run.smithi201.stdout:}
2022-02-24T23:32:45.097 INFO:tasks.ceph.osd.5.smithi201.stderr:2022-02-24T23:32:45.098+0000 14a6d700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.5.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 5  (PID: 36906) UID: 0
2022-02-24T23:32:45.099 DEBUG:teuthology.orchestra.run.smithi201:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_historic_ops
2022-02-24T23:32:45.110 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-02-24T23:32:45.111 INFO:tasks.workunit.client.0.smithi085.stderr:+ true
2022-02-24T23:32:45.111 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...

#25 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-01_17:45:51-rados-wip-yuri3-testing-2022-02-28-0757-pacific-distro-default-smithi/6714863

2022-03-02T04:20:45.832 INFO:tasks.workunit.client.0.smithi036.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (3064 ms)
2022-03-02T04:20:45.832 INFO:tasks.workunit.client.0.smithi036.stdout:              api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (231118 ms total)
2022-03-02T04:20:45.832 INFO:tasks.workunit.client.0.smithi036.stdout:              api_tier_pp:
2022-03-02T04:20:45.832 INFO:tasks.workunit.client.0.smithi036.stdout:              api_tier_pp: [----------] Global test environment tear-down
2022-03-02T04:20:45.832 INFO:tasks.workunit.client.0.smithi036.stdout:              api_tier_pp: [==========] 71 tests from 4 test suites ran. (776815 ms total)
2022-03-02T04:20:45.833 INFO:tasks.workunit.client.0.smithi036.stdout:              api_tier_pp: [  PASSED  ] 71 tests.
2022-03-02T04:20:45.833 INFO:tasks.workunit.client.0.smithi036.stderr:+ exit 1
2022-03-02T04:20:45.834 INFO:tasks.workunit.client.0.smithi036.stderr:+ cleanup
2022-03-02T04:20:45.834 INFO:tasks.workunit.client.0.smithi036.stderr:+ pkill -P 41859
2022-03-02T04:20:45.847 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-03-02T04:20:45.848 INFO:tasks.workunit.client.0.smithi036.stderr:+ true
2022-03-02T04:20:45.849 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-03-02T04:20:45.849 DEBUG:teuthology.orchestra.run.smithi036:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-03-02T04:20:46.156 ERROR:teuthology.run_tasks:Saw exception from tasks.

#26 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-01_17:45:51-rados-wip-yuri3-testing-2022-02-28-0757-pacific-distro-default-smithi/6714656

2022-03-01T22:29:25.384 INFO:tasks.workunit.client.0.smithi142.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 41208 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2022-03-01T22:29:25.384 INFO:tasks.workunit.client.0.smithi142.stderr:++ true
2022-03-01T22:29:25.385 INFO:tasks.workunit.client.0.smithi142.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsPP.CacheP
2022-03-01T22:29:25.385 INFO:tasks.workunit:Stopping ['rados/test.sh', 'rados/test_pool_quota.sh'] on client.0...
2022-03-01T22:29:25.386 DEBUG:teuthology.orchestra.run.smithi142:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-03-01T22:29:25.387 INFO:tasks.ceph.osd.7.smithi150.stderr:2022-03-01T22:29:25.373+0000 7f9ae3dbe700 -1 osd.7 404 get_health_metrics reporting 17 slow ops, oldest is osd_op(client.5300.0:7586 212.f 212:f76e3a33:test-rados-api-smithi142-41217-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e403)

#27 Updated by Laura Flores about 2 years ago

  • Related to Bug #45423: api_tier_pp: [ FAILED ] LibRadosTwoPoolsPP.HitSetWrite added

#28 Updated by Laura Flores about 2 years ago

Found a case of https://tracker.ceph.com/issues/45423 in master. That issue had a fix that was merged, but it seems to be popping up again.

/a/dgalloway-2022-03-09_02:34:58-rados-wip-45272-distro-basic-smithi/6727572

2022-03-09T04:19:07.909 INFO:tasks.workunit.client.0.smithi111.stdout:              api_tier_pp: checking for 194:6cac518f:::0:head
2022-03-09T04:19:07.909 INFO:tasks.workunit.client.0.smithi111.stdout:              api_tier_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10990-gc11455fb/rpm/el8/BUILD/ceph-17.0.0-10990-gc11455fb/src/test/librados/tier_cxx.cc:2756: Failure
2022-03-09T04:19:07.909 INFO:tasks.workunit.client.0.smithi111.stdout:              api_tier_pp: Value of: found
2022-03-09T04:19:07.909 INFO:tasks.workunit.client.0.smithi111.stdout:              api_tier_pp:   Actual: false
2022-03-09T04:19:07.909 INFO:tasks.workunit.client.0.smithi111.stdout:              api_tier_pp: Expected: true
2022-03-09T04:19:07.910 INFO:tasks.workunit.client.0.smithi111.stdout:              api_tier_pp: [  FAILED  ] LibRadosTwoPoolsPP.HitSetWrite (363127 ms)

I logged the instance on that Tracker, but I'm logging it here too as it might be relevant.

#29 Updated by Neha Ojha about 2 years ago

  • Related to deleted (Bug #45423: api_tier_pp: [ FAILED ] LibRadosTwoPoolsPP.HitSetWrite)

#30 Updated by Neha Ojha about 2 years ago

  • Subject changed from sporadic rados/test.sh failures to rados/test.sh: api_watch_notify failures

Let's use this tracker to track all the watch notify failures. For other api test failures, let's open new trackers. It also seems like we have fixed some of the api_tier_pp issues.

#31 Updated by Laura Flores about 2 years ago

  • Related to Bug #55001: rados/test.sh: Early exit right after LibRados global tests complete added

#32 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-24_14:35:45-rados-wip-yuri7-testing-2022-03-23-1332-quincy-distro-default-smithi/6757986

2022-03-24T15:12:20.584 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.WatchNotify2Timeout
2022-03-24T15:12:20.585 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-81-gacdeec56/rpm/el8/BUILD/ceph-17.1.0-81-gacdeec56/src/test/librados/watch_notify.cc:428: Failure
2022-03-24T15:12:20.585 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: Expected equality of these values:
2022-03-24T15:12:20.585 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:   1u
2022-03-24T15:12:20.586 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:     Which is: 1
2022-03-24T15:12:20.586 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:   notify_cookies.size()
2022-03-24T15:12:20.586 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:     Which is: 0
2022-03-24T15:12:20.587 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify2Timeout (27562 ms)

#33 Updated by Brad Hubbard about 2 years ago

Laura Flores wrote:

/a/yuriw-2022-03-24_14:35:45-rados-wip-yuri7-testing-2022-03-23-1332-quincy-distro-default-smithi/6757986

This looks like a race to me.

2022-03-24T15:12:20.584 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.WatchNotify2Timeout
2022-03-24T15:12:20.585 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-81-gacdeec56/rpm/el8/BUILD/ceph-17.1.0-81-gacdeec56/src/test/librados/watch_notify.cc:428: Failure
2022-03-24T15:12:20.585 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: Expected equality of these values:
2022-03-24T15:12:20.585 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:   1u
2022-03-24T15:12:20.586 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:     Which is: 1
2022-03-24T15:12:20.586 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:   notify_cookies.size()
2022-03-24T15:12:20.586 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify:     Which is: 0
2022-03-24T15:12:20.587 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: [  FAILED  ] LibRadosWatchNotify.WatchNotify2Timeout (27562 ms)
2022-03-24T15:12:20.587 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: [ RUN      ] LibRadosWatchNotify.Watch3Timeout
2022-03-24T15:12:20.587 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: watch_notify2_test_cb from 4580 notify_id 841813590016 cookie 94437677328464
2022-03-24T15:12:20.587 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: waiting up to 1024 for osd to time us out ...
2022-03-24T15:12:20.588 INFO:tasks.workunit.client.0.smithi027.stdout:         api_watch_notify: watch_notify2_test_errcb cookie 94437677314544 err -107

Notice how the messages from watch_notify2_test_cb and watch_notify2_test_errcb come after the test has already failed?

Thinking out loud, without a deep investigation: perhaps we need something similar to https://github.com/badone/ceph/blob/d5b155df9bdcf2ac2e364075f3fc8b1bf1d8cf22/src/test/librados/watch_notify.cc#L479-L487 to cure this?

Can this be easily reproduced?
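
If it is a race, a guard along these lines might help. This is only a sketch of the approach in the linked branch (the helper name and timeout are assumptions; notify_cookies follows the naming in src/test/librados/watch_notify.cc): poll until the callback has populated the cookie set before asserting on it.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <set>
#include <thread>

extern std::set<uint64_t> notify_cookies;  // filled by watch_notify2_test_cb

// Wait up to max_secs for the watch callback to record at least `expected`
// cookies, so the assertion no longer races the callback delivery.
static bool wait_for_notify_cookies(size_t expected, int max_secs)
{
  using namespace std::chrono_literals;
  for (int i = 0; i < max_secs * 10; ++i) {
    if (notify_cookies.size() >= expected)
      return true;
    std::this_thread::sleep_for(100ms);
  }
  return notify_cookies.size() >= expected;
}

The test would then do something like ASSERT_TRUE(wait_for_notify_cookies(1, 30)) before the existing ASSERT_EQ(1u, notify_cookies.size()).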

#34 Updated by Brad Hubbard almost 2 years ago

  • Related to Bug #45868: rados_api_tests: LibRadosWatchNotify.AioWatchNotify2 fails added

#35 Updated by Nitzan Mordechai almost 2 years ago

It looks like we are too fast here: rados_watch2 didn't call the callback yet (we still don't have the print from watch_notify2_test_cb), and we are probably reusing the old handler (we didn't wipe it and never checked the previous return code of unwatch2).
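
A hedged sketch of what checking that return code and wiping the handle could look like (the scaffolding below is illustrative; rados_watch2, rados_unwatch2 and rados_watch_check are the actual librados C calls the test exercises):

#include <rados/librados.h>
#include <cassert>
#include <cstdint>

// Re-register a watch without silently reusing stale state: check the
// return code of the previous unwatch, reset the handle, and confirm the
// new watch is actually established before relying on its callbacks.
void rewatch(rados_ioctx_t ioctx, const char* oid,
             rados_watchcb2_t cb, rados_watcherrcb_t errcb, void* arg)
{
  uint64_t handle = 0;
  int r = rados_watch2(ioctx, oid, &handle, cb, errcb, arg);
  assert(r == 0);

  /* ... notify, wait for the callback, etc. ... */

  r = rados_unwatch2(ioctx, handle);
  assert(r == 0);   // don't ignore the previous unwatch's return code
  handle = 0;       // don't carry the old handle into the next watch

  r = rados_watch2(ioctx, oid, &handle, cb, errcb, arg);
  assert(r == 0);
  assert(rados_watch_check(ioctx, handle) >= 0);  // watch really registered
}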

#36 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-09-29_16:40:30-rados-wip-all-kickoff-r-distro-default-smithi/7047940

2022-09-29T20:22:35.507 INFO:tasks.workunit.client.0.smithi081.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioNotify/0
2022-09-29T20:22:35.530 INFO:tasks.workunit.client.0.smithi081.stderr:bash: line 1: 102610 Segmentation fault      (core dumped) ceph_test_rados_api_watch_notify_pp 2>&1
2022-09-29T20:22:35.530 INFO:tasks.workunit.client.0.smithi081.stderr:     102612 Done                    | tee ceph_test_rados_api_watch_notify_pp.log
2022-09-29T20:22:35.531 INFO:tasks.workunit.client.0.smithi081.stderr:     102613 Done                    | sed "s/^/      api_watch_notify_pp: /" 
2022-09-29T20:22:36.486 INFO:tasks.ceph.osd.5.smithi121.stderr:2022-09-29T20:22:36.484+0000 7fc3a977d700 -1 osd.5 42 handle_osd_map injecting map crc failure
2022-09-29T20:22:37.570 INFO:tasks.ceph.osd.5.smithi121.stderr:2022-09-29T20:22:37.568+0000 7fc3a977d700 -1 osd.5 43 handle_osd_map injecting map crc failure

(gdb) bt
#0  0x00007f8cd545b313 in WatchInfo::handle_error (err=-107, cookie=94202633757536, this=0x55ad426df570) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/librados/IoCtxImpl.cc:1610
#1  WatchInfo::operator() (bl=..., notifier_id=<optimized out>, cookie=94202633757536, notify_id=<optimized out>, ec=..., this=0x55ad426df570) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/librados/IoCtxImpl.cc:1619
#2  fu2::abi_310::detail::invocation::invoke<WatchInfo&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list> (callable=...)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/include/function2.hpp:126
#3  fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&) (
    data=<optimized out>, capacity=<optimized out>, args#0=..., args#1=<optimized out>, args#2=94202633757536, args#3=<optimized out>, args#4=...) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/include/function2.hpp:518
#4  0x00007f8cd54d2ed0 in fu2::abi_310::detail::type_erasure::tables::vtable<fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >::invoke<0ul, fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long const&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list>(fu2::abi_310::detail::type_erasure::data_accessor*&&, unsigned long const&, boost::system::error_code&&, unsigned long&&, unsigned long&&, unsigned long&&, ceph::buffer::v15_2_0::list&&) const (this=0x55ad4285a770)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/include/function2.hpp:1121
#5  fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >::invoke<0ul, fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list>(fu2::abi_310::detail::type_erasure::erasure<true, fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >&, boost::system::error_code&&, unsigned long&&, unsigned long&&, unsigned long&&, ceph::buffer::v15_2_0::list&&) (erasure=...) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/include/function2.hpp:1123
#6  fu2::abi_310::detail::type_erasure::invocation_table::operator_impl<0ul, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::operator()(boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&) (args#4=..., args#3=0, args#2=94202633757536, args#1=0, args#0=..., this=0x55ad4285a760)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/include/function2.hpp:702
#7  CB_DoWatchError::operator() (this=0x7f8cca8a5ae0) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/osdc/Objecter.cc:650
#8  boost::asio::detail::binder0<CB_DoWatchError>::operator() (this=0x7f8cca8a5ae0) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/bind_handler.hpp:60
#9  boost::asio::asio_handler_invoke<boost::asio::detail::binder0<CB_DoWatchError> > (function=...) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/handler_invoke_hook.hpp:88
#10 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<CB_DoWatchError>, CB_DoWatchError> (context=..., function=...)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#11 boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0<CB_DoWatchError>, CB_DoWatchError> (this_handler=0x7f8cca8a5ae0, function=...)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/bind_handler.hpp:111
#12 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<CB_DoWatchError>, boost::asio::detail::binder0<CB_DoWatchError> > (context=..., function=...)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#13 boost::asio::detail::handler_work<boost::asio::detail::binder0<CB_DoWatchError>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void>::complete<boost::asio::detail::binder0<CB_DoWatchError> > (handler=..., 
    function=..., this=<synthetic pointer>) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/handler_work.hpp:512
#14 boost::asio::detail::completion_handler<boost::asio::detail::binder0<CB_DoWatchError>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete (owner=<optimized out>, base=<optimized out>)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/completion_handler.hpp:74
#15 0x00007f8cd546f995 in boost::asio::detail::strand_service::do_complete (owner=0x55ad42624830, base=0x55ad426f3cc0, ec=...)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#16 0x00007f8cd547012a in boost::asio::detail::scheduler::do_run_one (this=0x55ad42624830, lock=..., this_thread=..., ec=...)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#17 0x00007f8cd545eab1 in boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .isra.0] ()
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/detail/impl/scheduler.ipp:210
#18 0x00007f8cd5474a4c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<make_named_thread<ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::basic_string_view<char, std::char_traits<char> >, ceph::async::io_context_pool::start(short)::{lambda()#1}&&)::{lambda(auto:1, auto:2&&)#1}, {lambda()#1}> > >::_M_run() (this=0x55ad426f4930)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/asio/impl/io_context.ipp:63
#19 0x00007f8cd28a1ba3 in std::execute_native_thread_routine (__p=0x55ad426f4930) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#20 0x00007f8cd51401cf in start_thread (arg=<optimized out>) at pthread_create.c:534
#21 0x00007f8cd1eb9d83 in __new_exitfn (listp=0x7ffda4aac840) at cxa_atexit.c:92

Same issue as what was documented in this closed Tracker: https://tracker.ceph.com/issues/50371

#37 Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee changed from Brad Hubbard to Nitzan Mordechai

Assigning to Nitzan just for the sake of testing the hypothesis from https://tracker.ceph.com/issues/50042#note-35.

#38 Updated by Nitzan Mordechai over 1 year ago

I checked all the list_watchers failures (the ones checking the size of the watch list). It looks like the watcher timed out, which is why list_watchers returned 0 watches. We will not get a specific return code from list_watchers (it is not based on cookies), and even if we add watch_check, there is a chance we will hit another empty watch list. Since the rados test runs with the default watch timeout (30 sec), I'll increase it to 120 sec and run another test.
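
For illustration only (the helper below is an assumption, not the actual test change): even with watch_check there is a window in which the OSD has already expired the watch, so list_watchers() legitimately comes back empty; raising the watch timeout only shrinks that window.

#include <rados/librados.hpp>
#include <cerrno>
#include <cstdint>
#include <list>
#include <string>

// Distinguish "our watch already timed out" from "the OSD just doesn't list
// it (yet)". Returns 0 if the watcher is visible, a negative error if our
// own handle is dead, and -EAGAIN for the racy empty-list case above.
int classify_missing_watcher(librados::IoCtx& ioctx, const std::string& oid,
                             uint64_t handle)
{
  std::list<obj_watch_t> watches;
  int r = ioctx.list_watchers(oid, &watches);
  if (r < 0)
    return r;
  if (!watches.empty())
    return 0;
  r = ioctx.watch_check(handle);   // negative (e.g. -ENOTCONN) if it's gone
  return r < 0 ? r : -EAGAIN;
}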

#39 Updated by Nitzan Mordechai over 1 year ago

(gdb) bt
#0  0x00007f8cd545b313 in WatchInfo::handle_error (err=-107, cookie=94202633757536, this=0x55ad426df570) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/librados/IoCtxImpl.cc:1610
#1  WatchInfo::operator() (bl=..., notifier_id=<optimized out>, cookie=94202633757536, notify_id=<optimized out>, ec=..., this=0x55ad426df570) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/librados/IoCtxImpl.cc:1619
#2  fu2::abi_310::detail::invocation::invoke<WatchInfo&, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list> (callable=...)
    at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/include/function2.hpp:126
#3  fu2::abi_310::detail::type_erasure::invocation_table::function_trait<void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)>::internal_invoker<fu2::abi_310::detail::type_erasure::box<false, WatchInfo, std::allocator<WatchInfo> >, false>::invoke(fu2::abi_310::detail::type_erasure::data_accessor*, unsigned long, boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&) (
    data=<optimized out>, capacity=<optimized out>, args#0=..., args#1=<optimized out>, args#2=94202633757536, args#3=<optimized out>, args#4=...) at /usr/src/debug/ceph-17.0.0-15175.g7f41f157.el8.x86_64/src/include/function2.hpp:518

When we hit handle_error before handle_notify, we can get that segmentation fault while trying to erase the cookie from the notify_cookies set.
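
A minimal sketch of the defensive handling this implies (the names and structure below are assumptions for illustration, not the code from PR 49109): the error callback can fire before, after, or instead of handle_notify, so any shared bookkeeping set must be guarded and the erase must tolerate a cookie that was never inserted.

#include <cstdint>
#include <mutex>
#include <set>

struct WatchState {
  std::mutex lock;
  std::set<uint64_t> notify_cookies;

  void on_notify(uint64_t cookie) {
    std::lock_guard<std::mutex> l(lock);
    notify_cookies.insert(cookie);
  }

  void on_error(uint64_t cookie, int err) {
    std::lock_guard<std::mutex> l(lock);
    // erase() on a missing key is a no-op, so hitting the error path before
    // (or without) a notify does not touch memory it doesn't own.
    notify_cookies.erase(cookie);
    (void)err;  // e.g. -ENOTCONN (-107), as seen in the backtrace
  }
};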

#40 Updated by Nitzan Mordechai over 1 year ago

  • Status changed from New to In Progress

#41 Updated by Nitzan Mordechai over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 49109

#42 Updated by Nitzan Mordechai over 1 year ago

  • Related to Bug #45615: api_watch_notify_pp: LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify/1 fails added

#43 Updated by Nitzan Mordechai about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from pacific to pacific,quincy

#44 Updated by Backport Bot about 1 year ago

  • Copied to Backport #58868: pacific: rados/test.sh: api_watch_notify failures added

#45 Updated by Backport Bot about 1 year ago

  • Copied to Backport #58869: quincy: rados/test.sh: api_watch_notify failures added

#46 Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed

#47 Updated by Laura Flores about 1 year ago

  • Tags set to test-failure

/a/yuriw-2023-03-07_23:24:03-rados-wip-yuri6-testing-2023-03-07-1336-quincy-distro-default-smithi/7196726

2023-03-08T02:07:29.775 INFO:tasks.workunit.client.0.smithi153.stdout:      api_watch_notify_pp: [ RUN      ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioWatchNotify2/1
2023-03-08T02:07:30.059 INFO:tasks.workunit.client.0.smithi153.stderr:bash: line 1: 30512 Segmentation fault      (core dumped) ceph_test_rados_api_watch_notify_pp 2>&1
2023-03-08T02:07:30.059 INFO:tasks.workunit.client.0.smithi153.stderr:     30513 Done                    | tee ceph_test_rados_api_watch_notify_pp.log
2023-03-08T02:07:30.059 INFO:tasks.workunit.client.0.smithi153.stderr:     30514 Done                    | sed "s/^/      api_watch_notify_pp: /" 
2023-03-08T02:07:30.541 INFO:tasks.workunit.client.0.smithi153.stdout:                 api_lock: Running main() from gmock_main.cc
2023-03-08T02:07:30.541 INFO:tasks.workunit.client.0.smithi153.stdout:                 api_lock: [==========] Running 16 tests from 2 test suites.

#48 Updated by Laura Flores 9 months ago

/a/yuriw-2023-06-23_20:51:14-rados-wip-yuri8-testing-2023-06-22-1309-pacific-distro-default-smithi/7314199

Coredump at:
/a/yuriw-2023-06-23_20:51:14-rados-wip-yuri8-testing-2023-06-22-1309-pacific-distro-default-smithi/7314199/remote/smithi112/coredump

Same symptom:

/a/yuriw-2023-06-23_20:51:14-rados-wip-yuri8-testing-2023-06-22-1309-pacific-distro-default-smithi/7314199/remote/smithi112/log$ zgrep "creation canceled" ceph-mon.a.log.gz
2023-06-24T00:19:26.954+0000 7f319c7a7700 20 mon.a@0(leader).osd e23 erasure code profile rm testprofile-test-rados-api-smithi112-108683-15: creation canceled

#49 Updated by Laura Flores 9 months ago

This just needed a Pacific backport. Approved the backport and marked it ready for QA.

#50 Updated by Laura Flores 9 months ago

  • Duplicated by Bug #49428: ceph_test_rados_api_snapshots fails with "rados_mon_command osd pool create failed with error -22" added

#51 Updated by Konstantin Shalygin 3 months ago

  • Subject changed from rados/test.sh: api_watch_notify failures to rados/test.sh: api_watch_notify failures
  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
