Bug #56609

closed

performance issues causing teuthology failures: RGWWatcher::handle_error (107) Transport endpoint is not connected

Added by Casey Bodley almost 2 years ago. Updated over 1 year ago.

Status: Closed
Priority: Urgent
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Some s3tests in teuthology are failing due to significant delays, including watch/notify disconnections:

2022-07-18T04:09:12.709 INFO:teuthology.orchestra.run.smithi117.stderr:s3tests_boto3.functional.test_s3.test_object_copy_versioned_url_encoding ... ok
2022-07-18T04:13:05.070 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:05.016+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 539093392 err (107) Transport endpoint is not connected
2022-07-18T04:13:05.073 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:05.045+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 538935136 err (107) Transport endpoint is not connected
2022-07-18T04:13:05.074 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:05.050+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 5486698560 err (107) Transport endpoint is not connected
2022-07-18T04:13:05.076 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:05.051+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 765811392 err (107) Transport endpoint is not connected
2022-07-18T04:13:21.233 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:13.675+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 765769920 err (107) Transport endpoint is not connected
2022-07-18T04:13:21.234 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:13.675+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 539131760 err (107) Transport endpoint is not connected
2022-07-18T04:13:21.235 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:13.675+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 539052992 err (107) Transport endpoint is not connected
2022-07-18T04:13:21.236 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:13:13.676+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 297726304 err (107) Transport endpoint is not connected
2022-07-18T04:14:41.903 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:14:40.668+0000 1a623700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 5460315872 err (107) Transport endpoint is not connected
2022-07-18T04:14:41.937 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:14:41.934+0000 1a623700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 766699728 err (107) Transport endpoint is not connected
2022-07-18T04:14:41.938 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:14:41.934+0000 1a623700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 768581008 err (107) Transport endpoint is not connected
2022-07-18T04:14:41.939 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:14:41.934+0000 1a623700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 839848816 err (107) Transport endpoint is not connected
2022-07-18T04:14:41.985 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:14:41.975+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 5478414768 err (107) Transport endpoint is not connected
2022-07-18T04:14:41.986 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:14:41.975+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 292018432 err (107) Transport endpoint is not connected
2022-07-18T04:14:41.986 INFO:tasks.rgw.client.0.smithi117.stdout:2022-07-18T04:14:41.976+0000 1ae24700 -1 rgw watcher librados: RGWWatcher::handle_error cookie 535364960 err (107) Transport endpoint is not connected
2022-07-18T04:15:43.605 INFO:teuthology.orchestra.run.smithi117.stderr:s3tests_boto3.functional.test_s3.test_object_copy_versioning_multipart_upload ... ERROR

ex: http://qa-proxy.ceph.com/teuthology/mbenjamin-2022-07-17_22:24:15-rgw-wip-rgwlc-azone-distro-default-smithi/6935342/teuthology.log
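
To put a number on the delays in the excerpt above, the leading teuthology timestamps can be compared line by line. The following is only a rough sketch: the helper names `teuthology_time` and `find_stalls` and the 60-second threshold are illustrative assumptions, not part of teuthology or rgw.

```python
from datetime import datetime

def teuthology_time(line):
    """Parse the leading teuthology timestamp (the first space-separated field)."""
    return datetime.fromisoformat(line.split(" ", 1)[0])

def find_stalls(lines, threshold_s=60.0):
    """Return (gap_seconds, previous_line, next_line) for every gap between
    consecutive log lines that exceeds threshold_s (an arbitrary cutoff)."""
    stalls = []
    for prev, cur in zip(lines, lines[1:]):
        gap = (teuthology_time(cur) - teuthology_time(prev)).total_seconds()
        if gap > threshold_s:
            stalls.append((gap, prev, cur))
    return stalls

# Two consecutive lines copied from the log excerpt in this report.
sample = [
    "2022-07-18T04:09:12.709 INFO:teuthology.orchestra.run.smithi117.stderr:"
    "s3tests_boto3.functional.test_s3.test_object_copy_versioned_url_encoding ... ok",
    "2022-07-18T04:13:05.070 INFO:tasks.rgw.client.0.smithi117.stdout:"
    "2022-07-18T04:13:05.016+0000 1ae24700 -1 rgw watcher librados: "
    "RGWWatcher::handle_error cookie 539093392 err (107) Transport endpoint is not connected",
]

for gap, prev, cur in find_stalls(sample):
    print(f"{gap:.3f}s stall before: {cur[:60]}")
```

Run against a full teuthology.log, the same loop would list every point where output stopped for longer than the chosen threshold, which may help line up the stalls with the watcher errors.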


Related issues 1 (0 open, 1 closed)

Has duplicate rgw - Bug #57128: rgw: s3tests are failing with timeout (Duplicate)

Actions #1

Updated by Ali Maredia almost 2 years ago

I have seen these issues running the full rgw suite in teuthology on the main branch over the last 1-2 weeks, specifically in runs of the s3tests in the verify suite. The failures from this issue are not common, but they did happen in a few jobs in every run:

Run #1: https://pulpito.ceph.com/amaredia-2022-07-19_21:41:51-rgw-main-distro-default-smithi/
Failed Job #1: https://pulpito.ceph.com/amaredia-2022-07-19_21:41:51-rgw-main-distro-default-smithi/6938676/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-19_21:41:51-rgw-main-distro-default-smithi/6938676/remote/
Failed Job #2: https://pulpito.ceph.com/amaredia-2022-07-19_21:41:51-rgw-main-distro-default-smithi/6938622/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-19_21:41:51-rgw-main-distro-default-smithi/6938622/remote/
Failed Job #3: https://pulpito.ceph.com/amaredia-2022-07-19_21:41:51-rgw-main-distro-default-smithi/6938649/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-19_21:41:51-rgw-main-distro-default-smithi/6938649/remote/

Run #2: https://pulpito.ceph.com/amaredia-2022-07-19_18:37:19-rgw-main-distro-default-smithi/
Failed Job #1: https://pulpito.ceph.com/amaredia-2022-07-19_18:37:19-rgw-main-distro-default-smithi/6938441/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-19_18:37:19-rgw-main-distro-default-smithi/6938441/remote/
Failed Job #2: https://pulpito.ceph.com/amaredia-2022-07-19_18:37:19-rgw-main-distro-default-smithi/6938467/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-19_18:37:19-rgw-main-distro-default-smithi/6938467/remote/
Failed Job #3: https://pulpito.ceph.com/amaredia-2022-07-19_18:37:19-rgw-main-distro-default-smithi/6938492/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-19_18:37:19-rgw-main-distro-default-smithi/6938492/remote/

Run #3: https://pulpito.ceph.com/amaredia-2022-07-15_17:26:12-rgw-main-distro-default-smithi/
Failed Job #1: https://pulpito.ceph.com/amaredia-2022-07-15_17:26:12-rgw-main-distro-default-smithi/6932391/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-15_17:26:12-rgw-main-distro-default-smithi/6932391/remote/
Failed Job #2: https://pulpito.ceph.com/amaredia-2022-07-15_17:26:12-rgw-main-distro-default-smithi/6932397/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-15_17:26:12-rgw-main-distro-default-smithi/6932397/remote/

Run #4: https://pulpito.ceph.com/amaredia-2022-07-14_18:37:42-rgw-main-distro-default-smithi/
Failed Job #1: https://pulpito.ceph.com/amaredia-2022-07-14_18:37:42-rgw-main-distro-default-smithi/6930614/
Link to Ceph Logs for Job: http://qa-proxy.ceph.com/teuthology/amaredia-2022-07-14_18:37:42-rgw-main-distro-default-smithi/6930614/remote/

The rgw verify teuthology suite config for each of the above jobs, in order from Run #1 Job #1 to Run #4 Job #1, is:

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/no_datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/bluestore-bitmap overrides proto/https rgw_pool_type/ec-profile s3tests-branch sharding$/{default} striping$/{stripe-greater-than-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/rgw-datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/bluestore-bitmap overrides proto/https rgw_pool_type/ec s3tests-branch sharding$/{single} striping$/{stripe-equals-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/no_datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/filestore-xfs overrides proto/https rgw_pool_type/ec s3tests-branch sharding$/{default} striping$/{stripe-greater-than-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/rgw-datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/bluestore-bitmap overrides proto/https rgw_pool_type/replicated s3tests-branch sharding$/{single} striping$/{stripe-greater-than-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/no_datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/filestore-xfs overrides proto/https rgw_pool_type/replicated s3tests-branch sharding$/{single} striping$/{stripe-greater-than-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/no_datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/bluestore-bitmap overrides proto/https rgw_pool_type/replicated s3tests-branch sharding$/{default} striping$/{stripe-equals-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/no_datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/filestore-xfs overrides proto/https rgw_pool_type/ec-profile s3tests-branch sharding$/{single} striping$/{stripe-greater-than-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/no_datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/filestore-xfs overrides proto/https rgw_pool_type/replicated s3tests-branch sharding$/{single} striping$/{stripe-equals-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/rgw-datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/bluestore-bitmap overrides proto/https rgw_pool_type/ec-profile s3tests-branch sharding$/{default} striping$/{stripe-equals-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}

I don't see anything that sticks out that's in common for each of these jobs, though maybe someone else with better eyes than me can see a pattern.
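
One way to look for a pattern mechanically is to split each job description into its top-level facets and intersect them across jobs. This is a minimal sketch, assuming the descriptions follow the `suite/{facet facet/{nested} ...}` shape shown above; `split_facets` and `common_and_varying` are hypothetical helpers, not teuthology functions.

```python
def split_facets(description):
    """Split a job description into top-level facets, keeping nested {...} groups whole."""
    # Keep only what sits between the outermost braces.
    body = description[description.index("{") + 1:description.rindex("}")]
    facets, current, depth = [], [], 0
    for ch in body:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == " " and depth == 0:
            # A space outside any nested group ends the current facet.
            if current:
                facets.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        facets.append("".join(current))
    return facets

def common_and_varying(descriptions):
    """Return (facets present in every job, facets present in only some jobs)."""
    facet_sets = [set(split_facets(d)) for d in descriptions]
    common = set.intersection(*facet_sets)
    varying = set.union(*facet_sets) - common
    return common, varying

# The first two descriptions from the list above, copied verbatim.
jobs = [
    "rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/no_datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/bluestore-bitmap overrides proto/https rgw_pool_type/ec-profile s3tests-branch sharding$/{default} striping$/{stripe-greater-than-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}",
    "rgw/verify/{0-install centos_latest clusters/fixed-2 datacache/rgw-datacache frontend/beast ignore-pg-availability msgr-failures/few objectstore/bluestore-bitmap overrides proto/https rgw_pool_type/ec s3tests-branch sharding$/{single} striping$/{stripe-equals-chunk} tasks/{cls ragweed reshard s3tests-java s3tests} validater/valgrind}",
]

common, varying = common_and_varying(jobs)
print("common:", sorted(common))
print("varying:", sorted(varying))
```

Every description listed above includes `msgr-failures/few` and `validater/valgrind`, so those would appear among the common facets. That alone does not establish a cause, since it says nothing about which facets the passing jobs used.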

Actions #2

Updated by Casey Bodley over 1 year ago

  • Has duplicate Bug #57128: rgw: s3tests are failing with timeout added
Actions #3

Updated by Ali Maredia over 1 year ago

  • Status changed from New to Resolved

Something changed over the last few months (probably in the OSD) that made this issue go away. I'm closing this issue unless it reappears.

Below are 2 clean runs of the RGW Verify suite:

https://pulpito.ceph.com/amaredia-2022-09-22_22:59:08-rgw:verify-main-distro-default-smithi/

https://pulpito.ceph.com/amaredia-2022-09-22_14:05:55-rgw:verify-main-distro-default-smithi/

Actions #4

Updated by Ali Maredia over 1 year ago

  • Status changed from Resolved to Closed