Bug #65495 (open): slow request in rgw suite causes test failure
Description
on an integration branch based on squid, an rgw suite job failed due to 'slow request' errors: https://qa-proxy.ceph.com/teuthology/cbodley-2024-04-15_17:32:27-rgw-wip-cbodley2-testing-distro-default-smithi/7657196/teuthology.log
...
2024-04-15T18:12:54.115 INFO:tasks.ceph.osd.2.smithi005.stderr:2024-04-15T18:12:54.109+0000 7fa2ac7ca640 -1 osd.2 35 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35)
2024-04-15T18:12:55.086 INFO:tasks.ceph.osd.2.smithi005.stderr:2024-04-15T18:12:55.081+0000 7fa2ac7ca640 -1 osd.2 35 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35)
2024-04-15T18:12:56.074 INFO:tasks.ceph.osd.2.smithi005.stderr:2024-04-15T18:12:56.069+0000 7fa2ac7ca640 -1 osd.2 35 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35)
2024-04-15T18:12:56.604 INFO:tasks.workunit.client.0.smithi005.stderr:Read timeout on endpoint URL: "https://localhost:443/check-bucket/e"
...
teuthology.exceptions.CommandFailedError: Command failed (workunit test rgw/run-bucket-check.sh) on smithi005 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=2a33e1bb169d6d4488af40dc16da3c050844e360 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/run-bucket-check.sh'
the qa/workunits/rgw/run-bucket-check.sh script sets up the test by creating several objects "a", "b", "c", etc. creation of object "e" hangs, and a request timeout ultimately causes the workunit to fail
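for context, that setup phase amounts to a loop of small PUTs. a hypothetical sketch follows (bucket name check-bucket and the https://localhost:443 endpoint are taken from the logs in this ticket; the aws cli invocation is illustrative, not the script's actual contents):

```shell
# Hypothetical reproduction of the workunit's setup phase: PUT a series of
# small objects. In the failed run, the PUT of object "e" never completed.
for obj in a b c d e; do
  aws --endpoint-url https://localhost:443 --no-verify-ssl \
    s3api put-object --bucket check-bucket --key "$obj" --body /dev/null
done
```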
2024-04-15T18:11:56.550+0000 934f9640 20 req 2506544305206278652 0.031999689s s3:put_obj get_obj_state: octx=0x13ee81c90 obj=check-bucket:_:zZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e state=0x14096dfa8 s->prefetch_data=0
2024-04-15T18:11:56.550+0000 934f9640 10 req 2506544305206278652 0.031999689s s3:put_obj setting object write_tag=429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2506544305206278652
2024-04-15T18:11:56.550+0000 934f9640 20 req 2506544305206278652 0.031999689s s3:put_obj bucket index object: :.dir.429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2.1
this shows rgw preparing to write the head object with oid zZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e. no further log output shows up for req 2506544305206278652, which indicates that rgw never got the response to this rados write request
$ grep 'slow request' -n ceph-osd.2.log
117000636:2024-04-15T18:12:27.201+0000 7fa2ac7ca640 20 slow request osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35) initiated 2024-04-15T18:11:56.558730+0000 currently waiting for readable
117000640:2024-04-15T18:12:27.201+0000 7fa2ac7ca640 20 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117000643:2024-04-15T18:12:27.201+0000 7fa2ac7ca640 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117371032:2024-04-15T18:12:28.221+0000 7fa2ac7ca640 20 slow request osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35) initiated 2024-04-15T18:11:56.558730+0000 currently waiting for readable
117371038:2024-04-15T18:12:28.221+0000 7fa2ac7ca640 20 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117371041:2024-04-15T18:12:28.221+0000 7fa2ac7ca640 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117758176:2024-04-15T18:12:29.241+0000 7fa2ac7ca640 20 slow request osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35) initiated 2024-04-15T18:11:56.558730+0000 currently waiting for readable
117758180:2024-04-15T18:12:29.241+0000 7fa2ac7ca640 20 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117758183:2024-04-15T18:12:29.241+0000 7fa2ac7ca640 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
the log file for osd.2 is far larger than those of the other osds:
ceph-osd.0.log.gz 15-Apr-2024 18:24 3572161
ceph-osd.1.log.gz 15-Apr-2024 18:24 3452643
ceph-osd.2.log.gz 15-Apr-2024 18:24 1558959387
ceph-osd.3.log.gz 15-Apr-2024 18:24 4008726
the log was too big to inspect myself, but hopefully it can shed some light on the slow request: https://qa-proxy.ceph.com/teuthology/cbodley-2024-04-15_17:32:27-rgw-wip-cbodley2-testing-distro-default-smithi/7657196/remote/smithi005/log/ceph-osd.2.log.gz
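for anyone who does want to dig in: since the compressed log is ~1.5 GB, a sketch of how the history of the stuck op (client.4486.0:1493) could be pulled out without fully unpacking the file. the filename matches the teuthology archive above; the grep patterns are taken from the log lines in this ticket:

```shell
# Extract only the lines that mention the stuck op from the compressed osd
# log, then count how many times it was reported as "waiting for readable".
zcat ceph-osd.2.log.gz | grep 'client.4486.0:1493' > op-1493.log
grep -c 'currently waiting for readable' op-1493.log
```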
Updated by Casey Bodley 15 days ago
i see that one of the osds on the other node has a similarly large log:
ceph-osd.4.log.gz 15-Apr-2024 18:24 3339388
ceph-osd.5.log.gz 15-Apr-2024 18:24 3793322
ceph-osd.6.log.gz 15-Apr-2024 18:24 1207287104
ceph-osd.7.log.gz 15-Apr-2024 18:24 3971665
scanning through ceph-osd.2.log, i see that osds 2 and 6 seem to be exchanging a whole lot of peering traffic?
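a rough way to quantify that suspicion (a sketch, assuming each log line names its peer as an "osd.N" token; adjust the pattern to the actual log format):

```shell
# Count occurrences of each "osd.N" token in the log and list the most
# frequent ones; if the peering theory holds, osd.6 should dominate the
# output for ceph-osd.2.log.gz.
zcat ceph-osd.2.log.gz | grep -oE 'osd\.[0-9]+' | sort | uniq -c | sort -rn | head
```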