Bug #65495 (open): slow request in rgw suite causes test failure
Description
on an integration branch based on squid, an rgw suite job failed due to 'slow request' errors: https://qa-proxy.ceph.com/teuthology/cbodley-2024-04-15_17:32:27-rgw-wip-cbodley2-testing-distro-default-smithi/7657196/teuthology.log
...
2024-04-15T18:12:54.115 INFO:tasks.ceph.osd.2.smithi005.stderr:2024-04-15T18:12:54.109+0000 7fa2ac7ca640 -1 osd.2 35 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35)
2024-04-15T18:12:55.086 INFO:tasks.ceph.osd.2.smithi005.stderr:2024-04-15T18:12:55.081+0000 7fa2ac7ca640 -1 osd.2 35 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35)
2024-04-15T18:12:56.074 INFO:tasks.ceph.osd.2.smithi005.stderr:2024-04-15T18:12:56.069+0000 7fa2ac7ca640 -1 osd.2 35 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35)
2024-04-15T18:12:56.604 INFO:tasks.workunit.client.0.smithi005.stderr:Read timeout on endpoint URL: "https://localhost:443/check-bucket/e"
...
teuthology.exceptions.CommandFailedError: Command failed (workunit test rgw/run-bucket-check.sh) on smithi005 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=2a33e1bb169d6d4488af40dc16da3c050844e360 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/run-bucket-check.sh'
the qa/workunits/rgw/run-bucket-check.sh script sets up the test by creating several objects "a", "b", "c", etc. creation of object "e" hangs, and a request timeout ultimately causes the workunit to fail
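for context, that setup phase amounts to a loop of small PUTs. a hypothetical sketch follows (bucket name check-bucket and the https://localhost:443 endpoint are taken from the logs in this ticket; the aws cli invocation is illustrative, not the script's actual contents):

```shell
# Hypothetical reproduction of the workunit's setup phase: PUT a series of
# small objects. In the failed run, the PUT of object "e" never completed.
for obj in a b c d e; do
  aws --endpoint-url https://localhost:443 --no-verify-ssl \
    s3api put-object --bucket check-bucket --key "$obj" --body /dev/null
done
```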
2024-04-15T18:11:56.550+0000 934f9640 20 req 2506544305206278652 0.031999689s s3:put_obj get_obj_state: octx=0x13ee81c90 obj=check-bucket:_:zZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e state=0x14096dfa8 s->prefetch_data=0
2024-04-15T18:11:56.550+0000 934f9640 10 req 2506544305206278652 0.031999689s s3:put_obj setting object write_tag=429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2506544305206278652
2024-04-15T18:11:56.550+0000 934f9640 20 req 2506544305206278652 0.031999689s s3:put_obj bucket index object: :.dir.429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2.1
this shows rgw preparing to write the head object with oid zZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e. no further log output shows up for req 2506544305206278652, which indicates that rgw never got the response to this rados write request
$ grep 'slow request' -n ceph-osd.2.log
117000636:2024-04-15T18:12:27.201+0000 7fa2ac7ca640 20 slow request osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35) initiated 2024-04-15T18:11:56.558730+0000 currently waiting for readable
117000640:2024-04-15T18:12:27.201+0000 7fa2ac7ca640 20 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117000643:2024-04-15T18:12:27.201+0000 7fa2ac7ca640 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117371032:2024-04-15T18:12:28.221+0000 7fa2ac7ca640 20 slow request osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35) initiated 2024-04-15T18:11:56.558730+0000 currently waiting for readable
117371038:2024-04-15T18:12:28.221+0000 7fa2ac7ca640 20 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117371041:2024-04-15T18:12:28.221+0000 7fa2ac7ca640 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117758176:2024-04-15T18:12:29.241+0000 7fa2ac7ca640 20 slow request osd_op(client.4486.0:1493 3.3f 3:fe9ff65e:::429598c5-09a8-449d-831a-b9efc6a49f6b.4486.2__%3azZJKxn4eHoUQdp8clpg-8aClbiCxTP0_e:head [create,setxattr user.rgw.idtag (62) in=76b,setxattr user.rgw.tail_tag (62) in=79b,writefull 0~11 in=11b,set-alloc-hint object_size 0 write_size 0,setxattr user.rgw.manifest (371) in=388b,setxattr user.rgw.acl (179) in=191b,setxattr user.rgw.compression (59) in=79b,setxattr user.rgw.etag (32) in=45b,setxattr user.rgw.torrent (109) in=125b,setxattr user.rgw.x-amz-content-sha256 (17) in=46b,setxattr user.rgw.x-amz-date (17) in=36b,setxattr user.rgw.x-amz-meta-tag (7) in=30b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e35) initiated 2024-04-15T18:11:56.558730+0000 currently waiting for readable
117758180:2024-04-15T18:12:29.241+0000 7fa2ac7ca640 20 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
117758183:2024-04-15T18:12:29.241+0000 7fa2ac7ca640 0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'default.rgw.buckets.data' : 1 ])
the log file for osd.2 is far larger than those of the other osds:
ceph-osd.0.log.gz 15-Apr-2024 18:24 3572161
ceph-osd.1.log.gz 15-Apr-2024 18:24 3452643
ceph-osd.2.log.gz 15-Apr-2024 18:24 1558959387
ceph-osd.3.log.gz 15-Apr-2024 18:24 4008726
the log was too big to inspect myself, but hopefully it can shed some light on the slow request: https://qa-proxy.ceph.com/teuthology/cbodley-2024-04-15_17:32:27-rgw-wip-cbodley2-testing-distro-default-smithi/7657196/remote/smithi005/log/ceph-osd.2.log.gz
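for anyone who does want to dig in: since the compressed log is ~1.5 GB, a sketch of how the history of the stuck op (client.4486.0:1493) could be pulled out without fully unpacking the file. the filename matches the teuthology archive above; the grep patterns are taken from the log lines in this ticket:

```shell
# Extract only the lines that mention the stuck op from the compressed osd
# log, then count how many times it was reported as "waiting for readable".
zcat ceph-osd.2.log.gz | grep 'client.4486.0:1493' > op-1493.log
grep -c 'currently waiting for readable' op-1493.log
```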
Updated by Casey Bodley 15 days ago
i see that one of the osds on the other node has a similarly large log:
ceph-osd.4.log.gz 15-Apr-2024 18:24 3339388
ceph-osd.5.log.gz 15-Apr-2024 18:24 3793322
ceph-osd.6.log.gz 15-Apr-2024 18:24 1207287104
ceph-osd.7.log.gz 15-Apr-2024 18:24 3971665
scanning through ceph-osd.2.log, i see that osds 2 and 6 seem to be exchanging a whole lot of peering traffic?
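a rough way to quantify that suspicion (a sketch, assuming each log line names its peer as an "osd.N" token; adjust the pattern to the actual log format):

```shell
# Count occurrences of each "osd.N" token in the log and list the most
# frequent ones; if the peering theory holds, osd.6 should dominate the
# output for ceph-osd.2.log.gz.
zcat ceph-osd.2.log.gz | grep -oE 'osd\.[0-9]+' | sort | uniq -c | sort -rn | head
```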