Bug #59164 (closed): LC rules cause latency spikes

Added by Jamison Lofthouse about 1 year ago. Updated 11 months ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
lifecycle
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a number of lifecycle rules with object expirations set to 5-7 days. I have been noticing latency spikes pretty regularly on these buckets (roughly every 4 operations: a PUT of an 800MiB file goes from 3-4s to 160s+, and a PUT of a 150MiB file can go from 0.15s to 7s). Digging through the logs, I have traced this to the lifecycle rule header steps, for example:

debug 2023-03-24T20:28:52.849+0000 7efd52095700 20 req 16391328287536034602 0.507005215s s3:get_obj RGWObjManifest::operator++(): result: ofs=19922944 stripe_ofs=19922944 part_ofs=15728640 rule->part_size=5242880
debug 2023-03-24T20:28:52.849+0000 7efd52095700 20 req 16391328287536034602 0.507005215s s3:get_obj rados->get_obj_iterate_cb oid=9394f5d7-23ae-40c0-b813-9d0d7bc5cdc8.18373438.3__shadow_3f0666b9-37d9-4259-b8c5-d65a6ed016e6.tmp.2~z6j_XmAAaz7Ulp69KZ-qwPMTepXSPRZ.4_1 obj-ofs=19922944 read_ofs=0 len=1048576
debug 2023-03-24T20:28:52.849+0000 7efd52095700 10 req 16391328287536034602 0.507005215s rule: cgf005prr2v3m88aeu0g prefix:  expiration:  date:  days: 5 noncur_expiration:  date:  days: 0
debug 2023-03-24T20:29:02.087+0000 7efdad14b700 10 req 16391328287536034602 9.744100571s rule: cgf005prr2v3m88aeu0g prefix:  expiration:  date:  days: 5 noncur_expiration:  date:  days: 0
debug 2023-03-24T20:29:02.123+0000 7efdd699e700 10 req 16391328287536034602 9.780099869s rule: cgf005prr2v3m88aeu0g prefix:  expiration:  date:  days: 5 noncur_expiration:  date:  days: 0
debug 2023-03-24T20:29:02.253+0000 7efd4d88c700 10 req 16391328287536034602 9.910101891s rule: cgf005prr2v3m88aeu0g prefix:  expiration:  date:  days: 5 noncur_expiration:  date:  days: 0
debug 2023-03-24T20:29:02.286+0000 7efd3485a700 10 req 16391328287536034602 9.943102837s rule: cgf005prr2v3m88aeu0g prefix:  expiration:  date:  days: 5 noncur_expiration:  date:  days: 0
debug 2023-03-24T20:29:02.363+0000 7efdef1cf700 20 req 16391328287536034602 10.020103455s s3:get_obj RGWObjManifest::operator++(): rule->part_size=5242880 rules.size()=2
debug 2023-03-24T20:29:02.363+0000 7efdef1cf700 20 req 16391328287536034602 10.020103455s s3:get_obj RGWObjManifest::operator++(): stripe_ofs=24117248 part_ofs=15728640 rule->part_size=5242880
debug 2023-03-24T20:29:02.363+0000 7efdef1cf700 20 req 16391328287536034602 10.020103455s s3:get_obj RGWObjManifest::operator++(): result: ofs=20971520 stripe_ofs=20971520 part_ofs=20971520 rule->part_size=5242880
debug 2023-03-24T20:29:02.363+0000 7efdef1cf700 20 req 16391328287536034602 10.020103455s s3:get_obj rados->get_obj_iterate_cb oid=9394f5d7-23ae-40c0-b813-9d0d7bc5cdc8.18373438.3__multipart_3f0666b9-37d9-4259-b8c5-d65a6ed016e6.tmp.2~z6j_XmAAaz7Ulp69KZ-qwPMTepXSPRZ.5 obj-ofs=20971520 read_ofs=0 len=4194304

The full log is available as a download. This does not appear to happen every time, and removing the lifecycle rules seems to get rid of the spikes.

The cluster is deployed via Rook.
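
For reference, the rules in question are plain expiration rules of roughly this shape, applied for example via the aws CLI (illustrative only: the bucket name and endpoint are placeholders, and the rule ID and days are taken from the log above):

aws s3api put-bucket-lifecycle-configuration \
  --endpoint-url http://<rgw-endpoint> \
  --bucket <bucket-name> \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "cgf005prr2v3m88aeu0g",
        "Filter": {"Prefix": ""},
        "Status": "Enabled",
        "Expiration": {"Days": 5}
      }
    ]
  }'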


Files

case6-filtered-sanitized.log (378 KB) - Full log from RGW - Jamison Lofthouse, 03/24/2023 09:00 PM

Related issues: 3 (0 open, 3 closed)

Related to rgw - Bug #58673: When bucket index ops are cancelled it can leave behind zombie index entries (Resolved, Casey Bodley)

Related to rgw - Bug #59663: rgw: expired delete markers created by deleting non-existant object multiple times are not being removed from data pool after deletion from bucket (Resolved, Cory Snyder)

Related to rgw - Bug #61359: Consistency bugs with OLH objects (Resolved, Cory Snyder)

Actions #1

Updated by Jamison Lofthouse about 1 year ago

I should mention this log is from a GET, not a PUT. PUTs have similar latency spikes but I don't think lifecycle would be checked on new files.

Actions #2

Updated by Matt Benjamin about 1 year ago

Hi Jamison,

Lifecycle processing just does an unordered listing of the bucket(s)/prefix(es) that have a policy. It would appear that there must be high latency from RADOS when listing the index segments.

Matt

Actions #3

Updated by Jamison Lofthouse about 1 year ago

Matt Benjamin wrote:

It would appear that there must be high latency from RADOS listing the index segments.

I'm unfamiliar with how the indexes work, but if these are newly written files, could they be affected by index corruption or other weirdness?

Actions #4

Updated by Casey Bodley about 1 year ago

  • Tags set to lifecycle

rule: cgf005prr2v3m88aeu0g prefix: expiration: date: days: 5 noncur_expiration: date: days: 0

This log output is coming from rgw::lc::s3_expiration_header() [1], which reads the rules from the bucket's RGW_ATTR_LC xattr. This function gets called for each chunk of the response body that we send back to the client; there isn't any additional I/O here for buckets with lifecycle enabled.

That 9-second gap in the log output would be either waiting on the response to an OSD read request or waiting for the client socket to become writable.

What is the S3 client here? Is it possible that it's slow to read the response, leaving the TCP connection flow-controlled for that long?

If not, you might investigate the RADOS cluster for reports of slow requests on those __multipart or __shadow objects. You could also enable debug_ms=1 in radosgw to see the timing of OSD requests and replies; a sketch of those steps is below.

[1] https://github.com/ceph/ceph/blob/f5c5009/src/rgw/rgw_lc.cc#L2609-L2617
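
A rough sketch of those steps (the daemon and OSD names are placeholders; adjust for however Rook names the rgw daemon):

# bump messenger debugging on the rgw daemon to see the timing of OSD requests/replies
ceph config set client.rgw.<instance> debug_ms 1
# check the cluster for slow request warnings
ceph health detail
# inspect recent long-running ops on a suspect OSD via its admin socket
ceph daemon osd.<id> dump_historic_ops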

Actions #5

Updated by Jamison Lofthouse about 1 year ago

I haven't been able to reproduce this since I ran radosgw-admin bucket check --check-objects --fix on the buckets in question. I'll revisit if I see this again.
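
For anyone who finds this later, bucket check takes the bucket name explicitly, so the invocation was along these lines (the bucket name is a placeholder):

radosgw-admin bucket check --check-objects --fix --bucket=<bucket-name>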

Actions #6

Updated by Casey Bodley about 1 year ago

  • Status changed from New to Can't reproduce
Actions #7

Updated by Cory Snyder 11 months ago

I suspect that this may be related to issues we've observed with bucket indexes that have a lot of leftover invalid entries. That would agree with the fact that a complete rebuild of these bucket indexes resolved the problem. Related issues are linked.
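
For anyone hitting something similar, a rough way to gauge whether an index is carrying leftover entries is to compare the raw index listing against what the bucket reports (illustrative; the bucket name is a placeholder):

# dump the raw bucket index entries
radosgw-admin bi list --bucket=<bucket-name> > bi-list.json
# compare the entry count against the object counts reported in the bucket stats
radosgw-admin bucket stats --bucket=<bucket-name>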

Actions #8

Updated by Cory Snyder 11 months ago

  • Related to Bug #58673: When bucket index ops are cancelled it can leave behind zombie index entries added
Actions #9

Updated by Cory Snyder 11 months ago

  • Related to Bug #59663: rgw: expired delete markers created by deleting non-existant object multiple times are not being removed from data pool after deletion from bucket added
Actions #10

Updated by Cory Snyder 11 months ago

  • Related to Bug #61359: Consistency bugs with OLH objects added