Bug #23965

FAIL: s3tests.functional.test_s3.test_multipart_upload_resend_part with ec cache pools

Added by Casey Bodley almost 6 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

History

#1 Updated by Casey Bodley almost 6 years ago

  • Status changed from New to 12
  • Assignee set to Casey Bodley

#2 Updated by Casey Bodley almost 6 years ago

  • Status changed from 12 to Fix Under Review

https://github.com/ceph/ceph/pull/22126 removes ec-cache pools from the rgw suite

#3 Updated by Josh Durgin almost 6 years ago

Casey, could you or someone else familiar with rgw look through the logs for this and identify the relevant OSD requests on the client side? In particular, which writes would have the expected data at the end, and which ones contained the data that was actually read?

#4 Updated by Casey Bodley almost 6 years ago

Josh Durgin wrote:

Casey, could you or someone else familiar with rgw look through the logs for this and identify the relevant OSD requests on the client side? In particular, which writes would have the expected data at the end, and which ones contained the data that was actually read?

Hi Josh,

I dug through the radosgw log and learned a bit more about the failure. To start, here's the s3test itself:

def test_multipart_upload_resend_part():
    bucket = get_new_bucket()
    key = "mymultipart"
    objlen = 30 * 1024 * 1024

    # the final argument lists the part numbers that get uploaded twice
    _check_upload_multipart_resend(bucket, key, objlen, [0])
    _check_upload_multipart_resend(bucket, key, objlen, [1])
    _check_upload_multipart_resend(bucket, key, objlen, [2])
    _check_upload_multipart_resend(bucket, key, objlen, [1,2])
    _check_upload_multipart_resend(bucket, key, objlen, [0,1,2,3,4,5])

Each of these calls performs a separate multipart upload to the same bucket/key. The last call is the one that fails here, but not due to corruption of the object data itself: when radosgw reads the final multipart object back, it reads from rados objects that were written during the previous multipart upload rather than during the final one.

This list of rados objects is stored in the head object's manifest xattr (user.rgw.manifest), which is overwritten during the multipart complete operation. So I think it's this xattr overwrite that's the source of the issue.
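
For anyone reproducing this, the manifest xattr can be inspected directly with librados. A minimal C sketch, where the pool name and ceph.conf path are assumptions and the oid is the head object from the log excerpts below; the returned blob is an encoded RGWObjManifest, which ceph-dencoder should be able to dump:

#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    char buf[4096];
    int r;

    rados_create(&cluster, NULL);  /* connect with the default (admin) user */
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    if (rados_connect(cluster) < 0)
        return 1;

    /* pool name is an assumption; use the zone's bucket data pool */
    rados_ioctx_create(cluster, "default.rgw.buckets.data", &io);

    /* head object oid taken from the log excerpts below */
    r = rados_getxattr(io,
        "8a6be99d-bd98-4b04-813c-0cd2179c3c6f.4334.276_mymultipart",
        "user.rgw.manifest", buf, sizeof(buf));
    if (r >= 0)
        printf("user.rgw.manifest: %d bytes\n", r);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}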

The setxattr for the previous upload (the [1,2] call) happens in these osd ops. The first exclusive create fails with EEXIST, so we resend guarded by a cmpxattr on user.rgw.idtag:

2018-05-01 19:07:19.551 7f280280e700  1 -- 172.21.15.102:0/2878596714 --> 172.21.15.102:6813/12555 -- osd_op(unknown.0.0:19994 4.10 4:0810eaca:::8a6be99d-bd98-4b04-813c-0cd2179c3c6f.4334.276_mymultipart:head [create,setxattr user.rgw.idtag (47),setxattr user.rgw.tail_tag (47),setxattr user.rgw.manifest (560),setxattr user.rgw.acl (185),setxattr user.rgw.content_type (9),setxattr user.rgw.etag (34),setxattr user.rgw.pg_ver (8),setxattr user.rgw.source_zone (4),setxattr user.rgw.x-amz-meta-foo (4)] snapc 0=[] ondisk+write+known_if_redirected e40) v8 -- 0x55559f9d69c0 con 0
2018-05-01 19:07:19.555 7f2824c6b700  1 -- 172.21.15.102:0/2878596714 <== osd.3 172.21.15.102:6813/12555 3039 ==== osd_op_reply(19994 8a6be99d-bd98-4b04-813c-0cd2179c3c6f.4334.276_mymultipart [create,setxattr (47),setxattr (47),setxattr (560),setxattr (185),setxattr (9),setxattr (34),setxattr (8),setxattr (4),setxattr (4)] v40'20 uv19 ondisk = -17 ((17) File exists)) v8 ==== 579+0+0 (1654451763 0 0) 0x55559f9d69c0 con 0x55559f599100

2018-05-01 19:07:19.555 7f280280e700  1 -- 172.21.15.102:0/2878596714 --> 172.21.15.102:6813/12555 -- osd_op(unknown.0.0:19996 4.10 4:0810eaca:::8a6be99d-bd98-4b04-813c-0cd2179c3c6f.4334.276_mymultipart:head [cmpxattr user.rgw.idtag (47) op 1 mode 1,create,call rgw.obj_remove,setxattr user.rgw.idtag (47),setxattr user.rgw.tail_tag (47),setxattr user.rgw.manifest (560),setxattr user.rgw.acl (185),setxattr user.rgw.content_type (9),setxattr user.rgw.etag (34),setxattr user.rgw.pg_ver (8),setxattr user.rgw.source_zone (4),setxattr user.rgw.x-amz-meta-foo (4)] snapc 0=[] ondisk+write+known_if_redirected e40) v8 -- 0x55559f9d83c0 con 0
2018-05-01 19:07:19.559 7f2824c6b700  1 -- 172.21.15.102:0/2878596714 <== osd.3 172.21.15.102:6813/12555 3041 ==== osd_op_reply(19996 8a6be99d-bd98-4b04-813c-0cd2179c3c6f.4334.276_mymultipart [cmpxattr (47) op 1 mode 1,create,call,setxattr (47),setxattr (47),setxattr (560),setxattr (185),setxattr (9),setxattr (34),setxattr (8),setxattr (4),setxattr (4)] v40'21 uv21 ondisk = 0) v8 ==== 663+0+0 (3977230168 0 0) 0x55559f9d83c0 con 0x55559f599100
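
For context, this create-then-guarded-resend corresponds roughly to the following librados compound write op. This is a minimal C sketch of the pattern visible in the log, not rgw's actual code: the attribute values are placeholders, only two of the setxattrs are shown, and the real resend also includes a call to the rgw.obj_remove objclass method, omitted here:

#include <rados/librados.h>
#include <errno.h>
#include <string.h>

/* First attempt: exclusive create plus the xattrs. On EEXIST, resend
 * with a cmpxattr guard so the overwrite only lands if the head
 * object's user.rgw.idtag still matches the tag we expect. */
static int write_head_attrs(rados_ioctx_t io, const char *oid,
                            const char *idtag, const char *manifest,
                            size_t manifest_len, int guarded)
{
    rados_write_op_t op = rados_create_write_op();
    if (guarded)
        rados_write_op_cmpxattr(op, "user.rgw.idtag",
                                LIBRADOS_CMPXATTR_OP_EQ,
                                idtag, strlen(idtag));
    rados_write_op_create(op, guarded ? LIBRADOS_CREATE_IDEMPOTENT
                                      : LIBRADOS_CREATE_EXCLUSIVE, NULL);
    rados_write_op_setxattr(op, "user.rgw.idtag", idtag, strlen(idtag));
    rados_write_op_setxattr(op, "user.rgw.manifest", manifest, manifest_len);
    int r = rados_write_op_operate(op, io, oid, NULL, 0);
    rados_release_write_op(op);
    return r;
}

int set_head_attrs(rados_ioctx_t io, const char *oid, const char *idtag,
                   const char *manifest, size_t manifest_len)
{
    int r = write_head_attrs(io, oid, idtag, manifest, manifest_len, 0);
    if (r == -EEXIST)   /* the -17 reply above */
        r = write_head_attrs(io, oid, idtag, manifest, manifest_len, 1);
    return r;
}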

The final setxattr happens here. Its reply is delayed more than 5 seconds, and the exclusive create appears to succeed this time (note the uv0 in the reply below), even though the head object already exists from the previous upload:

2018-05-01 19:07:54.630 7f280280e700  1 -- 172.21.15.102:0/2878596714 --> 172.21.15.102:6813/12555 -- osd_op(unknown.0.0:20960 4.10 4:0810eaca:::8a6be99d-bd98-4b04-813c-0cd2179c3c6f.4334.276_mymultipart:head [create,setxattr user.rgw.idtag (47),setxattr user.rgw.tail_tag (47),setxattr user.rgw.manifest (779),setxattr user.rgw.acl (185),setxattr user.rgw.content_type (9),setxattr user.rgw.etag (34),setxattr user.rgw.pg_ver (8),setxattr user.rgw.source_zone (4),setxattr user.rgw.x-amz-meta-foo (4)] snapc 0=[] ondisk+write+known_if_redirected e40) v8 -- 0x55559fcf09c0 con 0

2018-05-01 19:08:00.290 7f2824c6b700  1 -- 172.21.15.102:0/2878596714 <== osd.3 172.21.15.102:6813/12555 3251 ==== osd_op_reply(20960 8a6be99d-bd98-4b04-813c-0cd2179c3c6f.4334.276_mymultipart [create,setxattr (47),setxattr (47),setxattr (779),setxattr (185),setxattr (9),setxattr (34),setxattr (8),setxattr (4),setxattr (4)] v40'27 uv0 ondisk = 0) v8 ==== 579+0+0 (785525316 0 0) 0x55559fcf0340 con 0x55559f599100
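
That uv0 suggests the create was treated as a brand-new object even though the head object was written at 19:07:19 (uv21 in the earlier reply). A client can observe the resulting version from the same ioctx; extending the sketch above:

#include <rados/librados.h>
#include <inttypes.h>
#include <stdio.h>

/* call after a successful rados_write_op_operate() on this ioctx */
void report_head_version(rados_ioctx_t io)
{
    /* version of the last object read or written via this ioctx */
    uint64_t ver = rados_get_last_version(io);
    /* an exclusive create that succeeds and reports a low/zero
     * version on an oid written moments earlier points at stale
     * object-existence state in the pool */
    printf("head object version after operate: %" PRIu64 "\n", ver);
}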

#5 Updated by Casey Bodley almost 6 years ago

  • Project changed from rgw to RADOS
  • Status changed from Fix Under Review to 12
  • Assignee deleted (Casey Bodley)
  • Priority changed from Urgent to Normal

https://github.com/ceph/ceph/pull/22126 merged to remove the failures from the rgw suite. Moving to the RADOS project.

#7 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
