Bug #65436

Getting Object Crashing radosgw services

Added by Reid Guyett about 1 month ago. Updated about 12 hours ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

We are seeing crashes when users are trying to get a specific file.

2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj get_obj_state: rctx=0x7f6fb47f6ac0 obj=<bucketname>:.cache/<112 characters>.jpg/<33 characters> state=0x7f698418d728 s->prefetch_data=1
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj get_obj_state: rctx=0x7f6fb47f6ac0 obj=<bucketname>:.cache/<112 characters>.jpg/<33 characters> state=0x7f698418d728 s->prefetch_data=1
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.idtag
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.manifest
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.idtag
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.info
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.ver
2024-04-11T11:02:38.178+0000 7f6bbf6ee700 -1 *** Caught signal (Aborted) **

This is reproducible on this specific object:

$ s3cmd -c s3 get s3://<bucket>/.cache/<112 characters>.jpg/<33 characters> 
download: 's3://<bucket>/.cache/<112 characters>.jpg/<33 characters>' -> './<33 characters>'  [1 of 1]
ERROR: Error parsing xml: mismatched tag: line 6, column 2
ERROR: b'<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
ERROR: Download of './<33 characters>' failed (Reason: 502 (Bad Gateway))
ERROR: S3 error: 502 (Bad Gateway)

We are running:
RGW on 17.2.5
the rest of the cluster on 17.2.7
on Debian 11


Files

radosgw crash details (scrubbed).log (15.7 KB), Full rgw log, Reid Guyett, 04/11/2024 05:10 PM
Actions #1

Updated by hoan nv about 1 month ago

I had the same issue. After some days I found bug https://tracker.ceph.com/issues/61359.
After upgrading to 17.2.7 the crash was gone, but I had to delete the broken object; I could not repair it.

Actions #2

Updated by Casey Bodley 29 days ago

  • Status changed from New to Need More Info

After upgrading to 17.2.7 the crash was gone

It sounds like this bug is fixed in a later point release; can you please try to upgrade? We can't do anything to fix 17.2.5 specifically.

Actions #3

Updated by Reid Guyett 15 days ago

Hello,

I was able to test in 17.2.7 and the rgw service is still crashing with the same error message.


2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.idtag
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.manifest
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.idtag
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.info
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.ver
2024-05-02T17:26:25.260+0000 7f399f7be700 -1 *** Caught signal (Aborted) **
 in thread 7f399f7be700 thread_name:radosgw

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f3adad98140]
 2: gsignal()
 3: abort()
 4: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9a7ec) [0x7f3adac527ec]
 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5966) [0x7f3adac5d966]
 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa4a49) [0x7f3adac5ca49]
 7: __gxx_personality_v0()
 8: /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x1073f) [0x7f3adabae73f]
 9: _Unwind_Resume()
 10: /lib/libradosgw.so.2(+0x53cccf) [0x7f3adb2e3ccf]
 11: /lib/libradosgw.so.2(+0x6388c6) [0x7f3adb3df8c6]
 12: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xceed0) [0x7f3adac86ed0]
 13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f3adad8cea7]
 14: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #4

Updated by hoan nv 14 days ago

Did you try with the broken object in the old bucket?
The broken object cannot be fixed by upgrading Ceph. You need to delete the broken object, or all objects in the bucket.

After upgrading, you can verify the fix: create a bucket, enable bucket versioning, upload a file, then disable bucket versioning and upload again. Repeat this a few times; if no object ends up broken, the upgrade worked.
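The verification loop described above can be sketched with boto3. This is a hedged illustration, not code from this ticket: the bucket name, key, endpoint, and round count are all assumptions.

```python
def versioning_toggle_cycle(s3, bucket, key="probe.txt", rounds=3):
    """Upload under enabled versioning, then under suspended versioning,
    a few times; a successful GET each round means the object is intact."""
    for i in range(rounds):
        # Enable bucket versioning and upload.
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )
        s3.put_object(Bucket=bucket, Key=key, Body=b"versioned %d" % i)

        # Suspend versioning and upload again (writes the "null" version).
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Suspended"},
        )
        s3.put_object(Bucket=bucket, Key=key, Body=b"suspended %d" % i)

        # On an affected RGW this GET is what aborts the daemon.
        s3.get_object(Bucket=bucket, Key=key)

if __name__ == "__main__":
    import boto3
    # Endpoint and credentials are placeholders; point at your own RGW.
    s3 = boto3.client("s3", endpoint_url="http://rgw.example.test:8080")
    s3.create_bucket(Bucket="versioning-probe")
    versioning_toggle_cycle(s3, "versioning-probe")
```

The client is passed in as a parameter so the sequence of calls can be exercised without a live endpoint.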

Actions #5

Updated by Reid Guyett 14 days ago

So the solution is to upgrade RGW, then delete and recreate the bucket?

Since we do not own or control the data being uploaded by customers, that is not really feasible. RGW should return an HTTP error to the client instead of crashing the whole service.

Actions #6

Updated by Reid Guyett 14 days ago

We are also blocked by https://tracker.ceph.com/issues/64308 in moving to 17.2.7.

Actions #7

Updated by hoan nv 13 days ago

Reid Guyett wrote in #note-6:

We are also blocked by https://tracker.ceph.com/issues/64308 in moving to 17.2.7.

I had the same issue. I had to work around it at the proxy layer.

Actions #8

Updated by Reid Guyett 11 days ago

What did you do to fix it at the proxy layer? Strip the parameters from the URL?

Actions #9

Updated by hoan nv 10 days ago

Reid Guyett wrote in #note-8:

What did you do to fix it at the proxy layer? Strip the parameters from the URL?

A CORS preflight is an OPTIONS request, so you can catch all OPTIONS requests at the proxy and answer them with permissive CORS headers.
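For an nginx front end, the proxy-layer workaround could look roughly like the fragment below. This is a hedged sketch: the upstream name and the allowed origins/methods/headers are illustrative assumptions, not the poster's actual configuration.

```nginx
location / {
    # Answer all CORS preflights at the proxy instead of passing
    # them through to radosgw. Values here are deliberately permissive;
    # tighten them for your deployment.
    if ($request_method = OPTIONS) {
        add_header Access-Control-Allow-Origin "*";
        add_header Access-Control-Allow-Methods "GET, PUT, POST, DELETE, HEAD, OPTIONS";
        add_header Access-Control-Allow-Headers "*";
        add_header Access-Control-Max-Age 3600;
        return 204;
    }
    proxy_pass http://radosgw_backend;  # hypothetical upstream name
}
```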

Actions #10

Updated by Reid Guyett about 12 hours ago

I was able to reproduce this error on 17.2.7.

Using the s3-tests suite (https://github.com/ceph/s3-tests/), specifically test_versioning_obj_suspended_copy, I am able to reproduce the RGW crash each time.

S3TEST_CONF=s3tests-new.conf tox -- s3tests_boto3/functional/test_s3.py::test_versioning_obj_suspended_copy

<...>
FAILED s3tests_boto3/functional/test_s3.py::test_versioning_obj_suspended_copy - botocore.exceptions.ClientError: An error occurred (502) when calling the GetObject operation (reached max retries: 4): Bad Gateway
ERROR s3tests_boto3/functional/test_s3.py::test_versioning_obj_suspended_copy - botocore.exceptions.ClientError: An error occurred (502) when calling the ListBuckets operation (reached max retries: 4): Bad Gateway

The RGW log shows the same error when it crashes.

terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
  what():  End of buffer
*** Caught signal (Aborted) **
 in thread 7f1550070700 thread_name:radosgw
 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
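For readers without the s3-tests harness, the crashing sequence can be approximated with plain boto3. This is my reading of what the scenario exercises, not the actual test code; the bucket, keys, and payloads are illustrative.

```python
def suspended_copy_scenario(s3, bucket):
    """Approximation of the suspended-versioning copy scenario."""
    # Upload a version while versioning is enabled.
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"})
    s3.put_object(Bucket=bucket, Key="obj", Body=b"version 1")

    # Suspend versioning and overwrite, creating a "null" version.
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Suspended"})
    s3.put_object(Bucket=bucket, Key="obj", Body=b"null version")

    # Server-side copy of the suspended object, then read the copy back.
    # On an affected radosgw this GET aborts the daemon, which the proxy
    # surfaces to the client as a 502.
    s3.copy_object(Bucket=bucket, Key="copy",
                   CopySource={"Bucket": bucket, "Key": "obj"})
    return s3.get_object(Bucket=bucket, Key="copy")
```

As above, the client is injected so the call sequence can be checked without a live RGW.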
