Bug #65436
open
Getting Object Crashing radosgw services
Added by Reid Guyett about 1 month ago.
Updated 5 days ago.
Description
Hello,
We are seeing crashes when users are trying to get a specific file.
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj get_obj_state: rctx=0x7f6fb47f6ac0 obj=<bucketname>:.cache/<112 characters>.jpg/<33 characters> state=0x7f698418d728 s->prefetch_data=1
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj get_obj_state: rctx=0x7f6fb47f6ac0 obj=<bucketname>:.cache/<112 characters>.jpg/<33 characters> state=0x7f698418d728 s->prefetch_data=1
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.idtag
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.manifest
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.idtag
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.info
2024-04-11T11:02:38.174+0000 7f6bbf6ee700 20 req 1201706618685104296 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.ver
2024-04-11T11:02:38.178+0000 7f6bbf6ee700 -1 *** Caught signal (Aborted) **
This is reproducible on this specific object:
$ s3cmd -c s3 get s3://<bucket>/.cache/<112 characters>.jpg/<33 characters>
download: 's3://<bucket>/.cache/<112 characters>.jpg/<33 characters>' -> './<33 characters>' [1 of 1]
ERROR: Error parsing xml: mismatched tag: line 6, column 2
ERROR: b'<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
ERROR: Download of './<33 characters>' failed (Reason: 502 (Bad Gateway))
ERROR: S3 error: 502 (Bad Gateway)
We are running:
rgw on 17.2.5
the rest of the cluster on 17.2.7
on Debian 11
- Status changed from New to Need More Info
After upgrading to 17.2.7, this bug was gone.
It sounds like this bug is fixed in a later point release; can you please try to upgrade? We can't do anything to fix 17.2.5 specifically.
Hello,
I was able to test in 17.2.7 and the rgw service is still crashing with the same error message.
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.idtag
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.manifest
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.idtag
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.info
2024-05-02T17:26:25.256+0000 7f399f7be700 20 req 6086159647010032067 0.000000000s s3:get_obj Read xattr rgw_rados: user.rgw.olh.ver
2024-05-02T17:26:25.260+0000 7f399f7be700 -1 *** Caught signal (Aborted) **
in thread 7f399f7be700 thread_name:radosgw
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f3adad98140]
2: gsignal()
3: abort()
4: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9a7ec) [0x7f3adac527ec]
5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5966) [0x7f3adac5d966]
6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa4a49) [0x7f3adac5ca49]
7: __gxx_personality_v0()
8: /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x1073f) [0x7f3adabae73f]
9: _Unwind_Resume()
10: /lib/libradosgw.so.2(+0x53cccf) [0x7f3adb2e3ccf]
11: /lib/libradosgw.so.2(+0x6388c6) [0x7f3adb3df8c6]
12: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xceed0) [0x7f3adac86ed0]
13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f3adad8cea7]
14: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Did you try with the broken file on the old bucket?
A broken file can't be fixed by upgrading Ceph; you need to delete the broken file, or all files in the bucket.
After upgrading, you can try creating a bucket, enabling bucket versioning, uploading a file, then suspending bucket versioning and uploading the file again. You must do it a few times; if no file comes out broken, it is OK. A sketch of this recipe follows below.
So the solution is to upgrade RGW, delete and recreate the bucket?
Since we do not own or control the data being uploaded by customers, I don't think it is really feasible. The RGW should return an HTTP error to the client instead of crashing the whole service.
What did you do to fix it at the proxy layer? Strip the parameters from the URL?
Reid Guyett wrote in #note-8:
What did you do to fix it at the proxy layer? Strip the parameters from the URL?
A CORS preflight is an OPTIONS request, so you can just catch all OPTIONS requests at the proxy and answer them with permissive CORS headers (sketch below).
I was able to reproduce this error on 17.2.7.
Using the [s3-tests](https://github.com/ceph/s3-tests/) test test_versioning_obj_suspended_copy, the RGW crashes every time.
S3TEST_CONF=s3tests-new.conf tox -- s3tests_boto3/functional/test_s3.py::test_versioning_obj_suspended_copy
<...>
FAILED s3tests_boto3/functional/test_s3.py::test_versioning_obj_suspended_copy - botocore.exceptions.ClientError: An error occurred (502) when calling the GetObject operation (reached max retries: 4): Bad Gateway
ERROR s3tests_boto3/functional/test_s3.py::test_versioning_obj_suspended_copy - botocore.exceptions.ClientError: An error occurred (502) when calling the ListBuckets operation (reached max retries: 4): Bad Gateway
The RGW logs show the same error when it crashes.
terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
what(): End of buffer
*** Caught signal (Aborted) **
in thread 7f1550070700 thread_name:radosgw
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)