Bug #63245
rgw/s3select: crashes in test_progress_expressions in run_s3select_on_csv
Description
Crashes in the functional testing of s3select on main on 10/19/2023.
Teuthology results:
http://qa-proxy.ceph.com/teuthology/ivancich-2023-10-19_14:26:37-rgw-wip-eric-testing-1-distro-default-smithi/7432363/teuthology.log
Stack trace:
2023-10-19T15:17:33.828 INFO:tasks.rgw.client.0.smithi016.stdout:radosgw: ./src/s3select/include/s3select_csv_parser.h:315: char* CSVParser::next_line(): Assertion `data_begin < data_end' failed.
2023-10-19T15:17:33.828 INFO:tasks.rgw.client.0.smithi016.stdout:*** Caught signal (Aborted) **
2023-10-19T15:17:33.829 INFO:tasks.rgw.client.0.smithi016.stdout: in thread 7f1b72059640 thread_name:radosgw
2023-10-19T15:17:33.836 INFO:tasks.rgw.client.0.smithi016.stdout: ceph version 18.0.0-6773-g3760fae3 (3760fae306efe59523385b538dfa0e949242cb9c) reef (dev)
2023-10-19T15:17:33.836 INFO:tasks.rgw.client.0.smithi016.stdout: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f1be9b27520]
2023-10-19T15:17:33.836 INFO:tasks.rgw.client.0.smithi016.stdout: 2: pthread_kill()
2023-10-19T15:17:33.837 INFO:tasks.rgw.client.0.smithi016.stdout: 3: raise()
2023-10-19T15:17:33.837 INFO:tasks.rgw.client.0.smithi016.stdout: 4: abort()
2023-10-19T15:17:33.837 INFO:tasks.rgw.client.0.smithi016.stdout: 5: /lib/x86_64-linux-gnu/libc.so.6(+0x2871b) [0x7f1be9b0d71b]
2023-10-19T15:17:33.837 INFO:tasks.rgw.client.0.smithi016.stdout: 6: /lib/x86_64-linux-gnu/libc.so.6(+0x39e96) [0x7f1be9b1ee96]
2023-10-19T15:17:33.837 INFO:tasks.rgw.client.0.smithi016.stdout: 7: radosgw(+0x767301) [0x5634b0427301]
2023-10-19T15:17:33.838 INFO:tasks.rgw.client.0.smithi016.stdout: 8: radosgw(+0x769473) [0x5634b0429473]
2023-10-19T15:17:33.838 INFO:tasks.rgw.client.0.smithi016.stdout: 9: radosgw(+0x102e24c) [0x5634b0cee24c]
2023-10-19T15:17:33.838 INFO:tasks.rgw.client.0.smithi016.stdout: 10: radosgw(+0x76bc64) [0x5634b042bc64]
2023-10-19T15:17:33.838 INFO:tasks.rgw.client.0.smithi016.stdout: 11: (RGWSelectObj_ObjStore_S3::run_s3select_on_csv(char const*, char const*, unsigned long)+0x8b3) [0x5634b043c593]
2023-10-19T15:17:33.838 INFO:tasks.rgw.client.0.smithi016.stdout: 12: (RGWSelectObj_ObjStore_S3::csv_processing(ceph::buffer::v15_2_0::list&, long, long)+0x242) [0x5634b04518b2]
2023-10-19T15:17:33.839 INFO:tasks.rgw.client.0.smithi016.stdout: 13: (RGWGetObj_Decompress::handle_data(ceph::buffer::v15_2_0::list&, long, long)+0x267) [0x5634b0315737]
2023-10-19T15:17:33.839 INFO:tasks.rgw.client.0.smithi016.stdout: 14: (get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)+0x7b8) [0x5634b053bde8]
2023-10-19T15:17:33.839 INFO:tasks.rgw.client.0.smithi016.stdout: 15: (RGWRados::Object::Read::iterate(DoutPrefixProvider const*, long, long, RGWGetDataCB*, optional_yield)+0x2eb) [0x5634b053f85b]
2023-10-19T15:17:33.839 INFO:tasks.rgw.client.0.smithi016.stdout: 16: (RGWGetObj::execute(optional_yield)+0x1145) [0x5634b035da25]
2023-10-19T15:17:33.839 INFO:tasks.rgw.client.0.smithi016.stdout: 17: (RGWSelectObj_ObjStore_S3::execute(optional_yield)+0xc1) [0x5634b0450b71]
2023-10-19T15:17:33.840 INFO:tasks.rgw.client.0.smithi016.stdout: 18: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x9dc) [0x5634b0210cac]
2023-10-19T15:17:33.840 INFO:tasks.rgw.client.0.smithi016.stdout: 19: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x201e) [0x5634b02247ae]
Updated by Gal Salomon 7 months ago
investigating.
currently, this failure is not reproducible.
Updated by Casey Bodley 5 months ago
- Status changed from Can't reproduce to New
happened again in http://qa-proxy.ceph.com/teuthology/cbodley-2023-12-08_04:50:09-rgw-wip-rgw-sal-acl-owner-distro-default-smithi/7483734/teuthology.log
2023-12-08T19:50:30.381 INFO:tasks.rgw.client.0.smithi080.stdout:*** Caught signal (Segmentation fault) **
2023-12-08T19:50:30.381 INFO:tasks.rgw.client.0.smithi080.stdout: in thread 6dcbd640 thread_name:memcheck-amd64-
2023-12-08T19:50:30.435 INFO:tasks.rgw.client.0.smithi080.stdout: ceph version 19.0.0-42-gc2ece1ef (c2ece1efd0a18b9b4db4477c5aae273826a38988) reef (dev)
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 1: /lib64/libc.so.6(+0x54db0) [0x792ddb0]
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 2: _vgr20181ZZ_libcZdsoZa_memmove()
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 3: radosgw(+0x68c9a5) [0x7949a5]
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 4: radosgw(+0x6a371e) [0x7ab71e]
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 5: (RGWSelectObj_ObjStore_S3::run_s3select_on_csv(char const*, char const*, unsigned long)+0x7ba) [0x7b2e4a]
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 6: (RGWSelectObj_ObjStore_S3::csv_processing(ceph::buffer::v15_2_0::list&, long, long)+0x507) [0x7b5887]
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 7: (RGWGetObj_Decompress::handle_data(ceph::buffer::v15_2_0::list&, long, long)+0x3d6) [0x687676]
2023-12-08T19:50:30.436 INFO:tasks.rgw.client.0.smithi080.stdout: 8: (get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)+0x7b8) [0x8b3c68]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 9: (RGWRados::Object::Read::iterate(DoutPrefixProvider const*, long, long, RGWGetDataCB*, optional_yield)+0x2eb) [0x8b7e7b]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 10: (RGWGetObj::execute(optional_yield)+0x11cf) [0x6cf57f]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 11: (RGWSelectObj_ObjStore_S3::execute(optional_yield)+0xc1) [0x7b7ad1]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 12: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0xa6a) [0x582d7a]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 13: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0xf7d) [0x58693d]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 14: radosgw(+0xc65d70) [0xd6dd70]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 15: radosgw(+0x3cf746) [0x4d7746]
2023-12-08T19:50:30.437 INFO:tasks.rgw.client.0.smithi080.stdout: 16: make_fcontext()
notably, i'm seeing RGWGetObj_Decompress::handle_data in the backtraces, which shows that compression is enabled
Updated by Casey Bodley 4 months ago
- Priority changed from Normal to Urgent
same crash in 2 jobs from https://pulpito.ceph.com/cbodley-2024-01-04_20:15:15-rgw-wip-cbodley-testing-distro-default-smithi/
http://qa-proxy.ceph.com/teuthology/cbodley-2024-01-04_20:15:15-rgw-wip-cbodley-testing-distro-default-smithi/7507516/teuthology.log
http://qa-proxy.ceph.com/teuthology/cbodley-2024-01-04_20:15:15-rgw-wip-cbodley-testing-distro-default-smithi/7507519/teuthology.log
Updated by Gal Salomon 4 months ago
i was not able to reproduce this issue.
however, QE did discover a similar issue.
hopefully, it is the same one.
Updated by Casey Bodley 4 months ago
thanks Gal. are you enabling compression in your reproducer?
https://docs.ceph.com/en/latest/radosgw/compression/#configuration shows an example for zlib. you'd just need to restart radosgw after running that command
Updated by Gal Salomon 4 months ago
does the crash happen only when compression is used?
is the use of compression visible in the log?
if that is the case, it is probably a different issue from the QE one.
i will try to reproduce it as you described here.
Updated by Gal Salomon 4 months ago
following https://docs.ceph.com/en/latest/radosgw/compression/#configuration
the crash is not reproduced (actually, the object was not compressed).
with
vstart .... --rgw_compression zlib
the crash is reproduced.
it seems similar to the QE bug (very large CSV objects).
Updated by Casey Bodley 3 months ago
any updates here? would be nice to fix for squid
Updated by Gal Salomon 2 months ago
https://github.com/ceph/ceph/pull/55891
this PR removes the assert in the CSV parser and replaces it with an exception.
RGW will report the error and reject the request.
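The assert-to-exception change described above can be illustrated with a minimal sketch. This is not the code from the PR: `csv_parse_error`, `check_range`, and `process_chunk` are hypothetical names chosen for illustration, and only the `data_begin < data_end` condition is taken from the assertion message in the stack trace.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical error type; the real exception class used by the parser
// is not named in this thread.
struct csv_parse_error : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical helper: checks the same condition as the failed assertion,
// but throws instead of aborting the process.
inline void check_range(const char* data_begin, const char* data_end) {
  if (!(data_begin < data_end)) {
    throw csv_parse_error("csv parser: data_begin is not before data_end");
  }
}

// Hypothetical caller: turns the exception into an error code so the
// request can be rejected while the daemon keeps running.
inline int process_chunk(const char* begin, const char* end, std::string& err) {
  try {
    check_range(begin, end);
    // ... rows in [begin, end) would be parsed here ...
    return 0;
  } catch (const csv_parse_error& e) {
    err = e.what();
    return -1;
  }
}
```

With an assert, an invalid range aborts radosgw (the `Caught signal (Aborted)` in the first trace); with an exception, the caller can catch it and fail only the offending request.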
note:
while testing the PR i noticed the following.
when compression is enabled in RGW,
`len` and `bufferlist::it.length()` do not match (unlike the non-compressed case).
in the callback `send_response_data`, sometimes {len > it.length()}.
this may cause wrong pointer calculations in the csv-parser, followed by an assert (and a crash).
it seems that `it.length()` is the correct size; this needs to be verified.
Updated by Gal Salomon 2 months ago
ignore the previous comment.
the exception (previously an assert) was caused by a small chunk.
when a chunk is smaller than the row size, the pointer arithmetic goes wrong.
(the assumption was that a row may be split across at most 2 chunks)
Updated by Casey Bodley about 2 months ago
- Status changed from New to Fix Under Review
- Tags set to s3select
- Backport set to quincy reef squid
- Pull request ID set to 55891
Updated by Gal Salomon about 2 months ago
- Status changed from Fix Under Review to New
the PR fixes the too-small-chunk issue (the flow was changed to append these small chunks).
thus, when a compression setup produces too-small chunks, the chunks are aggregated until a complete row is available, and that row is then processed.
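The aggregation idea described above can be sketched as follows. This is an illustrative sketch only, not the implementation in the PR: `RowAggregator` and its methods are hypothetical names, and the sketch assumes a single-character row delimiter and ignores quoted fields that contain the delimiter.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical aggregator: buffers incoming chunks and hands out only
// complete rows, no matter how many chunks a single row spans.
class RowAggregator {
public:
  explicit RowAggregator(char row_delim = '\n') : delim_(row_delim) {}

  // Append one chunk and return every row completed so far.
  // A trailing partial row stays buffered until a later chunk completes it.
  std::vector<std::string> feed(std::string_view chunk) {
    pending_.append(chunk.data(), chunk.size());
    std::vector<std::string> rows;
    std::size_t start = 0;
    for (;;) {
      const std::size_t pos = pending_.find(delim_, start);
      if (pos == std::string::npos) break;
      rows.emplace_back(pending_, start, pos - start);
      start = pos + 1;
    }
    pending_.erase(0, start);  // keep only the incomplete tail
    return rows;
  }

  // Number of bytes of the incomplete row still waiting for more data.
  std::size_t pending_size() const { return pending_.size(); }

private:
  char delim_;
  std::string pending_;
};
```

For example, feeding the chunks `"1,2"`, `",3,"`, and `"4\n5,6\n7"` yields no rows for the first two calls and the rows `1,2,3,4` and `5,6` on the third, with `7` left pending. The first row spans three chunks, which is the case the earlier two-chunk assumption did not cover.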
Updated by Casey Bodley about 2 months ago
- Status changed from New to Pending Backport
Updated by Backport Bot about 2 months ago
- Copied to Backport #64692: quincy: rgw/s3select: crashes in test_progress_expressions in run_s3select_on_csv added
Updated by Backport Bot about 2 months ago
- Copied to Backport #64693: reef: rgw/s3select: crashes in test_progress_expressions in run_s3select_on_csv added
Updated by Backport Bot about 2 months ago
- Copied to Backport #64694: squid: rgw/s3select: crashes in test_progress_expressions in run_s3select_on_csv added
Updated by Backport Bot about 2 months ago
- Tags changed from s3select to s3select backport_processed