Bug #57905
multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
Description
example from rgw/multisite suite: http://qa-proxy.ceph.com/teuthology/cbodley-2022-10-19_23:28:37-rgw-wip-cbodley-testing-distro-default-smithi/7075088/teuthology.log
the tcmalloc warnings make it look like we're decoding something, getting a really big 'size', and failing to decode that many bytes
2022-10-20T05:29:45.277 DEBUG:tasks.util.rgw:rgwadmin: cmd=['adjust-ulimits', 'ceph-coverage', '/home/ubuntu/cephtest/archive/coverage', 'radosgw-admin', '--log-to-stderr', '--format', 'json', '-n', 'client.0', '--cluster', 'c1', 'bucket', 'sync', 'checkpoint', '--bucket', 'swwtcn-52', '--source-zone', 'a1', '--retry-delay-ms', '5000', '--timeout-sec', '300', '--rgw-zone', 'a2', '--rgw-zonegroup', 'a', '--rgw-realm', 'test-realm', '--cluster', 'c1', '--debug-rgw', '1', '--debug-ms', '0']
2022-10-20T05:29:45.277 DEBUG:teuthology.orchestra.run.smithi150:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster c1 bucket sync checkpoint --bucket swwtcn-52 --source-zone a1 --retry-delay-ms 5000 --timeout-sec 300 --rgw-zone a2 --rgw-zonegroup a --rgw-realm test-realm --cluster c1 --debug-rgw 1 --debug-ms 0
2022-10-20T05:29:45.336 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setuser ceph since I am not root
2022-10-20T05:29:45.337 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setgroup ceph since I am not root
2022-10-20T05:29:45.381 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:45.380+0000 7f59b3460780 1 waiting to reach incremental sync..
2022-10-20T05:29:47.652 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 13655506944 bytes == 0x560fc6c8c000 @ 0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:47.657 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 9825697792 bytes == 0x5612f6420000 @ 0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:50.382 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:50.381+0000 7f59b3460780 1 waiting to reach incremental sync..
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout:  what(): End of buffer [buffer:2]
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout:*** Caught signal (Aborted) **
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout: in thread 7f7cf11dc700 thread_name:radosgw
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: ceph version 18.0.0-564-g492571cb (492571cb93a9d1551a1968e5374657023093a0a8) reef (dev)
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f7e0596ccf0]
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 2: gsignal()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 3: abort()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 4: /lib64/libstdc++.so.6(+0x9009b) [0x7f7e050b309b]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 5: /lib64/libstdc++.so.6(+0x9653c) [0x7f7e050b953c]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 6: /lib64/libstdc++.so.6(+0x95559) [0x7f7e050b8559]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 7: __gxx_personality_v0()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f7e04a99b03]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 9: _Unwind_Resume()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 10: radosgw(+0x524ec4) [0x560fb9d5cec4]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 11: radosgw(+0x653ecd) [0x560fb9e8becd]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 12: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f7e050e5ba3]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 13: /lib64/libpthread.so.0(+0x81ca) [0x7f7e059621ca]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 14: clone()
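To illustrate the decode-a-bogus-size theory above, here is a minimal toy sketch (not Ceph's actual ceph::buffer API; Reader, decode_string, and the end_of_buffer type here are simplified stand-ins): a length-prefixed decoder reads a corrupted length field, allocates that many bytes (which is what produces the tcmalloc "large alloc" warnings), and then throws end_of_buffer when the buffer doesn't actually contain that many bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for ceph::buffer's end_of_buffer exception.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("End of buffer") {}
};

// Toy stand-in for a buffer iterator with bounds checking.
struct Reader {
  const std::vector<uint8_t>& buf;
  size_t off = 0;
  void copy(void* dst, size_t len) {
    if (len > buf.size() - off) throw end_of_buffer{};  // not enough bytes left
    std::memcpy(dst, buf.data() + off, len);
    off += len;
  }
};

// Decode a length-prefixed string. A corrupted length field makes 'len'
// huge; the allocation of 'len' bytes happens first (the "large alloc"
// warning), then the bounds check in copy() throws end_of_buffer.
std::string decode_string(Reader& r) {
  uint32_t len;
  r.copy(&len, sizeof(len));
  std::string s(len, '\0');  // allocation of 'len' bytes happens here
  r.copy(s.data(), len);
  return s;
}
```

With a sane length the round trip succeeds; with a length far larger than the remaining payload, decode_string throws, and an uncaught throw of this kind on a radosgw thread would produce exactly the "terminate called after throwing" abort seen in the log.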
Updated by Casey Bodley over 1 year ago
- Subject changed from multisite: crashes in 'bucket sync checkpoint' to multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
oops, this wasn't a crash in the radosgw-admin process, it was in radosgw
Updated by Yuval Lifshitz about 1 year ago
similar backtrace is seen in this PR: https://github.com/ceph/ceph/pull/48709
see: http://qa-proxy.ceph.com/teuthology/yuvalif-2023-02-12_19:44:50-rgw-wip-yuval-tracing-port-distro-default-smithi/7170021/remote/smithi023/crash/
same test crashed in a different PR (https://github.com/ceph/ceph/pull/49335) with more meaningful backtrace:
http://qa-proxy.ceph.com/teuthology/yuvalif-2023-02-12_19:47:07-rgw-wip-yuval-fix-58167-distro-default-smithi/7170113/remote/smithi045/crash/
Updated by Casey Bodley about 1 year ago
- Priority changed from Urgent to Immediate
- Backport set to reef
Updated by Shilpa MJ about 1 year ago
I initially thought that the crash might be related to rgw compression and object encryption. So I ran the multisite qa suite, and test_encrypted_object_sync() in particular, with rgw compression enabled, but failed to reproduce the crash. Will try with other rgw config variables enabled.
Updated by Casey Bodley 11 months ago
suggested retesting in teuthology with debuginfo packages installed
Updated by Shilpa MJ 11 months ago
more info from recent testing. (http://qa-proxy.ceph.com/teuthology/smanjara-2023-05-26_14:54:55-rgw:multisite-wip-smanjara-debug-distro-default-smithi/7287193/teuthology.log)
from teuthology logs:
2023-05-26T22:17:42.162 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: 2023-05-26T16:08:19.154+0000 7f00133b3700 -1 Decompression failed with exit code -1
2023-05-26T22:17:42.162 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: tcmalloc: large alloc 9825697792 bytes == 0x555d3c2ae000 @ 0x7f00f9893760 0x7f00f98b4c64 0x7effb8a76166 0x7effb8a74e93 0x555d2fee5d09 0x555d301b55c1 0x555d30195a56 0x555d301d7a56 0x555d301d82b8 0x555d2ff7db7c 0x555d2fd6f28c 0x555d2fd729a4 0x555d2fcae766 0x555d2fcaf434 0x555d30b0d3ff
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: what(): End of buffer [buffer:2]
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: *** Caught signal (Aborted) **
2023-05-26T22:17:42.164 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: in thread 7f00293df700 thread_name:radosgw
looking at the rgw logs, we seem to fail to decompress an encrypted object leading up to the crash:
2023-05-26T16:08:19.154+0000 7f00133b3700 10 Compression for rgw is enabled, decompress part bl_ofs=0, bl_len=9
2023-05-26T16:08:19.154+0000 7f00293df700 10 req 7489100291051346547 0.000999987s cache get: name=a1.rgw.log++script.getdata. : hit (negative entry)
2023-05-26T16:08:19.154+0000 7f00133b3700 1 ZlibCompressor: Decompression error: decompress return 3>prefetch_data=1
2023-05-26T16:08:19.154+0000 7f00133b3700 -1 Decompression failed with exit code -1
2023-05-26T16:08:19.154+0000 7f00133b3700 0 req 2510031781517155540 0.000999987s s3:get_obj iterate_obj() failed with -1
2023-05-26T16:08:19.154+0000 7f00293df700 20 req 7489100291051346547 0.000999987s s3:get_obj get_obj_state: rctx=0x555d398a5ab0 obj=fdmmdm-52:testobj-sse-c state=0x555d387a99e8 s
2023-05-26T16:08:19.154+0000 7f00293df700 10 Compression for rgw is enabled, decompress part bl_ofs=0, bl_len=24
...
2023-05-26T16:08:23.026+0000 7f00293df700 -1 *** Caught signal (Aborted) **
in thread 7f00293df700 thread_name:radosgw
Updated by Casey Bodley 10 months ago
- Assignee changed from Shilpa MJ to Casey Bodley
it sounds like this failing GET request is coming from fetch_remote_obj(), which uses the rgwx-skip-decrypt header to skip the decryption filter and transfer the object data in its encrypted form. however, we still try to decompress the data if it was both compressed and encrypted. but the compression was applied to the unencrypted data, so we can't decompress without decrypting first
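The ordering problem described above can be shown with a deliberately toy sketch (run-length "compression" and XOR "encryption" here are illustrative stand-ins, not Ceph's zlib/SSE-C code paths): the object is compressed first and encrypted second, so feeding the still-encrypted bytes to the decompressor produces garbage, while decrypting first recovers the original.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy run-length encoder: emits (count, byte) pairs.
std::vector<uint8_t> rle_compress(const std::string& in) {
  std::vector<uint8_t> out;
  for (size_t i = 0; i < in.size();) {
    uint8_t c = in[i];
    size_t n = 1;
    while (i + n < in.size() && in[i + n] == (char)c && n < 255) ++n;
    out.push_back((uint8_t)n);
    out.push_back(c);
    i += n;
  }
  return out;
}

// Toy decoder; returns nullopt for structurally invalid input.
std::optional<std::string> rle_decompress(const std::vector<uint8_t>& in) {
  if (in.size() % 2 != 0) return std::nullopt;
  std::string out;
  for (size_t i = 0; i < in.size(); i += 2)
    out.append(in[i], (char)in[i + 1]);
  return out;
}

// Toy stand-in for encryption: XOR with a single-byte key (symmetric).
std::vector<uint8_t> xor_crypt(std::vector<uint8_t> data, uint8_t key) {
  for (auto& b : data) b ^= key;
  return data;
}
```

Decompressing the ciphertext either fails or yields bytes unrelated to the original object, which mirrors the "ZlibCompressor: Decompression error" followed by the crash in the rgw log: the decompressor is handed encrypted data it was never meant to see.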
as a solution, the rgwx-skip-decrypt header should also cause us to skip decompression of encrypted objects. then fetch_remote_obj() will need to preserve the original RGW_ATTR_COMPRESSION attribute instead of trying to recompress the data it stores
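The decision logic of the proposed fix can be sketched as two small predicates (a hypothetical illustration: FetchParams, should_decompress, and preserve_compression_attr are invented names, not the identifiers used in the actual PR; only rgwx-skip-decrypt and RGW_ATTR_COMPRESSION come from the discussion above):

```cpp
#include <cassert>

// Hypothetical summary of the sync-fetch state relevant to the fix.
struct FetchParams {
  bool skip_decrypt = false;      // rgwx-skip-decrypt header present
  bool object_encrypted = false;  // source object has an encryption attr
  bool object_compressed = false; // source object has RGW_ATTR_COMPRESSION
};

// Should the GET path run the decompression filter?
bool should_decompress(const FetchParams& p) {
  if (!p.object_compressed) return false;
  // compressed-then-encrypted data can only be decompressed after
  // decryption; if we skip decryption, we must skip decompression too
  if (p.skip_decrypt && p.object_encrypted) return false;
  return true;
}

// Should the target zone keep the source's compression attribute?
bool preserve_compression_attr(const FetchParams& p) {
  // the data arrives still compressed (and encrypted), so the original
  // RGW_ATTR_COMPRESSION metadata must be carried over unchanged rather
  // than regenerated by recompressing
  return p.object_compressed && p.skip_decrypt && p.object_encrypted;
}
```

The point of splitting the two predicates is that they must agree: whenever decompression is skipped during transfer, the original compression metadata has to survive so a later client read can decrypt and then decompress correctly.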
Updated by Casey Bodley 10 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 52247
Updated by Casey Bodley 10 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 10 months ago
- Copied to Backport #61888: reef: multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer' added
Updated by Backport Bot 10 months ago
- Tags changed from multisite to multisite backport_processed
Updated by Josh Salomon 10 months ago
Please note that compression before encryption is considered a security risk. I would not implement this without a clear warning and specific user approval.
Updated by Casey Bodley 6 months ago
- Related to Bug #63306: multisite: objects replicated with compress-encrypted store wrong (compressed) size in the bucket index added