Bug #57905

multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'

Added by Casey Bodley over 1 year ago. Updated 10 months ago.

Status: Pending Backport
Priority: Immediate
Assignee:
Target version: -
% Done: 0%
Source:
Tags: multisite backport_processed
Backport: reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

example from rgw/multisite suite: http://qa-proxy.ceph.com/teuthology/cbodley-2022-10-19_23:28:37-rgw-wip-cbodley-testing-distro-default-smithi/7075088/teuthology.log

the tcmalloc warnings make it look like we're decoding something, reading a garbage 'size' value that is far too large, and then failing to decode that many bytes (a simplified sketch of this failure mode follows the log excerpt below)

2022-10-20T05:29:45.277 DEBUG:tasks.util.rgw:rgwadmin: cmd=['adjust-ulimits', 'ceph-coverage', '/home/ubuntu/cephtest/archive/coverage', 'radosgw-admin', '--log-to-stderr', '--format', 'json', '-n', 'client.0', '--cluster', 'c1', 'bucket', 'sync', 'checkpoint', '--bucket', 'swwtcn-52', '--source-zone', 'a1', '--retry-delay-ms', '5000', '--timeout-sec', '300', '--rgw-zone', 'a2', '--rgw-zonegroup', 'a', '--rgw-realm', 'test-realm', '--cluster', 'c1', '--debug-rgw', '1', '--debug-ms', '0']
2022-10-20T05:29:45.277 DEBUG:teuthology.orchestra.run.smithi150:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster c1 bucket sync checkpoint --bucket swwtcn-52 --source-zone a1 --retry-delay-ms 5000 --timeout-sec 300 --rgw-zone a2 --rgw-zonegroup a --rgw-realm test-realm --cluster c1 --debug-rgw 1 --debug-ms 0
2022-10-20T05:29:45.336 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setuser ceph since I am not root
2022-10-20T05:29:45.337 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setgroup ceph since I am not root
2022-10-20T05:29:45.381 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:45.380+0000 7f59b3460780  1 waiting to reach incremental sync..
2022-10-20T05:29:47.652 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 13655506944 bytes == 0x560fc6c8c000 @  0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:47.657 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 9825697792 bytes == 0x5612f6420000 @  0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:50.382 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:50.381+0000 7f59b3460780  1 waiting to reach incremental sync..
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout:  what():  End of buffer [buffer:2]
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout:*** Caught signal (Aborted) **
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout: in thread 7f7cf11dc700 thread_name:radosgw
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: ceph version 18.0.0-564-g492571cb (492571cb93a9d1551a1968e5374657023093a0a8) reef (dev)
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f7e0596ccf0]
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 2: gsignal()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 3: abort()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 4: /lib64/libstdc++.so.6(+0x9009b) [0x7f7e050b309b]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 5: /lib64/libstdc++.so.6(+0x9653c) [0x7f7e050b953c]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 6: /lib64/libstdc++.so.6(+0x95559) [0x7f7e050b8559]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 7: __gxx_personality_v0()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f7e04a99b03]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 9: _Unwind_Resume()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 10: radosgw(+0x524ec4) [0x560fb9d5cec4]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 11: radosgw(+0x653ecd) [0x560fb9e8becd]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 12: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f7e050e5ba3]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 13: /lib64/libpthread.so.0(+0x81ca) [0x7f7e059621ca]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 14: clone()
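
To make the suspected failure mode concrete, here is a minimal, self-contained C++ sketch. The buffer_reader type is a simplified stand-in for ceph::buffer's iterator, not the actual Ceph code. On a little-endian host it reads a bogus length prefix, makes the kind of large allocation tcmalloc warned about, then runs off the end of the buffer:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// hypothetical minimal reader, modeled loosely on bufferlist::const_iterator
struct buffer_reader {
  const std::vector<uint8_t>& data;
  size_t off = 0;

  void copy(size_t len, void* dest) {
    if (off + len > data.size())
      // ceph::buffer throws buffer::end_of_buffer at the equivalent point
      throw std::runtime_error("End of buffer");
    std::memcpy(dest, data.data() + off, len);
    off += len;
  }
};

int main() {
  // six real bytes, but the first four -- read as a little-endian
  // length prefix -- decode to 16 MiB (the real crash decoded ~13 GB)
  std::vector<uint8_t> buf = {0x00, 0x00, 0x00, 0x01, 'h', 'i'};
  buffer_reader p{buf};

  uint32_t len = 0;
  p.copy(sizeof(len), &len);            // bogus length: 16777216
  std::cout << "decoded length: " << len << "\n";

  std::string s;
  s.resize(len);                        // the "large alloc" tcmalloc warns about
  try {
    p.copy(len, &s[0]);                 // reading past the end throws
  } catch (const std::exception& e) {
    // in radosgw the exception escaped a worker thread uncaught, so
    // std::terminate() ran, hence "terminate called ..." and the abort
    std::cout << "decode failed: " << e.what() << "\n";
  }
}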

Related issues 2 (1 open, 1 closed)

Related to rgw - Bug #63306: multisite: objects replicated with compress-encrypted store wrong (compressed) size in the bucket index (Pending Backport, Casey Bodley)

Copied to rgw - Backport #61888: reef: multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer' (Resolved, Casey Bodley)
#1

Updated by Casey Bodley over 1 year ago

  • Subject changed from multisite: crashes in 'bucket sync checkpoint' to multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'

oops, this wasn't a crash in the radosgw-admin process; it was in radosgw

#3

Updated by Casey Bodley about 1 year ago

  • Priority changed from Urgent to Immediate
  • Backport set to reef
#4

Updated by Casey Bodley about 1 year ago

  • Assignee set to Shilpa MJ
#5

Updated by Shilpa MJ about 1 year ago

I initially thought that the crash might be related to rgw compression and object encryption, so I ran the multisite qa suite, and test_encrypted_object_sync() in particular, with rgw_compression enabled, but failed to reproduce the crash. Will try again with other rgw config variables enabled.

#7

Updated by Casey Bodley 11 months ago

suggested retesting in teuthology with debuginfo packages installed

#8

Updated by Casey Bodley 11 months ago

  • Status changed from New to In Progress
#9

Updated by Shilpa MJ 11 months ago

more info from recent testing: http://qa-proxy.ceph.com/teuthology/smanjara-2023-05-26_14:54:55-rgw:multisite-wip-smanjara-debug-distro-default-smithi/7287193/teuthology.log

from teuthology logs:

2023-05-26T22:17:42.162 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: 2023-05-26T16:08:19.154+0000 7f00133b3700 -1 Decompression failed with exit code -1
2023-05-26T22:17:42.162 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: tcmalloc: large alloc 9825697792 bytes == 0x555d3c2ae000 @ 0x7f00f9893760 0x7f00f98b4c64 0x7effb8a76166 0x7effb8a74e93 0x555d2fee5d09 0x555d301b55c1 0x555d30195a56 0x555d301d7a56 0x555d301d82b8 0x555d2ff7db7c 0x555d2fd6f28c 0x555d2fd729a4 0x555d2fcae766 0x555d2fcaf434 0x555d30b0d3ff
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: what(): End of buffer [buffer:2]
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: *** Caught signal (Aborted) **
2023-05-26T22:17:42.164 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: in thread 7f00293df700 thread_name:radosgw

looking at the rgw logs, we seem to fail to decompress an encrypted object leading up to the crash:

2023-05-26T16:08:19.154+0000 7f00133b3700 10 Compression for rgw is enabled, decompress part bl_ofs=0, bl_len=9
2023-05-26T16:08:19.154+0000 7f00293df700 10 req 7489100291051346547 0.000999987s cache get: name=a1.rgw.log++script.getdata. : hit (negative entry)
2023-05-26T16:08:19.154+0000 7f00133b3700 1 ZlibCompressor: Decompression error: decompress return 3
2023-05-26T16:08:19.154+0000 7f00133b3700 -1 Decompression failed with exit code -1
2023-05-26T16:08:19.154+0000 7f00133b3700 0 req 2510031781517155540 0.000999987s s3:get_obj iterate_obj() failed with -1
2023-05-26T16:08:19.154+0000 7f00293df700 20 req 7489100291051346547 0.000999987s s3:get_obj get_obj_state: rctx=0x555d398a5ab0 obj=fdmmdm-52:testobj-sse-c state=0x555d387a99e8 s->prefetch_data=1
2023-05-26T16:08:19.154+0000 7f00293df700 10 Compression for rgw is enabled, decompress part bl_ofs=0, bl_len=24
...
2023-05-26T16:08:23.026+0000 7f00293df700 -1 *** Caught signal (Aborted) **
in thread 7f00293df700 thread_name:radosgw
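
The "Decompression error: decompress return 3" line is consistent with zlib being handed ciphertext rather than a deflate stream. A self-contained demonstration using plain zlib (not RGW's ZlibCompressor wrapper; the XOR here is just a stand-in for SSE-C encryption). Build with -lz:

#include <zlib.h>
#include <iostream>
#include <vector>

int main() {
  const char plaintext[] = "hello hello hello hello hello hello";

  // 1. compress the plaintext (what RGW does on write with compression enabled)
  std::vector<unsigned char> compressed(compressBound(sizeof(plaintext)));
  uLongf clen = compressed.size();
  compress(compressed.data(), &clen,
           reinterpret_cast<const Bytef*>(plaintext), sizeof(plaintext));

  // 2. "encrypt" the compressed bytes; a real cipher scrambles the deflate
  //    header and stream just as thoroughly as this XOR does
  for (uLongf i = 0; i < clen; ++i) compressed[i] ^= 0xA5;

  // 3. try to decompress the ciphertext, as the skip-decrypt GET path did
  std::vector<unsigned char> out(4096);
  uLongf olen = out.size();
  int ret = uncompress(out.data(), &olen, compressed.data(), clen);
  std::cout << "uncompress returned " << ret
            << (ret == Z_DATA_ERROR ? " (Z_DATA_ERROR)" : "") << "\n";
}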

#10

Updated by Casey Bodley 10 months ago

  • Assignee changed from Shilpa MJ to Casey Bodley

it sounds like this failing GET request is coming from fetch_remote_obj(), which uses the rgwx-skip-decrypt header to skip the decryption filter and transfer the object data in its encrypted form. however, we still try to decompress the data if it was both compressed and encrypted. but the compression was applied to the unencrypted data, so we can't decompress without decrypting first

as a solution, the rgwx-skip-decrypt header should also cause us to skip decompression of encrypted objects. then fetch_remote_obj() will need to preserve the original RGW_ATTR_COMPRESSION attribute instead of trying to recompress the data it stores
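
a minimal, hypothetical sketch of the condition described in the previous two paragraphs; the type and attribute names below are simplified stand-ins, and the actual fix (PR 52247) will differ in detail:

#include <iostream>
#include <map>
#include <string>

// stand-ins for the real RGW attribute keys and request state
static const std::string ATTR_CRYPT_MODE  = "user.rgw.crypt.mode";   // cf. RGW_ATTR_CRYPT_MODE
static const std::string ATTR_COMPRESSION = "user.rgw.compression";  // cf. RGW_ATTR_COMPRESSION

struct GetRequest {
  bool skip_decrypt = false;  // set when the peer sent rgwx-skip-decrypt
};

using Attrs = std::map<std::string, std::string>;

// decide which read-side filters apply; 'fixed' toggles the proposed behavior
void choose_filters(const GetRequest& req, const Attrs& attrs, bool fixed) {
  bool encrypted  = attrs.count(ATTR_CRYPT_MODE) > 0;
  bool compressed = attrs.count(ATTR_COMPRESSION) > 0;

  bool decrypt = encrypted && !req.skip_decrypt;

  // buggy: decompress whenever the compression attr is present.
  // fixed: if decryption was skipped on an encrypted object, the payload is
  // still ciphertext (compression was applied before encryption), so
  // decompression must be skipped too
  bool decompress = fixed ? (compressed && !(encrypted && req.skip_decrypt))
                          : compressed;

  std::cout << std::boolalpha << "decrypt=" << decrypt
            << " decompress=" << decompress << "\n";
}

int main() {
  Attrs attrs = {{ATTR_CRYPT_MODE, "SSE-C-AES256"}, {ATTR_COMPRESSION, "zlib"}};
  GetRequest replication;
  replication.skip_decrypt = true;

  choose_filters(replication, attrs, /*fixed=*/false);  // decompresses ciphertext -> crash
  choose_filters(replication, attrs, /*fixed=*/true);   // ships the raw bytes; the
  // destination must then preserve the original RGW_ATTR_COMPRESSION
  // attribute instead of recompressing
}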

#11

Updated by Casey Bodley 10 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 52247
#12

Updated by Casey Bodley 10 months ago

  • Status changed from Fix Under Review to Pending Backport
#13

Updated by Backport Bot 10 months ago

  • Copied to Backport #61888: reef: multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer' added
#14

Updated by Backport Bot 10 months ago

  • Tags changed from multisite to multisite backport_processed
#15

Updated by Josh Salomon 10 months ago

Please note that compression applied before encryption is considered a security risk (it can leak information about the plaintext, as in the CRIME/BREACH class of attacks). I would not implement this without a clear warning and explicit user approval.

#16

Updated by Casey Bodley 6 months ago

  • Related to Bug #63306: multisite: objects replicated with compress-encrypted store wrong (compressed) size in the bucket index added