Bug #57905
multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
Description
example from rgw/multisite suite: http://qa-proxy.ceph.com/teuthology/cbodley-2022-10-19_23:28:37-rgw-wip-cbodley-testing-distro-default-smithi/7075088/teuthology.log
the tcmalloc warnings make it look like we're decoding something, getting a really big 'size', and failing to decode that many bytes
2022-10-20T05:29:45.277 DEBUG:tasks.util.rgw:rgwadmin: cmd=['adjust-ulimits', 'ceph-coverage', '/home/ubuntu/cephtest/archive/coverage', 'radosgw-admin', '--log-to-stderr', '--format', 'json', '-n', 'client.0', '--cluster', 'c1', 'bucket', 'sync', 'checkpoint', '--bucket', 'swwtcn-52', '--source-zone', 'a1', '--retry-delay-ms', '5000', '--timeout-sec', '300', '--rgw-zone', 'a2', '--rgw-zonegroup', 'a', '--rgw-realm', 'test-realm', '--cluster', 'c1', '--debug-rgw', '1', '--debug-ms', '0']
2022-10-20T05:29:45.277 DEBUG:teuthology.orchestra.run.smithi150:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster c1 bucket sync checkpoint --bucket swwtcn-52 --source-zone a1 --retry-delay-ms 5000 --timeout-sec 300 --rgw-zone a2 --rgw-zonegroup a --rgw-realm test-realm --cluster c1 --debug-rgw 1 --debug-ms 0
2022-10-20T05:29:45.336 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setuser ceph since I am not root
2022-10-20T05:29:45.337 INFO:teuthology.orchestra.run.smithi150.stderr:ignoring --setgroup ceph since I am not root
2022-10-20T05:29:45.381 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:45.380+0000 7f59b3460780 1 waiting to reach incremental sync..
2022-10-20T05:29:47.652 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 13655506944 bytes == 0x560fc6c8c000 @ 0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:47.657 INFO:tasks.rgw.c1.client.0.smithi150.stdout:tcmalloc: large alloc 9825697792 bytes == 0x5612f6420000 @ 0x7f7e06715760 0x7f7e06736c64 0x7f7cc5270166 0x7f7cc526ee93 0x560fba1059e9 0x560fba2170a4 0x560fba1f69a6 0x560fba2346c1 0x560fba234f44 0x560fba1c2525 0x560fb9f429f3 0x560fb9f443b7 0x560fb9e9ad96 0x560fb9e9b94a 0x560fbaca884f
2022-10-20T05:29:50.382 INFO:teuthology.orchestra.run.smithi150.stderr:2022-10-20T05:29:50.381+0000 7f59b3460780 1 waiting to reach incremental sync..
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
2022-10-20T05:29:51.336 INFO:tasks.rgw.c1.client.0.smithi150.stdout:  what(): End of buffer [buffer:2]
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout:*** Caught signal (Aborted) **
2022-10-20T05:29:51.337 INFO:tasks.rgw.c1.client.0.smithi150.stdout: in thread 7f7cf11dc700 thread_name:radosgw
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: ceph version 18.0.0-564-g492571cb (492571cb93a9d1551a1968e5374657023093a0a8) reef (dev)
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f7e0596ccf0]
2022-10-20T05:29:51.338 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 2: gsignal()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 3: abort()
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 4: /lib64/libstdc++.so.6(+0x9009b) [0x7f7e050b309b]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 5: /lib64/libstdc++.so.6(+0x9653c) [0x7f7e050b953c]
2022-10-20T05:29:51.339 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 6: /lib64/libstdc++.so.6(+0x95559) [0x7f7e050b8559]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 7: __gxx_personality_v0()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f7e04a99b03]
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 9: _Unwind_Resume()
2022-10-20T05:29:51.340 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 10: radosgw(+0x524ec4) [0x560fb9d5cec4]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 11: radosgw(+0x653ecd) [0x560fb9e8becd]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 12: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f7e050e5ba3]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 13: /lib64/libpthread.so.0(+0x81ca) [0x7f7e059621ca]
2022-10-20T05:29:51.341 INFO:tasks.rgw.c1.client.0.smithi150.stdout: 14: clone()
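To illustrate the decode-a-bogus-size theory above, here is a minimal toy sketch (not Ceph's actual ceph::buffer API; Reader, decode_string, and the end_of_buffer type here are simplified stand-ins): a length-prefixed decoder reads a corrupted length field, allocates that many bytes (which is what produces the tcmalloc "large alloc" warnings), and then throws end_of_buffer when the buffer doesn't actually contain that many bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for ceph::buffer's end_of_buffer exception.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("End of buffer") {}
};

// Toy stand-in for a buffer iterator with bounds checking.
struct Reader {
  const std::vector<uint8_t>& buf;
  size_t off = 0;
  void copy(void* dst, size_t len) {
    if (len > buf.size() - off) throw end_of_buffer{};  // not enough bytes left
    std::memcpy(dst, buf.data() + off, len);
    off += len;
  }
};

// Decode a length-prefixed string. A corrupted length field makes 'len'
// huge; the allocation of 'len' bytes happens first (the "large alloc"
// warning), then the bounds check in copy() throws end_of_buffer.
std::string decode_string(Reader& r) {
  uint32_t len;
  r.copy(&len, sizeof(len));
  std::string s(len, '\0');  // allocation of 'len' bytes happens here
  r.copy(s.data(), len);
  return s;
}
```

With a sane length the round trip succeeds; with a length far larger than the remaining payload, decode_string throws, and an uncaught throw of this kind on a radosgw thread would produce exactly the "terminate called after throwing" abort seen in the log.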
Updated by Casey Bodley over 1 year ago
- Subject changed from multisite: crashes in 'bucket sync checkpoint' to multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
oops, this wasn't a crash in the radosgw-admin process, it was in radosgw
Updated by Yuval Lifshitz about 1 year ago
similar backtrace is seen in this PR: https://github.com/ceph/ceph/pull/48709
see: http://qa-proxy.ceph.com/teuthology/yuvalif-2023-02-12_19:44:50-rgw-wip-yuval-tracing-port-distro-default-smithi/7170021/remote/smithi023/crash/
same test crashed in a different PR (https://github.com/ceph/ceph/pull/49335) with more meaningful backtrace:
http://qa-proxy.ceph.com/teuthology/yuvalif-2023-02-12_19:47:07-rgw-wip-yuval-fix-58167-distro-default-smithi/7170113/remote/smithi045/crash/
Updated by Casey Bodley about 1 year ago
- Priority changed from Urgent to Immediate
- Backport set to reef
Updated by Shilpa MJ about 1 year ago
I initially thought that the crash might be related to rgw compression and object encryption. So I ran the multisite qa suite, and test_encrypted_object_sync() in particular, with rgw compression enabled, but failed to reproduce the crash. Will try with other rgw config variables enabled.
Updated by Casey Bodley 11 months ago
suggested retesting in teuthology with debuginfo packages installed
Updated by Shilpa MJ 11 months ago
more info from recent testing. (http://qa-proxy.ceph.com/teuthology/smanjara-2023-05-26_14:54:55-rgw:multisite-wip-smanjara-debug-distro-default-smithi/7287193/teuthology.log)
from teuthology logs:
2023-05-26T22:17:42.162 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: 2023-05-26T16:08:19.154+0000 7f00133b3700 -1 Decompression failed with exit code -1
2023-05-26T22:17:42.162 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: tcmalloc: large alloc 9825697792 bytes == 0x555d3c2ae000 @ 0x7f00f9893760 0x7f00f98b4c64 0x7effb8a76166 0x7effb8a74e93 0x555d2fee5d09 0x555d301b55c1 0x555d30195a56 0x555d301d7a56 0x555d301d82b8 0x555d2ff7db7c 0x555d2fd6f28c 0x555d2fd729a4 0x555d2fcae766 0x555d2fcaf434 0x555d30b0d3ff
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: what(): End of buffer [buffer:2]
2023-05-26T22:17:42.163 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: *** Caught signal (Aborted) **
2023-05-26T22:17:42.164 INFO:tasks.rgw_multisite_tests:tasks.rgw.c1.client.0.smithi003.stdout: INFO: in thread 7f00293df700 thread_name:radosgw
looking at the rgw logs, we seem to fail to decompress an encrypted object leading up to the crash:
2023-05-26T16:08:19.154+0000 7f00133b3700 10 Compression for rgw is enabled, decompress part bl_ofs=0, bl_len=9
2023-05-26T16:08:19.154+0000 7f00293df700 10 req 7489100291051346547 0.000999987s cache get: name=a1.rgw.log++script.getdata. : hit (negative entry)
2023-05-26T16:08:19.154+0000 7f00133b3700 1 ZlibCompressor: Decompression error: decompress return 3>prefetch_data=1
2023-05-26T16:08:19.154+0000 7f00133b3700 -1 Decompression failed with exit code -1
2023-05-26T16:08:19.154+0000 7f00133b3700 0 req 2510031781517155540 0.000999987s s3:get_obj iterate_obj() failed with -1
2023-05-26T16:08:19.154+0000 7f00293df700 20 req 7489100291051346547 0.000999987s s3:get_obj get_obj_state: rctx=0x555d398a5ab0 obj=fdmmdm-52:testobj-sse-c state=0x555d387a99e8 s
2023-05-26T16:08:19.154+0000 7f00293df700 10 Compression for rgw is enabled, decompress part bl_ofs=0, bl_len=24
...
2023-05-26T16:08:23.026+0000 7f00293df700 -1 *** Caught signal (Aborted) **
in thread 7f00293df700 thread_name:radosgw
Updated by Casey Bodley 10 months ago
- Assignee changed from Shilpa MJ to Casey Bodley
it sounds like this failing GET request is coming from fetch_remote_obj(), which uses the rgwx-skip-decrypt header to skip the decryption filter and transfer the object data in its encrypted form. however, we still try to decompress the data if it was both compressed and encrypted. but the compression was applied to the unencrypted data, so we can't decompress without decrypting first
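The ordering problem described above can be shown with a deliberately toy sketch (run-length "compression" and XOR "encryption" here are illustrative stand-ins, not Ceph's zlib/SSE-C code paths): the object is compressed first and encrypted second, so feeding the still-encrypted bytes to the decompressor produces garbage, while decrypting first recovers the original.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy run-length encoder: emits (count, byte) pairs.
std::vector<uint8_t> rle_compress(const std::string& in) {
  std::vector<uint8_t> out;
  for (size_t i = 0; i < in.size();) {
    uint8_t c = in[i];
    size_t n = 1;
    while (i + n < in.size() && in[i + n] == (char)c && n < 255) ++n;
    out.push_back((uint8_t)n);
    out.push_back(c);
    i += n;
  }
  return out;
}

// Toy decoder; returns nullopt for structurally invalid input.
std::optional<std::string> rle_decompress(const std::vector<uint8_t>& in) {
  if (in.size() % 2 != 0) return std::nullopt;
  std::string out;
  for (size_t i = 0; i < in.size(); i += 2)
    out.append(in[i], (char)in[i + 1]);
  return out;
}

// Toy stand-in for encryption: XOR with a single-byte key (symmetric).
std::vector<uint8_t> xor_crypt(std::vector<uint8_t> data, uint8_t key) {
  for (auto& b : data) b ^= key;
  return data;
}
```

Decompressing the ciphertext either fails or yields bytes unrelated to the original object, which mirrors the "ZlibCompressor: Decompression error" followed by the crash in the rgw log: the decompressor is handed encrypted data it was never meant to see.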
as a solution, the rgwx-skip-decrypt header should also cause us to skip decompression of encrypted objects. then fetch_remote_obj() will need to preserve the original RGW_ATTR_COMPRESSION attribute instead of trying to recompress the data it stores
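The decision logic of the proposed fix can be sketched as two small predicates (a hypothetical illustration: FetchParams, should_decompress, and preserve_compression_attr are invented names, not the identifiers used in the actual PR; only rgwx-skip-decrypt and RGW_ATTR_COMPRESSION come from the discussion above):

```cpp
#include <cassert>

// Hypothetical summary of the sync-fetch state relevant to the fix.
struct FetchParams {
  bool skip_decrypt = false;      // rgwx-skip-decrypt header present
  bool object_encrypted = false;  // source object has an encryption attr
  bool object_compressed = false; // source object has RGW_ATTR_COMPRESSION
};

// Should the GET path run the decompression filter?
bool should_decompress(const FetchParams& p) {
  if (!p.object_compressed) return false;
  // compressed-then-encrypted data can only be decompressed after
  // decryption; if we skip decryption, we must skip decompression too
  if (p.skip_decrypt && p.object_encrypted) return false;
  return true;
}

// Should the target zone keep the source's compression attribute?
bool preserve_compression_attr(const FetchParams& p) {
  // the data arrives still compressed (and encrypted), so the original
  // RGW_ATTR_COMPRESSION metadata must be carried over unchanged rather
  // than regenerated by recompressing
  return p.object_compressed && p.skip_decrypt && p.object_encrypted;
}
```

The point of splitting the two predicates is that they must agree: whenever decompression is skipped during transfer, the original compression metadata has to survive so a later client read can decrypt and then decompress correctly.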
Updated by Casey Bodley 10 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 52247
Updated by Casey Bodley 10 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 10 months ago
- Copied to Backport #61888: reef: multisite: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer' added
Updated by Backport Bot 10 months ago
- Tags changed from multisite to multisite backport_processed
Updated by Josh Salomon 10 months ago
Please note that compression before encryption is considered a security risk. I would not implement this without a clear warning and specific user approval.
Updated by Casey Bodley 6 months ago
- Related to Bug #63306: multisite: objects replicated with compress-encrypted store wrong (compressed) size in the bucket index added