Bug #54363
Closed: segfault when Resharding occurs during LC
Description
On master, a segfault occurs when dynamic resharding is triggered during LC processing.
Repro:
- write 10,000,000 objects in a single bucket
- in terminal #1: apply LC policy to delete all the objects and instruct to process it
s3cmd setlifecycle ./lc-expiration.xml s3://b001b000000000000
sudo ./bin/radosgw-admin lc process --debug_rgw=20 &> lc_process.log
- in terminal #2: add resharding and instruct to process it
sudo ./bin/radosgw-admin reshard add --bucket=b001b000000000000 --num-shards=199
sudo ./bin/radosgw-admin reshard process
A segfault occurs in radosgw-admin while executing the lc process, and also in radosgw (if it is performing LC in parallel):
less ./out/radosgw.8000.log
...
-10> 2022-02-21T21:28:49.597+0200 7fffba0e7700 20 lifecycle: get_obj_state: rctx=0x7fffba0e6128 obj=b001b000000000000:o001o000001254189 state=0x5f119e8 s->prefetch_data=0
 -9> 2022-02-21T21:28:49.598+0200 7fffba0e7700 10 lifecycle: If-UnModified-Since: 2022-02-21T19:28:43.698805+0200 Last-Modified: 0.000000
 -8> 2022-02-21T21:28:49.598+0200 7fffba0e7700  1 lifecycle: ERROR: publishing notification failed, with error: -2
 -7> 2022-02-21T21:28:49.598+0200 7fffba0e7700  0 lifecycle: ERROR: remove_expired_obj :b001b000000000000[4da573f1-5c09-4e56-8196-088a61e2573f.4196.4]):o001o000001254189 (2) No such fi>
 -6> 2022-02-21T21:28:49.598+0200 7fffba0e7700  0 lifecycle: ERROR: remove_expired_obj :b001b000000000000[4da573f1-5c09-4e56-8196-088a61e2573f.4196.4]):o001o000001254189 (2) No such fi>
 -5> 2022-02-21T21:28:49.598+0200 7fffba0e7700 20 lifecycle: ERROR: orule.process() returned ret=-2 thread:wp_thrd: 3, 9
 -4> 2022-02-21T21:28:49.598+0200 7fffba0e7700 20 lifecycle: operator()(): key=o001o000001253282 wp_thrd: 3, 9
 -3> 2022-02-21T21:28:49.598+0200 7fffba0e7700 20 lifecycle: check(): key=o001o000001253282: is_expired=1 wp_thrd: 3, 9
 -2> 2022-02-21T21:28:49.598+0200 7fffba0e7700 20 lifecycle: get_obj_state: rctx=0x7fffba0e6128 obj=b001b000000000000:o001o000001253282 state=0x5f119e8 s->prefetch_data=0
 -1> 2022-02-21T21:28:49.598+0200 7fffba0e7700 10 lifecycle: If-UnModified-Since: 2022-02-21T19:28:43.319021+0200 Last-Modified: 0.000000
  0> 2022-02-21T21:28:49.598+0200 7fffb98e6700 -1 *** Caught signal (Aborted) **
 in thread 7fffb98e6700 thread_name:lifecycle_thr_3

 ceph version 17.0.0-10766-g88eb23e585d (88eb23e585da3fda0e43c96161eccf1b45ac3bf0) quincy (dev)
 1: /lib64/libpthread.so.0(+0x14a90) [0x7ffff5a21a90]
 2: gsignal()
 3: abort()
 4: /lib64/libstdc++.so.6(+0x9e941) [0x7ffff5863941]
 5: /lib64/libstdc++.so.6(+0xaa32c) [0x7ffff586f32c]
 6: /lib64/libstdc++.so.6(+0xaa397) [0x7ffff586f397]
 7: /lib64/libstdc++.so.6(+0xaa649) [0x7ffff586f649]
 8: (std::__throw_logic_error(char const*)+0x41) [0x7ffff58662a0]
 9: (LCOpRule::update()+0x314) [0x7ffff76a5914]
 10: (RGWLC::bucket_lc_process(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, RGWLC::LCWorker*, long, bool)+0x127f) [0x7ffff76a777f]
 11: (RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x126e) [0x7ffff76ac0ee]
 12: (RGWLC::process(RGWLC::LCWorker*, std::unique_ptr<rgw::sal::Bucket, std::default_delete<rgw::sal::Bucket> > const&, bool)+0x516) [0x7ffff76a3096]
 13: (RGWLC::LCWorker::entry()+0x1f3) [0x7ffff76a22b3]
 14: (Thread::entry_wrapper()+0xaa) [0x7ffff60fba6a]
 15: /lib64/libpthread.so.0(+0x9432) [0x7ffff5a16432]
 16: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Mark Kogan about 2 years ago
- Subject changed from segfault when Resharding occures during LC to segfault when Resharding occurs during LC
Updated by Matt Benjamin about 2 years ago
I believe this is the matching ceph backtrace:
backtrace:
[root@extensa022 ~]# ceph crash info 2022-02-20T12:10:34.608728Z_fc777516-b76d-4dd6-be2c-416701b18556
{
    "crash_id": "2022-02-20T12:10:34.608728Z_fc777516-b76d-4dd6-be2c-416701b18556",
    "timestamp": "2022-02-20T12:10:34.608728Z",
    "process_name": "radosgw",
    "entity_name": "client.rgw.extensa022.rgw2",
    "ceph_version": "14.2.22-67.el8cp",
    "utsname_hostname": "extensa022.ceph.redhat.com",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-348.2.1.el8_5.x86_64",
    "utsname_version": "#1 SMP Mon Nov 8 13:30:15 EST 2021",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.5",
    "os_version": "8.5 (Ootpa)",
    "backtrace": [
        "(()+0x12c20) [0x7fadec6abc20]",
        "(gsignal()+0x10f) [0x7fadeb9dc37f]",
        "(abort()+0x127) [0x7fadeb9c6db5]",
        "(()+0x9009b) [0x7fadec39409b]",
        "(()+0x9653c) [0x7fadec39a53c]",
        "(()+0x96597) [0x7fadec39a597]",
        "(()+0x967f8) [0x7fadec39a7f8]",
        "(()+0x19e94) [0x7fadf8b05e94]",
        "(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) [0x7fadf8b27326]",
        "(()+0x6689ca) [0x558bdc2f09ca]",
        "(LCOpRule::update()+0x82) [0x558bdc2f2612]",
        "(RGWLC::bucket_lc_process(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, RGWLC::LCWorker*, long)+0x15d8) [0x558bdc2f8e28]",
        "(RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x750) [0x558bdc2fc480]",
        "(RGWLC::process(RGWLC::LCWorker*, rgw_bucket const*, bool)+0x20d) [0x558bdc2fd52d]",
        "(RGWLC::LCWorker::entry()+0x196) [0x558bdc2fd756]",
        "(()+0x817a) [0x7fadec6a117a]",
        "(clone()+0x43) [0x7fadebaa1dc3]"
    ]
}
Updated by Mark Kogan about 2 years ago
(copying from https://bugzilla.redhat.com/show_bug.cgi?id=2056906#c10)
From inspecting the journalctl log below: tcmalloc reports that it is trying to perform a very large memory allocation (94059315200000 bytes, roughly 94 TB):
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]: tcmalloc: large alloc 94059315200000 bytes == (nil) @ 0x7fadf8b06650 0x7fadf8b272f4 0x558bdc2f09ca 0x558bdc2f2612 0x558bdc2f8e28 0x558bdc2fc480 0x558bdc2fd52d 0x558bdc2fd756 0x7fadec6a117a 0x7fadebaa1dc3
journalctl
...
Feb 20 12:06:21 extensa022.ceph.redhat.com radosgw[318347]: 2022-02-20 12:06:21.795 7fad94578700 -1 WARNING: The bucket info cache is inconsistent. This is a failure that should be debugged. I am a nice machine, so I will try to recover.
Feb 20 12:06:21 extensa022.ceph.redhat.com radosgw[318347]: 2022-02-20 12:06:21.796 7fad94578700 -1 WARNING: The OSD has the same version I have. Something may have gone squirrelly. An administrator may have forced a change; otherwise there is a problem somewhere.
Feb 20 12:06:21 extensa022.ceph.redhat.com radosgw[318347]: 2022-02-20 12:06:21.811 7fad94578700 -1 WARNING: The OSD has the same version I have. Something may have gone squirrelly. An administrator may have forced a change; otherwise there is a problem somewhere.
Feb 20 12:06:26 extensa022.ceph.redhat.com radosgw[318347]: 2022-02-20 12:06:26.790 7fad96d7d700 -1 WARNING: The OSD has the same version I have. Something may have gone squirrelly. An administrator may have forced a change; otherwise there is a problem somewhere.
Feb 20 12:10:20 extensa022.ceph.redhat.com systemd[1]: Starting system activity accounting tool...
Feb 20 12:10:20 extensa022.ceph.redhat.com systemd[1]: sysstat-collect.service: Succeeded.
Feb 20 12:10:20 extensa022.ceph.redhat.com systemd[1]: Started system activity accounting tool.
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]: tcmalloc: large alloc 94059315200000 bytes == (nil) @ 0x7fadf8b06650 0x7fadf8b272f4 0x558bdc2f09ca 0x558bdc2f2612 0x558bdc2f8e28 0x558bdc2fc480 0x558bdc2fd52d 0x558bdc2fd756 0x7fadec6a117a 0x7fadebaa1dc3
                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]: terminate called after throwing an instance of 'std::bad_alloc'
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:   what():  std::bad_alloc
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]: *** Caught signal (Aborted) **
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  in thread 7fada3d97700 thread_name:lifecycle_thr_4
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  ceph version 14.2.22-67.el8cp (73c7168dd308ed246bde4ccac7f60234281252d0) nautilus (stable)
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  1: (()+0x12c20) [0x7fadec6abc20]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  2: (gsignal()+0x10f) [0x7fadeb9dc37f]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  3: (abort()+0x127) [0x7fadeb9c6db5]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  4: (()+0x9009b) [0x7fadec39409b]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  5: (()+0x9653c) [0x7fadec39a53c]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  6: (()+0x96597) [0x7fadec39a597]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  7: (()+0x967f8) [0x7fadec39a7f8]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  8: (()+0x19e94) [0x7fadf8b05e94]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) [0x7fadf8b27326]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  10: (()+0x6689ca) [0x558bdc2f09ca]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  11: (LCOpRule::update()+0x82) [0x558bdc2f2612]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  12: (RGWLC::bucket_lc_process(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, RGWLC::LCWorker*, long)+0x15d8) [0x558bdc2f8e28]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  13: (RGWLC::process(int, int, RGWLC::LCWorker*, bool)+0x750) [0x558bdc2fc480]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  14: (RGWLC::process(RGWLC::LCWorker*, rgw_bucket const*, bool)+0x20d) [0x558bdc2fd52d]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  15: (RGWLC::LCWorker::entry()+0x196) [0x558bdc2fd756]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  16: (()+0x817a) [0x7fadec6a117a]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]:  17: (clone()+0x43) [0x7fadebaa1dc3]
Feb 20 12:10:34 extensa022.ceph.redhat.com radosgw[318347]: 2022-02-20 12:10:34.607 7fada3d97700 -1 *** Caught signal (Aborted) **
Updated by Mark Kogan about 2 years ago
Quick update: cause found.
Below, when an error condition occurs, true is returned instead of false, because the negative errno value is implicitly cast to bool.
src/rgw/rgw_lc.cc

  bool get_obj(const DoutPrefixProvider *dpp, rgw_bucket_dir_entry **obj,
  ^^^^
               std::function<void(void)> fetch_barrier = []() { /* nada */ }) {
    if (obj_iter == list_results.objs.end()) {
      if (!list_results.is_truncated) {
        delay();
        return false;
      } else {
        fetch_barrier();
        list_params.marker = pre_obj.key;
        int ret = fetch(dpp);
        if (ret < 0) {
          ldpp_dout(dpp, 0) << "ERROR: list_op returned ret=" << ret << dendl;
          return ret;
          ^^^^^^^^^^^
        }
      }
      delay();
    }
    /* returning address of entry in objs */
    *obj = &(*obj_iter);
    return obj_iter != list_results.objs.end();
  }
Updated by Casey Bodley about 2 years ago
- Status changed from In Progress to Fix Under Review
- Tags set to lifecycle
- Backport set to octopus pacific quincy
Updated by Mark Kogan about 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 2 years ago
- Copied to Backport #54967: quincy: segfault when Resharding occurs during LC added
Updated by Backport Bot about 2 years ago
- Copied to Backport #54968: pacific: segfault when Resharding occurs during LC added
Updated by Backport Bot about 2 years ago
- Copied to Backport #54969: octopus: segfault when Resharding occurs during LC added
Updated by Backport Bot over 1 year ago
- Tags changed from lifecycle to lifecycle backport_processed
Updated by Konstantin Shalygin over 1 year ago
- Status changed from Pending Backport to Resolved
- Tags changed from lifecycle backport_processed to lifecycle