Bug #35543
multisite: segfault on shutdown/realm reload
Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%
Source:
Tags:
multisite
Backport:
luminous mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
I've been seeing a lot of these segfaults in multisite tests. For example, http://qa-proxy.ceph.com/teuthology/cbodley-2018-09-04_16:22:33-rgw:multisite-wip-rgw-sync-trace-cleanup-distro-basic-smithi/2979041/teuthology.log
-229> 2018-09-04 18:23:48.596 35e6e700 5 data sync: Sync:e2fa9a6e:data:Data:all:finish
-228> 2018-09-04 18:23:48.596 35e6e700 0 data sync: ERROR: failed to run sync
...
-26> 2018-09-04 18:23:49.566 2f04e700 20 RGWWQ:
-25> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f066580
-24> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f421fb0
-23> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f45e090
-22> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x16480b50
-21> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1442a9150
-20> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f2abf10
-19> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x3ae97c90
-18> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1441d7470
-17> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x145b1a3e0
-16> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x16407d60
-15> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1a77aec0
-14> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1a6f0780
-13> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1649f770
-12> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x15017560
-11> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1617e400
-10> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1a7174d0
-9> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x16233830
-8> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x1a6acbb0
-7> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x13f2627e0
-6> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x145d85ac0
-5> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x1a80e220
-4> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x145cf3e60
-3> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x13f550820
-2> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x143af4df0
-1> 2018-09-04 18:23:49.567 17c20700 -1 *** Caught signal (Segmentation fault) **
 in thread 17c20700 thread_name:msgr-worker-1
 ceph version 14.0.0-2709-gf71a21c (f71a21c4e844f4f84439a7b4a5aed84dd0111a78) nautilus (dev)
 1: (()+0xf6d0) [0xe8ea6d0]
 2: (ceph::buffer::list::crc32c(unsigned int) const+0x6b) [0x614469b]
 3: (Message::encode(unsigned long, int)+0xed) [0x606dead]
 4: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x44) [0x6106e34]
 5: (AsyncConnection::handle_write()+0x1d0) [0x610e0d0]
 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa67) [0x6126e37]
 7: (()+0x4a7cd5) [0x612bcd5]
 8: (()+0x6c6fff) [0x634afff]
 9: (()+0x7e25) [0xe8e2e25]
 10: (clone()+0x6d) [0x119babad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
It's not clear which message was being encoded when the segfault occurred, but it likely references memory that was freed along with a coroutine.
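To make the suspected hazard concrete, here is a minimal sketch of the general pattern, not the actual RGW code and not necessarily what the fix does. All names (Payload, FakeCoroutine, FakeMessage, send_after_coroutine_teardown) are hypothetical stand-ins. The idea: if a queued message only borrowed bytes owned by a coroutine, checksumming it after the coroutine is destroyed would read freed memory; if the message co-owns the bytes, the late read is safe.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a payload buffer; this is NOT ceph::buffer::list.
using Payload = std::vector<std::uint8_t>;

// Hypothetical stand-in for a sync coroutine that owns the data it sends.
struct FakeCoroutine {
  std::shared_ptr<Payload> payload =
      std::make_shared<Payload>(Payload{1, 2, 3, 4});
};

// Hypothetical stand-in for a message queued on a messenger connection.
// Holding a shared_ptr (rather than a raw pointer or reference into the
// coroutine) keeps the bytes alive until the message is actually encoded.
struct FakeMessage {
  std::shared_ptr<const Payload> payload;

  // Toy checksum standing in for the crc32c pass seen in the backtrace.
  std::uint32_t checksum() const {
    return std::accumulate(payload->begin(), payload->end(),
                           std::uint32_t{0});
  }
};

// Queue a message, tear down the coroutine (as on shutdown or realm
// reload), then "encode" the message afterwards from another context.
inline std::uint32_t send_after_coroutine_teardown() {
  auto cr = std::make_unique<FakeCoroutine>();
  FakeMessage msg{cr->payload};  // message co-owns the payload
  cr.reset();                    // coroutine and its members are destroyed
  return msg.checksum();         // still valid: the payload outlives cr
}
```

With a raw `const Payload*` in place of the `shared_ptr`, the same call sequence would dereference freed memory in `checksum()`, which is the kind of failure the backtrace above is consistent with.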
Related issues
History
#1 Updated by Casey Bodley over 4 years ago
- Status changed from New to 7
testing https://github.com/ceph/ceph/pull/23920 as a fix
#2 Updated by Casey Bodley over 4 years ago
- Status changed from 7 to Pending Backport
- Backport set to luminous mimic
#3 Updated by Patrick Donnelly over 4 years ago
- Copied to Backport #35856: luminous: multisite: segfault on shutdown/realm reload added
#4 Updated by Patrick Donnelly over 4 years ago
- Copied to Backport #35857: mimic: multisite: segfault on shutdown/realm reload added
#5 Updated by Casey Bodley over 4 years ago
- Related to Bug #23661: RGWAsyncGetSystemObj failed assertion on shutdown/realm reload added
#6 Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved