Bug #35543

multisite: segfault on shutdown/realm reload

Added by Casey Bodley 3 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
09/04/2018
Due date:
% Done:

0%

Source:
Tags:
multisite
Backport:
luminous mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I've been seeing a lot of these segfaults in multisite tests. For example, http://qa-proxy.ceph.com/teuthology/cbodley-2018-09-04_16:22:33-rgw:multisite-wip-rgw-sync-trace-cleanup-distro-basic-smithi/2979041/teuthology.log

  -229> 2018-09-04 18:23:48.596 35e6e700  5 data sync: Sync:e2fa9a6e:data:Data:all:finish
  -228> 2018-09-04 18:23:48.596 35e6e700  0 data sync: ERROR: failed to run sync
...
   -26> 2018-09-04 18:23:49.566 2f04e700 20 RGWWQ:
   -25> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f066580
   -24> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f421fb0
   -23> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f45e090
   -22> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x16480b50
   -21> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1442a9150
   -20> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x13f2abf10
   -19> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x3ae97c90
   -18> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1441d7470
   -17> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x145b1a3e0
   -16> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x16407d60
   -15> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1a77aec0
   -14> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1a6f0780
   -13> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1649f770
   -12> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x15017560
   -11> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1617e400
   -10> 2018-09-04 18:23:49.566 2f04e700 20 req: 0x1a7174d0
    -9> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x16233830
    -8> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x1a6acbb0
    -7> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x13f2627e0
    -6> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x145d85ac0
    -5> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x1a80e220
    -4> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x145cf3e60
    -3> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x13f550820
    -2> 2018-09-04 18:23:49.567 2f04e700 20 req: 0x143af4df0
    -1> 2018-09-04 18:23:49.567 17c20700 -1 *** Caught signal (Segmentation fault) **
 in thread 17c20700 thread_name:msgr-worker-1

 ceph version 14.0.0-2709-gf71a21c (f71a21c4e844f4f84439a7b4a5aed84dd0111a78) nautilus (dev)
 1: (()+0xf6d0) [0xe8ea6d0]
 2: (ceph::buffer::list::crc32c(unsigned int) const+0x6b) [0x614469b]
 3: (Message::encode(unsigned long, int)+0xed) [0x606dead]
 4: (AsyncConnection::prepare_send_message(unsigned long, Message*, ceph::buffer::list&)+0x44) [0x6106e34]
 5: (AsyncConnection::handle_write()+0x1d0) [0x610e0d0]
 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa67) [0x6126e37]
 7: (()+0x4a7cd5) [0x612bcd5]
 8: (()+0x6c6fff) [0x634afff]
 9: (()+0x7e25) [0xe8e2e25]
 10: (clone()+0x6d) [0x119babad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

It's not clear which message was being encoded when the segfault occurred, but it is likely referencing memory that was freed along with a coroutine.
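
To illustrate the suspected lifetime problem, here is a minimal sketch. It is not Ceph code: `Coroutine`, `BorrowingMessage`, `OwningMessage`, `checksum` and `encode` are hypothetical stand-ins, and the checksum is a plain byte sum rather than crc32c. It assumes the pattern is "a queued message refers to a payload whose lifetime is tied to a coroutine", and uses `std::weak_ptr` so the dangling case can be observed safely instead of crashing.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration only; none of these are Ceph types.
using Buffer = std::vector<std::uint8_t>;

// A "coroutine" that owns the payload it asked the messenger to send.
struct Coroutine {
  std::shared_ptr<Buffer> payload =
      std::make_shared<Buffer>(Buffer{1, 2, 3, 4});
};

// A queued message that only observes the payload (non-owning reference).
struct BorrowingMessage {
  std::weak_ptr<Buffer> payload;
};

// A queued message that shares ownership of the payload.
struct OwningMessage {
  std::shared_ptr<Buffer> payload;
};

// Stand-in for the checksum step at encode time (byte sum, not crc32c).
// Returns nullopt when the buffer no longer exists.
std::optional<unsigned> checksum(const std::shared_ptr<Buffer>& buf) {
  if (!buf) return std::nullopt;
  return std::accumulate(buf->begin(), buf->end(), 0u);
}

std::optional<unsigned> encode(const BorrowingMessage& m) {
  return checksum(m.payload.lock());  // lock() yields null if already freed
}

std::optional<unsigned> encode(const OwningMessage& m) {
  return checksum(m.payload);
}

// Simulate shutdown: the coroutine is destroyed before the message is encoded.
bool borrowed_payload_survives_shutdown() {
  auto cr = std::make_unique<Coroutine>();
  BorrowingMessage msg{cr->payload};
  cr.reset();  // coroutine torn down while the message is still queued
  return encode(msg).has_value();
}

bool owned_payload_survives_shutdown() {
  auto cr = std::make_unique<Coroutine>();
  OwningMessage msg{cr->payload};
  cr.reset();  // message keeps the payload alive on its own
  return encode(msg).has_value();
}
```

In this sketch the borrowing message finds its payload gone after the coroutine is destroyed, while the owning message can still be encoded. With a raw pointer in place of `std::weak_ptr`, the first case would be a use-after-free, which would be consistent with a crash inside a checksum routine during encode.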


Related issues

Related to rgw - Bug #23661: RGWAsyncGetSystemObj failed assertion on shutdown/realm reload Resolved 04/11/2018
Copied to rgw - Backport #35856: luminous: multisite: segfault on shutdown/realm reload Resolved
Copied to rgw - Backport #35857: mimic: multisite: segfault on shutdown/realm reload Resolved

History

#1 Updated by Casey Bodley 3 months ago

  • Status changed from New to Testing

#2 Updated by Casey Bodley 3 months ago

  • Status changed from Testing to Pending Backport
  • Backport set to luminous mimic

#3 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #35856: luminous: multisite: segfault on shutdown/realm reload added

#4 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #35857: mimic: multisite: segfault on shutdown/realm reload added

#5 Updated by Casey Bodley 3 months ago

  • Related to Bug #23661: RGWAsyncGetSystemObj failed assertion on shutdown/realm reload added

#6 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved
