Backport #17974

closed

jewel: ceph/Client segfaults in handle_mds_map when switching mds

Added by Dan van der Ster over 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Release:
jewel

Actions #1

Updated by Nathan Cutler over 7 years ago

  • Category set to 87
Actions #2

Updated by John Spray over 7 years ago

I wouldn't be surprised if it was fixed in Kraken but I'll look at the Jewel code.

Actions #3

Updated by John Spray over 7 years ago

(The updates to the code in Kraken were ceph-mgr related, so if they fixed a bug it was completely accidental!)

Actions #4

Updated by John Spray over 7 years ago

  • Status changed from New to Fix Under Review

In jewel there is no call to erase a command from the command table after its reply is received, so the stale entry is still there when handle_mds_map runs on failover. If a command has ever been sent, the client will crash as soon as a failover happens.

https://github.com/ceph/ceph/pull/12137
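
To illustrate the failure mode described above, here is a minimal standalone sketch. It is not the Ceph code and not the contents of the pull request: CommandTable, CommandOp, handle_reply, handle_mds_gone and run_scenario are hypothetical names, and the callback is modelled as a std::function so the sketch can run safely. In the real client the callback is a heap-allocated object, and the assumption here is that it is no longer valid after the first completion, which would be consistent with the segfault in the backtrace.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a pending MDS command.
struct CommandOp {
  uint64_t mds_gid;                    // daemon the command was sent to
  std::function<void(int)> on_finish;  // completion callback
};

// Hypothetical command table; names do not match the Ceph source.
class CommandTable {
  std::map<uint64_t, CommandOp> commands_;
  uint64_t last_tid_ = 0;
  bool erase_on_reply_;  // false models the jewel behaviour described above

public:
  explicit CommandTable(bool erase_on_reply)
      : erase_on_reply_(erase_on_reply) {}

  uint64_t start_command(uint64_t mds_gid, std::function<void(int)> on_finish) {
    uint64_t tid = ++last_tid_;
    commands_.emplace(tid, CommandOp{mds_gid, std::move(on_finish)});
    return tid;
  }

  // Reply path: complete the callback, then (only in the fixed variant)
  // drop the entry so the failover path can never see it again.
  void handle_reply(uint64_t tid, int r) {
    auto it = commands_.find(tid);
    if (it == commands_.end()) return;
    it->second.on_finish(r);
    if (erase_on_reply_) commands_.erase(it);
  }

  // Failover path: time out every command still addressed to the MDS
  // that went away.
  void handle_mds_gone(uint64_t mds_gid) {
    for (auto it = commands_.begin(); it != commands_.end();) {
      if (it->second.mds_gid == mds_gid) {
        it->second.on_finish(-ETIMEDOUT);
        it = commands_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t size() const { return commands_.size(); }
};

// Sends one command, receives its reply, then fails the MDS over.
// Returns every result code the callback was completed with.
std::vector<int> run_scenario(bool erase_on_reply) {
  std::vector<int> results;
  CommandTable table(erase_on_reply);
  uint64_t tid = table.start_command(
      310423241, [&results](int r) { results.push_back(r); });
  table.handle_reply(tid, 0);
  table.handle_mds_gone(310423241);
  return results;
}
```

With erase_on_reply set to false, run_scenario returns two completions (0, then -ETIMEDOUT): the already-answered command is completed a second time on failover. With erase_on_reply set to true the callback only ever sees the reply.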

Actions #5

Updated by Loïc Dachary over 7 years ago

  • Tracker changed from Bug to Backport
  • Status changed from Fix Under Review to In Progress
Actions #6

Updated by Loïc Dachary over 7 years ago

  • Description updated (diff)

Original description

Our manila-share daemon is segfaulting when our active mds goes away and we switch to the standby.

The crash is in handle_mds_map:

(gdb) where
#0  0x0000000003580270 in ?? ()
#1  0x00007f06046ca84a in Client::handle_mds_map (this=this@entry=0x3b58bb0, m=m@entry=0x7f05d8001190) at client/Client.cc:2548
#2  0x00007f060470565b in Client::ms_dispatch (this=0x3b58bb0, m=0x7f05d8001190) at client/Client.cc:2443
#3  0x00007f060496dc9a in ms_deliver_dispatch (m=0x7f05d8001190, this=0x3b58340) at msg/Messenger.h:584
#4  DispatchQueue::entry (this=0x3b58510) at msg/simple/DispatchQueue.cc:185
#5  0x00007f06049df36d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:103
#6  0x00007f061ddffdc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f061d423ced in clone () from /lib64/libc.so.6

(gdb) up
#1  0x00007f06046ca84a in Client::handle_mds_map (this=this@entry=0x3b58bb0, m=m@entry=0x7f05d8001190) at client/Client.cc:2548
2548            i->second.on_finish->complete(-ETIMEDOUT);

(gdb) p i
$1 = {first = 1, second = {con = {px = },
    mds_gid = {<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long, boost::equality_comparable2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 310423241}, tid = 1, on_finish = , outbl = , outs = }}

Note that on_finish looks null there.

We get the same segfault with 10.2.3 and the jewel branch as of today (soon to be 10.2.4).

We've also noticed that the code has been refactored in kraken (which is why I assigned this to John):

client: refactor command handling
common: refactor CommandTable

so we tried the current kraken builds and indeed the crashes are fixed. It's not clear whether the refactoring is what fixed this; it could have been something else in jewel...kraken instead.

Any ideas? Happy to help track this down.

Actions #7

Updated by Loïc Dachary over 7 years ago

  • Description updated (diff)
Actions #8

Updated by John Spray about 7 years ago

  • Status changed from In Progress to Resolved
Actions #9

Updated by Nathan Cutler about 7 years ago

  • Target version set to v10.2.6