<h2>CephFS - Backport #17974: jewel: ceph/Client segfaults in handle_mds_map when switching mds</h2>
<p><a class="external" href="https://tracker.ceph.com/issues/17974">https://tracker.ceph.com/issues/17974</a></p>
<p><strong>Nathan Cutler</strong> (ncutler@suse.cz) on 2016-11-21T11:30:53Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=81652">journal 81652</a>):</p>
<ul><li><strong>Category</strong> set to <i>87</i></li></ul>
<p><strong>John Spray</strong> (jcspray@gmail.com) on 2016-11-21T14:46:28Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=81664">journal 81664</a>):</p>
<p>I wouldn't be surprised if it was fixed in Kraken, but I'll look at the Jewel code.</p>
<p><strong>John Spray</strong> (jcspray@gmail.com) on 2016-11-21T14:46:58Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=81665">journal 81665</a>):</p>
<p>(The updates to the code in Kraken were ceph-mgr related, so if they fixed a bug it was completely accidental!)</p>
<p><strong>John Spray</strong> (jcspray@gmail.com) on 2016-11-22T15:51:37Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=81787">journal 81787</a>):</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Fix Under Review</i></li></ul><p>In jewel there is no call to erase a command from the table after it receives a reply, so if a command has ever been sent then it will crash as soon as a failover happens.</p>
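<p>The failure mode described above can be modeled with a small sketch. This is an illustrative reconstruction, not the actual Ceph code: the names (<code>Context</code>, <code>CommandOp</code>, <code>handle_command_reply</code>, <code>handle_mds_failover</code>) are simplified stand-ins, and the only property borrowed from Ceph is that a completion callback deletes itself inside <code>complete()</code>, so a table entry left behind after a reply holds a dangling pointer.</p>

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <map>

// Simplified stand-in for a Ceph-style completion callback: it
// deletes itself in complete(), so any pointer kept to it afterwards
// is dangling. (Illustrative model, not the real Ceph Context class.)
struct Context {
    int *result;
    explicit Context(int *r) : result(r) {}
    void complete(int r) {
        *result = r;
        delete this;
    }
};

// One entry in the client's per-tid command table (simplified).
struct CommandOp {
    Context *on_finish = nullptr;
};

static std::map<uint64_t, CommandOp> commands;  // tid -> pending command

// On reply: complete the waiter AND erase the entry. The jewel client
// completed the waiter but left the entry in the table, which is the bug.
void handle_command_reply(uint64_t tid, int r) {
    auto it = commands.find(tid);
    if (it == commands.end())
        return;
    it->second.on_finish->complete(r);
    commands.erase(it);  // the step missing in jewel
}

// On MDS failover: fail every command that is still outstanding.
// With a stale entry present, this sweep would call complete() through
// a dangling on_finish pointer, which is where the segfault fires.
void handle_mds_failover() {
    for (auto &p : commands)
        p.second.on_finish->complete(-ETIMEDOUT);
    commands.clear();
}
```

<p>With the erase in place, a command that has already been answered can never be seen by the failover sweep; without it, any client that has ever sent a command crashes on the first failover, matching the report below.</p>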
<p><a class="external" href="https://github.com/ceph/ceph/pull/12137">https://github.com/ceph/ceph/pull/12137</a></p>
<p><strong>Loïc Dachary</strong> (loic@dachary.org) on 2016-11-22T16:38:00Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=81794">journal 81794</a>):</p>
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Backport</i></li><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>In Progress</i></li></ul>
<p><strong>Loïc Dachary</strong> (loic@dachary.org) on 2016-11-22T16:38:26Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=81795">journal 81795</a>):</p>
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/81795/diff?detail_id=78897">diff</a>)</li></ul><a name="Original-description"></a>
<h3>Original description<a href="#Original-description" class="wiki-anchor">¶</a></h3>
<p>Our manila-share daemon is segfaulting when our active mds goes away and we switch to the standby.</p>
<p>The crash is in handle_mds_map:</p>
<pre>
(gdb) where
#0 0x0000000003580270 in ?? ()
#1 0x00007f06046ca84a in Client::handle_mds_map (this=this@entry=0x3b58bb0, m=m@entry=0x7f05d8001190) at client/Client.cc:2548
#2 0x00007f060470565b in Client::ms_dispatch (this=0x3b58bb0, m=0x7f05d8001190) at client/Client.cc:2443
#3 0x00007f060496dc9a in ms_deliver_dispatch (m=0x7f05d8001190, this=0x3b58340) at msg/Messenger.h:584
#4 DispatchQueue::entry (this=0x3b58510) at msg/simple/DispatchQueue.cc:185
#5 0x00007f06049df36d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:103
#6 0x00007f061ddffdc5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f061d423ced in clone () from /lib64/libc.so.6
(gdb) up
#1 0x00007f06046ca84a in Client::handle_mds_map (this=this@entry=0x3b58bb0, m=m@entry=0x7f05d8001190) at client/Client.cc:2548
2548 i->second.on_finish->complete(-ETIMEDOUT);
(gdb) p i
$1 = {first = 1, second = {con = {px = },
mds_gid = {<boost::totally_ordered1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::less_than_comparable1<mds_gid_t, boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > > >> = {<boost::equality_comparable1<mds_gid_t, boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::totally_ordered2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> >> = {<boost::less_than_comparable2<mds_gid_t, unsigned long, boost::equality_comparable2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> > >> = {<boost::equality_comparable2<mds_gid_t, unsigned long, boost::detail::empty_base<mds_gid_t> >> = {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, t = 310423241}, tid = 1, on_finish = , outbl = , outs = }}
</pre>
<p>Note that on_finish looks null there.</p>
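<p>A dangling <code>on_finish</code> like the one in the backtrace is presumably a use-after-free rather than a true null: Ceph-style completion callbacks run their payload and then delete themselves, so once a command has been completed, any table entry still pointing at it refers to freed memory. The sketch below is a minimal model of that self-deleting pattern (simplified names, not the real Ceph <code>Context</code> class):</p>

```cpp
#include <cassert>

// Track live callback objects so the self-deletion is observable.
static int live_contexts = 0;

// Minimal model of a self-deleting completion callback: complete()
// runs finish() and then deletes the object, so the caller's pointer
// must never be used again. (Illustrative sketch, not Ceph's code.)
struct Context {
    Context() { ++live_contexts; }
    virtual ~Context() { --live_contexts; }
    virtual void finish(int r) = 0;
    void complete(int r) {
        finish(r);
        delete this;  // after this, every pointer to the object dangles
    }
};

// A concrete waiter that records the completion result.
struct Waiter : Context {
    int *result;
    explicit Waiter(int *r) : result(r) {}
    void finish(int r) override { *result = r; }
};
```

<p>If a command table keeps its <code>on_finish</code> pointer after <code>complete()</code> has run, a later sweep that calls <code>on_finish-&gt;complete(-ETIMEDOUT)</code> on that stale entry dispatches through a freed vtable, which would explain the garbage <code>?? ()</code> frame at the top of the backtrace above.</p>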
<p>We get the same segfault with 10.2.3 and the jewel branch as of today (soon to be 10.2.4).</p>
<p>We've also noticed that the code has been refactored in kraken (which is why I assigned this to John):</p>
<pre><code>client: refactor command handling
common: refactor CommandTable</code></pre>
<p>so we tried the current kraken builds and indeed the crashes are fixed. It's not clear that it was the refactoring that fixed this -- it could have been some other change in the jewel...kraken range instead.</p>
<p>Any ideas? Happy to help track this down.</p>
<p><strong>Loïc Dachary</strong> (loic@dachary.org) on 2016-11-22T16:38:47Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=81796">journal 81796</a>):</p>
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/81796/diff?detail_id=78898">diff</a>)</li></ul>
<p><strong>John Spray</strong> (jcspray@gmail.com) on 2017-01-25T14:06:33Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=84740">journal 84740</a>):</p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul>
<p><strong>Nathan Cutler</strong> (ncutler@suse.cz) on 2017-01-25T17:10:29Z (<a href="https://tracker.ceph.com/issues/17974?journal_id=84783">journal 84783</a>):</p>
<ul><li><strong>Target version</strong> set to <i>v10.2.6</i></li></ul>