https://tracker.ceph.com/ 2015-04-27T21:10:21Z Ceph
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=51099 2015-04-27T21:10:21Z Greg Farnum gfarnum@redhat.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/51099/diff?detail_id=49989">diff</a>)</li></ul>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=51108 2015-04-28T06:10:33Z Zheng Yan ukernel@gmail.com
<ul></ul><p>I can't access <a class="external" href="http://pulpito-rdu.front.sepia.ceph.com/">http://pulpito-rdu.front.sepia.ceph.com/</a>. Can anyone post the full MDS log here?</p>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=51308 2015-05-05T15:16:19Z John Spray jcspray@gmail.com
<ul></ul><p>Weird! This is an MDS going from state active to state standby, then back again. In the final stage it notices that its inotable is already initialized, so it asserts out.</p>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=51316 2015-05-05T16:16:42Z John Spray jcspray@gmail.com
<ul></ul><p>This looks like a mon bug of some kind. After a leader election, a bunch of routed messages are being resent to the leader by another mon. These include boot beacons from various MDS instances. The daemon states in the MDSMap flap around like this:<br /><pre>
--
4107: 172.20.133.69:6808/20631 'a' mds.-1.0 up:standby seq 1
4129: 172.20.133.65:6811/16012 'a-s' mds.0.4 up:active seq 10
--
4107: 172.20.133.69:6808/20631 'a' mds.-1.0 up:standby seq 1
4111: 172.20.133.65:6808/16012 'a-s' mds.-1.0 up:standby seq 1
--
4107: 172.20.133.69:6808/20631 'a' mds.-1.0 up:standby seq 1
4123: 172.20.133.65:6809/16012 'a-s' mds.-1.0 up:standby seq 1
--
4107: 172.20.133.69:6808/20631 'a' mds.-1.0 up:standby seq 1
4126: 172.20.133.65:6810/16012 'a-s' mds.-1.0 up:standby seq 1
--
4107: 172.20.133.69:6808/20631 'a' mds.-1.0 up:standby seq 1
4129: 172.20.133.65:6811/16012 'a-s' mds.-1.0 up:standby seq 1
--
4107: 172.20.133.69:6808/20631 'a' mds.-1.0 up:standby seq 1
4129: 172.20.133.65:6811/16012 'a-s' mds.-1.0 up:standby seq 1
--
4107: 172.20.133.69:6808/20631 'a' mds.-1.0 up:standby seq 1
4129: 172.20.133.65:6811/16012 'a-s' mds.0.5 up:replay seq 1
</pre></p>
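<p>A minimal sketch of how the suspicious transition could be picked out of map dumps like the ones above, assuming each line has exactly the layout shown (GID, address, name, <code>mds.rank.inc</code>, state, seq). The regular expression and the function names <code>parse_line</code> and <code>find_demotions</code> are hypothetical helpers for illustration, not part of Ceph:</p>

```python
import re
from typing import List, Optional, Tuple

# Assumed line layout, taken from the dumps above:
#   4129: 172.20.133.65:6811/16012 'a-s' mds.0.4 up:active seq 10
LINE_RE = re.compile(
    r"^(\d+):\s+\S+\s+'([^']+)'\s+mds\.(-?\d+)\.(\d+)\s+(up:\w+)\s+seq\s+(\d+)$"
)


def parse_line(line: str) -> Optional[Tuple[int, str, str, int]]:
    """Parse one dump line into (gid, name, state, seq), or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    gid, name, _rank, _inc, state, seq = m.groups()
    return int(gid), name, state, int(seq)


def find_demotions(snapshots: List[List[str]]) -> List[Tuple[int, int]]:
    """Return (gid, snapshot_index) for every GID that was seen as up:active
    in an earlier or the same snapshot and then shows up as up:standby."""
    was_active = set()
    hits = []
    for idx, snapshot in enumerate(snapshots):
        for line in snapshot:
            parsed = parse_line(line)
            if parsed is None:
                continue  # separators such as "--" are skipped
            gid, _name, state, _seq = parsed
            if state == "up:active":
                was_active.add(gid)
            elif state == "up:standby" and gid in was_active:
                hits.append((gid, idx))
    return hits
```

<p>Feeding it the snapshots as lists of lines, in the order the maps were published, flags any GID that goes back to standby after having been active.</p>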
<p>(The '4129' GID is the one that then crashes, as a result of being sent back into standby after having been active.)</p>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=51317 2015-05-05T16:48:45Z John Spray jcspray@gmail.com
<ul></ul><p>The mons appear to be rather unhappy in this test run, but there are no thrashers or messenger failure settings turned on. So I wonder if this was an existing bug, and we're only seeing it now because of something nasty with the hosts or network in the new lab that's causing lots of thrashing of the monitors?</p>
<p>The MDSs are also getting restarted for some reason (before we see the failure condition), but there's nothing in the logs to indicate why.</p>
<p>Was there ever a mechanism in MDSMonitor to protect against seeing a message from a dead MDS <strong>after</strong> we see the message from a live MDS that replaces it on the same host with the same ID?</p>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=51648 2015-05-12T04:48:07Z Greg Farnum gfarnum@redhat.com
<ul><li><strong>Regression</strong> set to <i>No</i></li></ul><p>For some reason the leader is not responding to all of the MDSBeacon messages it receives (but it does respond to others), and so the ones it didn't respond to are getting re-routed on each election. This is deliberate behavior on the part of the peon.</p>
<p>Moreover, there is a seq value in the MDSBeacon and the monitors are supposed to ignore any of them with a seq value lower than what's in the map.</p>
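<p>A minimal sketch of the seq check described above, under the assumption that the map keeps one last-seen seq per GID. The <code>Beacon</code> class, the <code>should_ignore</code> function and the dictionary standing in for the map are hypothetical and do not mirror the actual MDSMonitor code:</p>

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Beacon:
    gid: int    # global ID of the MDS daemon instance
    state: str  # state the daemon reports, e.g. "up:standby"
    seq: int    # beacon sequence number


def should_ignore(beacon: Beacon, map_seq_by_gid: Dict[int, int]) -> bool:
    """Return True when the beacon carries a seq lower than the one already in the map."""
    known_seq = map_seq_by_gid.get(beacon.gid)
    if known_seq is None:
        # Assumption for this sketch: a GID not yet in the map is a new boot.
        return False
    return beacon.seq < known_seq
```

<p>With the values from the dumps earlier in this ticket (GID 4129 recorded at seq 10), a replayed boot beacon with seq 1 would be dropped by a check like this, which is why it is surprising that the stale beacons were accepted.</p>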
<p>Based on C_Updated and the comment in MDSMonitor::_updated, it looks to me like we're expecting the MDS to get a beacon reply during update_from_paxos, but there's absolutely nothing that would do that. It looks to me like it's just very broken, and I'm not sure why we wouldn't have seen this in the past; perhaps our timeouts are just generous enough that it usually doesn't cause issues. :/</p>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=51650 2015-05-12T05:02:14Z Greg Farnum gfarnum@redhat.com
<ul></ul><p>John, can you do something so the MDS resets itself if it gets moved out of the active state? I think there might be ad-hoc mechanisms to do that in some places already, and although it <strong>should</strong> be either actually dead or else blacklisted whenever that transition happens, it's always good to be resilient.</p>
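<p>A rough sketch of the kind of guard being asked for, assuming the daemon compares its own state with the state the new map assigns to it. The function <code>next_action</code>, the action strings, and the choice of which target states count as a demotion are all hypothetical, not the behaviour of the real MDS:</p>

```python
ACTIVE = "up:active"

# Assumption for this sketch: being moved from active into one of these
# states means the daemon still holds state it must not reuse.
DEMOTED_STATES = {"up:standby", "up:replay"}


def next_action(current_state: str, assigned_state: str) -> str:
    """Decide what a daemon does when the map assigns it a new state.

    Returns "respawn" when an active daemon is pushed back to an earlier
    state, so that it starts over with clean tables instead of tripping
    an assert later; otherwise returns "continue".
    """
    if current_state == ACTIVE and assigned_state in DEMOTED_STATES:
        return "respawn"
    return "continue"
```

<p>The point of restarting rather than patching up state in place is that every table and cache is then rebuilt from scratch, which avoids relying on each subsystem to undo its own initialization.</p>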
<p>I've created <a class="issue tracker-7 status-3 priority-4 priority-default closed" title="Fix: MDSMonitor: handle MDSBeacon messages properly (Resolved)" href="https://tracker.ceph.com/issues/11590">#11590</a> to track the monitor beacon replies bit of this. I'm still unclear on how the monitor could have accepted any outdated MDSBeacon messages though, and also on how it's skipping all the debugging output I'd expect to see on which paths it's taking (these logs appear to be at mon debug 20).</p>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=52129 2015-05-19T08:31:19Z Zheng Yan ukernel@gmail.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Resolved</i></li></ul>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=54418 2015-07-02T02:56:57Z Kefu Chai tchaikov@gmail.com
<ul></ul><p>Fixed in <a class="external" href="https://github.com/ceph/ceph/pull/4658">https://github.com/ceph/ceph/pull/4658</a>.</p>
CephFS - Bug #11481: "mds/MDSTable.cc: 146: FAILED assert(is_undef())" on standby->replay transition https://tracker.ceph.com/issues/11481?journal_id=74621 2016-07-13T05:52:34Z Greg Farnum gfarnum@redhat.com
<ul><li><strong>Component(FS)</strong> <i>MDS</i> added</li></ul>