https://tracker.ceph.com/https://tracker.ceph.com/favicon.ico2017-01-26T23:22:11ZCeph Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=849062017-01-26T23:22:11ZPatrick Donnellypdonnell@redhat.com
<ul></ul><p>Happened in this run too: <a class="external" href="http://pulpito.ceph.com/pdonnell-2017-01-26_18:37:20-multimds:thrash-wip-multimds-tests-testing-basic-mira/752781/">http://pulpito.ceph.com/pdonnell-2017-01-26_18:37:20-multimds:thrash-wip-multimds-tests-testing-basic-mira/752781/</a></p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=851432017-01-30T14:45:04ZSage Weilsage@newdream.net
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=851652017-01-30T18:34:06ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Project</strong> changed from <i>CephFS</i> to <i>Ceph</i></li><li><strong>Subject</strong> changed from <i>msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")</i> to <i>async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")</i></li><li><strong>Category</strong> changed from <i>90</i> to <i>msgr</i></li><li><strong>Assignee</strong> set to <i>Haomai Wang</i></li></ul><p>Pinging Haomai. Sounds like the MDS is triggering cases we didn't hit in the OSD?</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=856372017-02-04T04:24:41ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>The client &lt;-&gt; MDS connection uses the stateful_server policy, and the MDS crashed because of an old message seq.</p>
<p>Can this teuthology job reproduce it? We need to raise debug_ms to verify the problem.</p>
<p>ping @Patrick Donnelly</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=856992017-02-06T15:29:20ZPatrick Donnellypdonnell@redhat.com
<ul></ul><p>It has shown up many times in the multimds thrasher runs. I'll raise debug_ms for future runs.</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=857272017-02-06T20:23:55ZPatrick Donnellypdonnell@redhat.com
<ul></ul><p>Here you go: /ceph/teuthology-archive/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi149/log/ceph-mds.f.log.gz</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=857432017-02-07T05:29:09ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>Needs client log too.. <a class="external" href="http://qa-proxy.ceph.com/teuthology/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi165/log/">http://qa-proxy.ceph.com/teuthology/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi165/log/</a> shows empty...</p>
<p>BTW, I searched the whole repo and "ms_die_on_old_message" isn't set to true anywhere in qa. Why would it trigger the assert? Did you enable it in this branch?</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=857802017-02-07T23:11:01ZPatrick Donnellypdonnell@redhat.com
<ul></ul><p>Client logs are missing because it's the kernel client. I will need to rerun the test suite to see if I can coax a failure with the ceph-fuse client (it has happened before with ceph-fuse).</p>
<p>I did not enable that config which indeed makes this more strange.</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=857822017-02-08T02:05:13ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>Oh, I guess the kernel client handles the reconnect seq inconsistently with the async msgr. If the fuse client can reproduce it, that would be a great help.</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=858032017-02-08T15:59:34ZGreg Farnumgfarnum@redhat.com
<ul></ul><p>It looks like teuthology sets "ms die on old message = true" in its ceph.conf.template file.</p>
<p>Haomai, we don't <strong>expect</strong> the kernel client to have any problems with the AsyncMessenger, right?</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=858292017-02-09T06:50:55ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>yes..</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=862282017-02-17T16:36:56ZPatrick Donnellypdonnell@redhat.com
<ul></ul><p>Haomai, here's a test run:</p>
<p><a class="external" href="http://pulpito.ceph.com/pdonnell-2017-02-17_15:35:17-multimds:thrash-master-testing-basic-smithi/826081/">http://pulpito.ceph.com/pdonnell-2017-02-17_15:35:17-multimds:thrash-master-testing-basic-smithi/826081/</a></p>
<p>and here's the kernel log with "echo file net/ceph/messenger.c +p >/sys/kernel/debug/dynamic_debug/control" executed on the client:</p>
<p>teuthology:/home/pdonnell/826081/kern.log.gz</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=874242017-03-09T14:56:52ZJohn Sprayjcspray@gmail.com
<ul></ul><p>Lots more of these on latest run:<br /><a class="external" href="http://pulpito.ceph.com/jspray-2017-03-08_14:08:01-multimds-master-testing-basic-smithi/">http://pulpito.ceph.com/jspray-2017-03-08_14:08:01-multimds-master-testing-basic-smithi/</a></p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=874252017-03-09T14:57:13ZJohn Sprayjcspray@gmail.com
<ul><li><strong>Target version</strong> set to <i>v12.0.0</i></li></ul> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=874262017-03-09T15:08:11ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>Ah, I checked Patrick Donnelly's log before, but I really have no experience with the Ceph kernel code, so I have no idea why the kernel msgr doesn't handle the reconnect feature properly.</p>
<p>Anyone who can help on this case? @yanzheng?</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=874272017-03-09T15:08:47ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>BTW, it looks like this crash only happens with multimds? Kernel client + multimds is a clue.</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=881852017-03-29T13:07:17ZJohn Sprayjcspray@gmail.com
<ul></ul><p>Still seeing this here:</p>
<p><a class="external" href="http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958269">http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958269</a><br /><a class="external" href="http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958256">http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958256</a></p>
<p>(both are indeed kclient)</p>
<p>Even if there is a kclient bug here, we need the server to handle this cleanly instead of crashing.</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=883552017-04-04T10:20:14ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>I read a little of net/ceph/messenger.c in the kernel, and it looks like it does not actually respect the reconnect seq. Maybe we could disable the ms_die_on_old_message test in qa with the kernel client.</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=886642017-04-07T13:38:14ZHaomai Wanghaomaiwang@gmail.com
<ul></ul><p>I suspect simple msgr also has this problem, maybe we can try this</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=890872017-04-12T17:23:02ZJohn Sprayjcspray@gmail.com
<ul><li><strong>Subject</strong> changed from <i>async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")</i> to <i>kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")</i></li><li><strong>Assignee</strong> changed from <i>Haomai Wang</i> to <i>Zheng Yan</i></li></ul><p>Thanks for the input Haomai.</p>
<p>Kernel git log has this commit claiming to implement reconnect_seq:<br /><pre>
commit 3a23083bda56850a1dc0e1c6d270b1f5dc789f07
Author: Sage Weil &lt;sage@inktank.com&gt;
Date: Mon Mar 25 08:47:40 2013 -0700
libceph: implement RECONNECT_SEQ feature
This is an old protocol extension that allows the client and server to
avoid resending old messages after a reconnect (following a socket error).
Instead, the exchange their sequence numbers during the handshake. This
avoids sending a bunch of useless data over the socket.
It has been supported in the server code since v0.22 (Sep 2010).
Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@inktank.com&gt;
</pre></p>
<p>However, clearly it isn't working correctly.</p>
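<p>To make the failure mode concrete, here is a minimal Python sketch of the behaviour discussed in this thread. It is a simplified model, not the actual Ceph or kernel messenger code: the class names, fields, and the <code>honor_reconnect_seq</code> switch are all hypothetical, and it assumes only that the receiver tracks the highest seq it has delivered, that peers exchange seqs on reconnect, and that "ms die on old message" turns a discarded old message into an assert.</p>

```python
from dataclasses import dataclass, field


class OldMessageError(AssertionError):
    """Stands in for the server-side assert quoted in the bug title."""


@dataclass
class Receiver:
    # Hypothetical model of the server side of a connection.
    die_on_old_message: bool = False   # models "ms die on old message"
    peer_has_reconnect_seq: bool = True
    in_seq: int = 0                    # highest seq delivered so far
    delivered: list = field(default_factory=list)

    def handle(self, seq, payload):
        """Deliver a new message; discard or die on one already seen."""
        if seq <= self.in_seq:
            if self.peer_has_reconnect_seq and self.die_on_old_message:
                raise OldMessageError("old msgs despite reconnect_seq feature")
            return False               # old message silently discarded
        self.in_seq = seq
        self.delivered.append(payload)
        return True


@dataclass
class Sender:
    # Hypothetical model of the client side of a connection.
    out_seq: int = 0
    unacked: list = field(default_factory=list)  # (seq, payload) pairs

    def queue(self, payload):
        self.out_seq += 1
        self.unacked.append((self.out_seq, payload))
        return self.out_seq

    def resend_after_reconnect(self, peer_in_seq, honor_reconnect_seq=True):
        """Return the messages to replay once the handshake completes."""
        if honor_reconnect_seq:
            # Drop everything the peer reported as already received.
            self.unacked = [(s, p) for s, p in self.unacked if s > peer_in_seq]
        return list(self.unacked)
```

<p>In this model, a sender that honors the exchanged seq replays only messages the receiver has not seen. A sender that ignores it replays seq 1 again, which a receiver with <code>die_on_old_message=True</code> turns into the assert, while a receiver with the option off simply discards the message.</p>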
<p>I notice that the multimds tests have a ``ms inject socket failures`` in their thrashing config, whereas the general kcephfs tests do not.</p>
<p>I've created a jcsp/wip-18690-qa branch with the .yaml files tweaked to disable "die on old message" for the multimds tests, and enable msgr connection failures in the kcephfs tests, so that hopefully we can temporarily ignore this for multimds testing, and at the same time reproduce it in the kcephfs suite to get the underlying cause fixed.</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=892092017-04-13T10:22:54ZJohn Sprayjcspray@gmail.com
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>Linux kernel client</i></li><li><strong>Category</strong> changed from <i>msgr</i> to <i>libceph</i></li></ul><p>OK, here is the issue reproduced on the kcephfs suite, with the qa/ modification in jcsp/wip-18690-qa</p>
<p><a class="external" href="http://pulpito.ceph.com/jspray-2017-04-12_23:35:53-kcephfs:thrash-master-testing-basic-smithi/">http://pulpito.ceph.com/jspray-2017-04-12_23:35:53-kcephfs:thrash-master-testing-basic-smithi/</a></p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=908142017-05-05T11:09:21ZZheng Yanukernel@gmail.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>7</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph-client/commit/c4561c5b195a564423bd002c1a8017a876e44819">https://github.com/ceph/ceph-client/commit/c4561c5b195a564423bd002c1a8017a876e44819</a></p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=917912017-05-30T08:50:22ZIlya Dryomov
<ul><li><strong>Status</strong> changed from <i>7</i> to <i>Resolved</i></li></ul><p><a class="external" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a2ad541071f99eaf4589c3551176fca191c1ee2">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a2ad541071f99eaf4589c3551176fca191c1ee2</a> in 4.12-rc3</p> Linux kernel client - Bug #18690: kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")https://tracker.ceph.com/issues/18690?journal_id=1399162019-07-01T22:27:39ZPatrick Donnellypdonnell@redhat.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-1 priority-4 priority-default" href="/issues/40613">Bug #40613</a>: kclient: .handle_message_footer got old message 1 <= 648 0x558ceadeaac0 client_session(request_renewcaps seq 12), discarding</i> added</li></ul>