Bug #18690
kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")
Status: Closed
Description
2017-01-26 21:55:21.740004 7f114f1a5700 10 mds.b ms_handle_accept 172.21.4.112:0/3776505678 con 0x55e0a2bb5800 session 0x55e0a2ac6300
2017-01-26 21:55:21.740294 7f1151143700 0 -- 172.21.4.128:6812/1494558506 >> 172.21.4.112:0/3776505678 conn(0x55e0a2bb5800 :6812 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=6 cs=3 l=0).process got old message 2 <= 2 0x55e0a2bc2600 client_request(client.4173:1 getattr p #1 2017-01-26 21:55:21.596971 caller_uid=0, caller_gid=0{}) v2, discarding
2017-01-26 21:55:21.742454 7f1151143700 -1 /tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: In function 'void AsyncConnection::process()' thread 7f1151143700 time 2017-01-26 21:55:21.740343
/tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: 741: FAILED assert(0 == "old msgs despite reconnect_seq feature")
ceph version 11.1.0-6912-gaf9152f (af9152f34a416a46bbf96b14b1416ca004c80f84)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f11558e6542]
2: (AsyncConnection::process()+0x2f35) [0x7f1155a02c75]
3: (EventCenter::process_events(int)+0x9b1) [0x7f1155a125e1]
4: (()+0x39dfe9) [0x7f1155a16fe9]
5: (()+0xb8c80) [0x7f1154d86c80]
6: (()+0x770a) [0x7f115546370a]
7: (clone()+0x6d) [0x7f11544ec82d]
From: /ceph/teuthology-archive/pdonnell-2017-01-26_18:37:20-multimds:thrash-wip-multimds-tests-testing-basic-mira/752781/remote/mira025/log/ceph-mds.b.log.gz
This failure happened before any thrashing.
Updated by Patrick Donnelly about 7 years ago
Updated by Greg Farnum about 7 years ago
- Project changed from CephFS to Ceph
- Subject changed from msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")
- Category changed from 90 to msgr
- Assignee set to Haomai Wang
Pinging Haomai. Sounds like the MDS is triggering cases we didn't hit in the OSD?
Updated by Haomai Wang about 7 years ago
The client <-> MDS connection uses the stateful_server policy, and the MDS crashed because of an old message seq.
Can this teuthology job reproduce it? We need to raise debug_ms to verify the problem.
ping @Patrick Donnelly
Updated by Patrick Donnelly about 7 years ago
It has shown up many times in the multimds thrasher runs. I'll raise debug_ms for future runs.
Updated by Patrick Donnelly about 7 years ago
Here you go: /ceph/teuthology-archive/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi149/log/ceph-mds.f.log.gz
Updated by Haomai Wang about 7 years ago
Needs the client log too... http://qa-proxy.ceph.com/teuthology/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi165/log/ shows empty...
BTW, I searched the whole repo and "ms_die_on_old_message" isn't set to true anywhere in qa. Why would it trigger the assert? Did you enable this in your branch?
Updated by Patrick Donnelly about 7 years ago
Client logs are missing because it's the kernel client. I will need to rerun the test suite to see if I can coax a failure with the ceph-fuse client (it has happened before with ceph-fuse).
I did not enable that config which indeed makes this more strange.
Updated by Haomai Wang about 7 years ago
OH, I guess the kernel client handles the reconnect seq inconsistently with the async msgr. If the fuse client reproduces it, that would be a great help.
Updated by Greg Farnum about 7 years ago
It looks like teuthology sets "ms die on old message = true" in its ceph.conf.template file.
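For reference, the override in question would look roughly like this in the generated ceph.conf (the exact section and template wording are my assumption, based on the option name):

```ini
[global]
ms die on old message = true
```

This is why a condition that would normally just log "discarding" instead aborts the daemon in teuthology runs.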
Haomai, we don't expect the kernel client to have any problems with the AsyncMessenger, right?
Updated by Patrick Donnelly about 7 years ago
Haomai, here's a test run:
and here's the kernel log with "echo file net/ceph/messenger.c +p >/sys/kernel/debug/dynamic_debug/control" executed on the client:
teuthology:/home/pdonnell/826081/kern.log.gz
Updated by John Spray about 7 years ago
Lots more of these on latest run:
http://pulpito.ceph.com/jspray-2017-03-08_14:08:01-multimds-master-testing-basic-smithi/
Updated by Haomai Wang about 7 years ago
Ah, I checked Patrick Donnelly's log earlier, but I really have no experience with the ceph kernel code, so I have no idea why the kernel msgr doesn't properly honor the reconnect feature.
Can anyone help with this case? @yanzheng?
Updated by Haomai Wang about 7 years ago
BTW, it looks like this crash only happens with multimds? Kernel client + multimds is a clue.
Updated by John Spray about 7 years ago
Still seeing this here:
http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958269
http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958256
(both are indeed kclient)
Even if there is a kclient bug here, we need the server to handle this cleanly instead of crashing.
Updated by Haomai Wang about 7 years ago
I read a little of the kernel's net/ceph/messenger.c, and it looks like it doesn't actually respect the reconnect seq. Maybe we could disable the ms_die_on_old_message setting in qa when using the kernel client.
Updated by Haomai Wang about 7 years ago
I suspect the simple msgr also has this problem; maybe we can try that.
Updated by John Spray about 7 years ago
- Subject changed from async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")
- Assignee changed from Haomai Wang to Zheng Yan
Thanks for the input Haomai.
Kernel git log has this commit claiming to implement reconnect_seq:
commit 3a23083bda56850a1dc0e1c6d270b1f5dc789f07
Author: Sage Weil <sage@inktank.com>
Date:   Mon Mar 25 08:47:40 2013 -0700

    libceph: implement RECONNECT_SEQ feature

    This is an old protocol extension that allows the client and server to
    avoid resending old messages after a reconnect (following a socket
    error). Instead, they exchange their sequence numbers during the
    handshake. This avoids sending a bunch of useless data over the
    socket. It has been supported in the server code since v0.22 (Sep
    2010).

    Signed-off-by: Sage Weil <sage@inktank.com>
    Reviewed-by: Alex Elder <elder@inktank.com>
However, clearly it isn't working correctly.
I notice that the multimds tests have ``ms inject socket failures`` in their thrashing config, whereas the general kcephfs tests do not.
I've created a jcsp/wip-18690-qa branch with the .yaml files tweaked to disable "die on old message" for the multimds tests, and enable msgr connection failures in the kcephfs tests, so that hopefully we can temporarily ignore this for multimds testing, and at the same time reproduce it in the kcephfs suite to get the underlying cause fixed.
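The overrides in that branch would take roughly this shape (illustrative only; the actual yaml in jcsp/wip-18690-qa may differ, and the injection interval value here is a guess):

```yaml
overrides:
  ceph:
    conf:
      global:
        ms die on old message: false   # multimds: tolerate the kclient bug
        ms inject socket failures: 2500  # kcephfs: force reconnects to reproduce it
```

The two settings pull in opposite directions on purpose: one suite ignores the symptom, the other provokes it.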
Updated by John Spray about 7 years ago
- Project changed from Ceph to Linux kernel client
- Category changed from msgr to libceph
OK, here is the issue reproduced on the kcephfs suite, with the qa/ modification in jcsp/wip-18690-qa
http://pulpito.ceph.com/jspray-2017-04-12_23:35:53-kcephfs:thrash-master-testing-basic-smithi/
Updated by Zheng Yan almost 7 years ago
- Status changed from New to 7
Updated by Ilya Dryomov almost 7 years ago
- Status changed from 7 to Resolved
Updated by Patrick Donnelly almost 5 years ago
- Related to Bug #40613: kclient: .handle_message_footer got old message 1 <= 648 0x558ceadeaac0 client_session(request_renewcaps seq 12), discarding added