Bug #18690
kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")
Status: Closed
Description
2017-01-26 21:55:21.740004 7f114f1a5700 10 mds.b ms_handle_accept 172.21.4.112:0/3776505678 con 0x55e0a2bb5800 session 0x55e0a2ac6300
2017-01-26 21:55:21.740294 7f1151143700 0 -- 172.21.4.128:6812/1494558506 >> 172.21.4.112:0/3776505678 conn(0x55e0a2bb5800 :6812 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=6 cs=3 l=0).process got old message 2 <= 2 0x55e0a2bc2600 client_request(client.4173:1 getattr p #1 2017-01-26 21:55:21.596971 caller_uid=0, caller_gid=0{}) v2, discarding
2017-01-26 21:55:21.742454 7f1151143700 -1 /tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: In function 'void AsyncConnection::process()' thread 7f1151143700 time 2017-01-26 21:55:21.740343
/tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: 741: FAILED assert(0 == "old msgs despite reconnect_seq feature")
ceph version 11.1.0-6912-gaf9152f (af9152f34a416a46bbf96b14b1416ca004c80f84)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f11558e6542]
2: (AsyncConnection::process()+0x2f35) [0x7f1155a02c75]
3: (EventCenter::process_events(int)+0x9b1) [0x7f1155a125e1]
4: (()+0x39dfe9) [0x7f1155a16fe9]
5: (()+0xb8c80) [0x7f1154d86c80]
6: (()+0x770a) [0x7f115546370a]
7: (clone()+0x6d) [0x7f11544ec82d]
From: /ceph/teuthology-archive/pdonnell-2017-01-26_18:37:20-multimds:thrash-wip-multimds-tests-testing-basic-mira/752781/remote/mira025/log/ceph-mds.b.log.gz
This failure happened before any thrashing.
Updated by Patrick Donnelly about 7 years ago
Updated by Greg Farnum about 7 years ago
- Project changed from CephFS to Ceph
- Subject changed from msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")
- Category changed from 90 to msgr
- Assignee set to Haomai Wang
Pinging Haomai. Sounds like the MDS is triggering cases we didn't hit in the OSD?
Updated by Haomai Wang about 7 years ago
The client <-> MDS connection uses the stateful_server policy, and the MDS crashed because of an old message seq.
Can this teuthology job reproduce it? We need to raise debug_ms to verify the problem.
ping @Patrick Donnelly
Updated by Patrick Donnelly about 7 years ago
It has shown up many times in the multimds thrasher runs. I'll raise debug_ms for future runs.
Updated by Patrick Donnelly about 7 years ago
Here you go: /ceph/teuthology-archive/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi149/log/ceph-mds.f.log.gz
Updated by Haomai Wang about 7 years ago
Needs the client log too... http://qa-proxy.ceph.com/teuthology/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi165/log/ shows empty...
BTW, I searched the whole repo and "ms_die_on_old_message" isn't set to true anywhere in qa. Why would it trigger the assert? Did you enable this in your branch?
Updated by Patrick Donnelly about 7 years ago
Client logs are missing because it's the kernel client. I will need to rerun the test suite to see if I can coax a failure with the ceph-fuse client (it has happened before with ceph-fuse).
I did not enable that config which indeed makes this more strange.
Updated by Haomai Wang about 7 years ago
OH, I guess the kernel client handles the reconnect seq inconsistently with the async msgr. If the fuse client reproduces it, that would be a great help.
Updated by Greg Farnum about 7 years ago
It looks like teuthology sets "ms die on old message = true" in its ceph.conf.template file.
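For reference, the override in question would look roughly like this in the generated ceph.conf (the exact section and template wording are my assumption, based on the option name):

```ini
[global]
ms die on old message = true
```

This is why a condition that would normally just log "discarding" instead aborts the daemon in teuthology runs.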
Haomai, we don't expect the kernel client to have any problems with the AsyncMessenger, right?
Updated by Patrick Donnelly about 7 years ago
Haomai, here's a test run:
and here's the kernel log with "echo file net/ceph/messenger.c +p >/sys/kernel/debug/dynamic_debug/control" executed on the client:
teuthology:/home/pdonnell/826081/kern.log.gz
Updated by John Spray about 7 years ago
Lots more of these on latest run:
http://pulpito.ceph.com/jspray-2017-03-08_14:08:01-multimds-master-testing-basic-smithi/
Updated by Haomai Wang about 7 years ago
Ah, I checked Patrick Donnelly's log earlier, but I really have no experience with the ceph kernel code, so I have no idea why the kernel msgr doesn't properly honor the reconnect feature.
Can anyone help with this case? @yanzheng?
Updated by Haomai Wang about 7 years ago
BTW, it looks like this crash only happens with multimds? Kernel client + multimds is a clue.
Updated by John Spray about 7 years ago
Still seeing this here:
http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958269
http://pulpito.ceph.com/jspray-2017-03-29_01:19:13-multimds-wip-jcsp-testing-20170328-testing-basic-smithi/958256
(both are indeed kclient)
Even if there is a kclient bug here, we need the server to handle this cleanly instead of crashing.
Updated by Haomai Wang about 7 years ago
I read a little of the kernel's net/ceph/messenger.c, and it looks like it doesn't actually respect the reconnect seq. Maybe we could disable the ms_die_on_old_message setting in qa when using the kernel client.
Updated by Haomai Wang about 7 years ago
I suspect the simple msgr also has this problem; maybe we can try that.
Updated by John Spray about 7 years ago
- Subject changed from async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")
- Assignee changed from Haomai Wang to Zheng Yan
Thanks for the input Haomai.
Kernel git log has this commit claiming to implement reconnect_seq:
commit 3a23083bda56850a1dc0e1c6d270b1f5dc789f07
Author: Sage Weil <sage@inktank.com>
Date:   Mon Mar 25 08:47:40 2013 -0700

    libceph: implement RECONNECT_SEQ feature

    This is an old protocol extension that allows the client and server to
    avoid resending old messages after a reconnect (following a socket
    error). Instead, they exchange their sequence numbers during the
    handshake. This avoids sending a bunch of useless data over the
    socket. It has been supported in the server code since v0.22 (Sep
    2010).

    Signed-off-by: Sage Weil <sage@inktank.com>
    Reviewed-by: Alex Elder <elder@inktank.com>
However, clearly it isn't working correctly.
I notice that the multimds tests have ``ms inject socket failures`` in their thrashing config, whereas the general kcephfs tests do not.
I've created a jcsp/wip-18690-qa branch with the .yaml files tweaked to disable "die on old message" for the multimds tests, and enable msgr connection failures in the kcephfs tests, so that hopefully we can temporarily ignore this for multimds testing, and at the same time reproduce it in the kcephfs suite to get the underlying cause fixed.
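The overrides in that branch would take roughly this shape (illustrative only; the actual yaml in jcsp/wip-18690-qa may differ, and the injection interval value here is a guess):

```yaml
overrides:
  ceph:
    conf:
      global:
        ms die on old message: false   # multimds: tolerate the kclient bug
        ms inject socket failures: 2500  # kcephfs: force reconnects to reproduce it
```

The two settings pull in opposite directions on purpose: one suite ignores the symptom, the other provokes it.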
Updated by John Spray about 7 years ago
- Project changed from Ceph to Linux kernel client
- Category changed from msgr to libceph
OK, here is the issue reproduced on the kcephfs suite, with the qa/ modification in jcsp/wip-18690-qa
http://pulpito.ceph.com/jspray-2017-04-12_23:35:53-kcephfs:thrash-master-testing-basic-smithi/
Updated by Zheng Yan almost 7 years ago
- Status changed from New to 7
Updated by Ilya Dryomov almost 7 years ago
- Status changed from 7 to Resolved
Updated by Patrick Donnelly almost 5 years ago
- Related to Bug #40613: kclient: .handle_message_footer got old message 1 <= 648 0x558ceadeaac0 client_session(request_renewcaps seq 12), discarding added