Project

General

Profile

Actions

Bug #18690

closed

kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")

Added by Patrick Donnelly about 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
libceph
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

2017-01-26 21:55:21.740004 7f114f1a5700 10 mds.b ms_handle_accept 172.21.4.112:0/3776505678 con 0x55e0a2bb5800 session 0x55e0a2ac6300
2017-01-26 21:55:21.740294 7f1151143700  0 -- 172.21.4.128:6812/1494558506 >> 172.21.4.112:0/3776505678 conn(0x55e0a2bb5800 :6812 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=6 cs=3 l=0).process got old message 2 <= 2 0x55e0a2bc2600 client_request(client.4173:1 getattr p #1 2017-01-26 21:55:21.596971 caller_uid=0, caller_gid=0{}) v2, discarding
2017-01-26 21:55:21.742454 7f1151143700 -1 /tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: In function 'void AsyncConnection::process()' thread 7f1151143700 time 2017-01-26 21:55:21.740343
/tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: 741: FAILED assert(0 == "old msgs despite reconnect_seq feature")

 ceph version 11.1.0-6912-gaf9152f (af9152f34a416a46bbf96b14b1416ca004c80f84)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f11558e6542]
 2: (AsyncConnection::process()+0x2f35) [0x7f1155a02c75]
 3: (EventCenter::process_events(int)+0x9b1) [0x7f1155a125e1]
 4: (()+0x39dfe9) [0x7f1155a16fe9]
 5: (()+0xb8c80) [0x7f1154d86c80]
 6: (()+0x770a) [0x7f115546370a]
 7: (clone()+0x6d) [0x7f11544ec82d]

From: /ceph/teuthology-archive/pdonnell-2017-01-26_18:37:20-multimds:thrash-wip-multimds-tests-testing-basic-mira/752781/remote/mira025/log/ceph-mds.b.log.gz

This failure happened before any thrashing.


Related issues 1 (1 open0 closed)

Related to CephFS - Bug #40613: kclient: .handle_message_footer got old message 1 <= 648 0x558ceadeaac0 client_session(request_renewcaps seq 12), discardingNew

Actions
Actions #2

Updated by Sage Weil about 7 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Greg Farnum about 7 years ago

  • Project changed from CephFS to Ceph
  • Subject changed from msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")
  • Category changed from 90 to msgr
  • Assignee set to Haomai Wang

Pinging Haomai. Sounds like the MDS is triggering cases we didn't hit in the OSD?

Actions #4

Updated by Haomai Wang about 7 years ago

client <-> MDS with stateful_server policy, then mds crashed because of old msg seq.

is this teuthology job can reproduce? we need to improve to debug_ms to verify the problem.

ping @Patrick Donnelly

Actions #5

Updated by Patrick Donnelly about 7 years ago

I has shown up many times in the multimds thrasher runs. I'll raise debug_ms for future runs.

Actions #6

Updated by Patrick Donnelly about 7 years ago

Here you go: /ceph/teuthology-archive/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi149/log/ceph-mds.f.log.gz

Actions #7

Updated by Haomai Wang about 7 years ago

Needs client log too.. http://qa-proxy.ceph.com/teuthology/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi165/log/ shows empty...

BTW I search the whole repo that "ms_die_on_old_message" doesn't set to true in qa. why it will trigger assert? Do you enable this in this branch?

Actions #8

Updated by Patrick Donnelly about 7 years ago

Client logs are missing because it's the kernel client. I will need to rerun the test suite to see if I can coax a failure with the ceph-fuse client (it has happened before with ceph-fuse).

I did not enable that config which indeed makes this more strange.

Actions #9

Updated by Haomai Wang about 7 years ago

OH, I guess kernel client handle reconnect seq inconsistent with async msgr. if fuse client help, it would be great.

Actions #10

Updated by Greg Farnum about 7 years ago

It looks like teuthology sets "ms die on old message = true" in its ceph.conf.template file.

Haomai, we don't expect the kernel client to have any problems with the AsyncMessenger, right?

Actions #11

Updated by Haomai Wang about 7 years ago

yes..

Actions #12

Updated by Patrick Donnelly about 7 years ago

Haomai, here's a test run:

http://pulpito.ceph.com/pdonnell-2017-02-17_15:35:17-multimds:thrash-master-testing-basic-smithi/826081/

and here's the kernel log with "echo file net/ceph/messenger.c +p >/sys/kernel/debug/dynamic_debug/control" executed on the client:

teuthology:/home/pdonnell/826081/kern.log.gz

Actions #14

Updated by John Spray about 7 years ago

  • Target version set to v12.0.0
Actions #15

Updated by Haomai Wang about 7 years ago

Ah, I checked Patrick Donnelly's log before. But I'm really have no experience on ceph kernel codes, I don't have any idea why kernel msgr doesn't react proper reconnect feature.

Anyone who can help on this case? @yanzheng?

Actions #16

Updated by Haomai Wang about 7 years ago

BTW it looks this crash only happen when multimds? kernel client + multimds is a clue.

Actions #17

Updated by John Spray about 7 years ago

Actions #18

Updated by Haomai Wang about 7 years ago

I read a little at kernel/net/ceph/messenger.cc it looks not respect on reconnect seq actually. maybe we could disable ms_die_on_old_message test in qa with kernel client

Actions #19

Updated by Haomai Wang about 7 years ago

I suspect simple msgr also has this problem, maybe we can try this

Actions #20

Updated by John Spray about 7 years ago

  • Subject changed from async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")
  • Assignee changed from Haomai Wang to Zheng Yan

Thanks for the input Haomai.

Kernel git log has this commit claiming to implement reconnect_seq:

commit 3a23083bda56850a1dc0e1c6d270b1f5dc789f07
Author: Sage Weil <sage@inktank.com>
Date:   Mon Mar 25 08:47:40 2013 -0700

    libceph: implement RECONNECT_SEQ feature

    This is an old protocol extension that allows the client and server to
    avoid resending old messages after a reconnect (following a socket error).
    Instead, the exchange their sequence numbers during the handshake.  This
    avoids sending a bunch of useless data over the socket.

    It has been supported in the server code since v0.22 (Sep 2010).

    Signed-off-by: Sage Weil <sage@inktank.com>
    Reviewed-by: Alex Elder <elder@inktank.com>

However, clearly it isn't working correctly.

I notice that the multimds tests have a ``ms inject socket failures`` in their thrashing config, whereas the general kcephfs tests do not.

I've created a jcsp/wip-18690-qa branch with the .yaml files tweaked to disable "die on old message" for the multimds tests, and enable msgr connection failures in the kcephfs tests, so that hopefully we can temporarily ignore this for multimds testing, and at the same time reproduce it in the kcephfs suite to get the underlying cause fixed.

Actions #21

Updated by John Spray about 7 years ago

  • Project changed from Ceph to Linux kernel client
  • Category changed from msgr to libceph

OK, here is the issue reproduced on the kcephfs suite, with the qa/ modification in jcsp/wip-18690-qa

http://pulpito.ceph.com/jspray-2017-04-12_23:35:53-kcephfs:thrash-master-testing-basic-smithi/

Actions #24

Updated by Patrick Donnelly almost 5 years ago

  • Related to Bug #40613: kclient: .handle_message_footer got old message 1 <= 648 0x558ceadeaac0 client_session(request_renewcaps seq 12), discarding added
Actions

Also available in: Atom PDF