Project

General

Profile

Bug #18690

kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")

Added by Patrick Donnelly 10 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
libceph
Target version:
Start date:
01/26/2017
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

2017-01-26 21:55:21.740004 7f114f1a5700 10 mds.b ms_handle_accept 172.21.4.112:0/3776505678 con 0x55e0a2bb5800 session 0x55e0a2ac6300
2017-01-26 21:55:21.740294 7f1151143700  0 -- 172.21.4.128:6812/1494558506 >> 172.21.4.112:0/3776505678 conn(0x55e0a2bb5800 :6812 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=6 cs=3 l=0).process got old message 2 <= 2 0x55e0a2bc2600 client_request(client.4173:1 getattr p #1 2017-01-26 21:55:21.596971 caller_uid=0, caller_gid=0{}) v2, discarding
2017-01-26 21:55:21.742454 7f1151143700 -1 /tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: In function 'void AsyncConnection::process()' thread 7f1151143700 time 2017-01-26 21:55:21.740343
/tmp/buildd/ceph-11.1.0-6912-gaf9152f/src/msg/async/AsyncConnection.cc: 741: FAILED assert(0 == "old msgs despite reconnect_seq feature")

 ceph version 11.1.0-6912-gaf9152f (af9152f34a416a46bbf96b14b1416ca004c80f84)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f11558e6542]
 2: (AsyncConnection::process()+0x2f35) [0x7f1155a02c75]
 3: (EventCenter::process_events(int)+0x9b1) [0x7f1155a125e1]
 4: (()+0x39dfe9) [0x7f1155a16fe9]
 5: (()+0xb8c80) [0x7f1154d86c80]
 6: (()+0x770a) [0x7f115546370a]
 7: (clone()+0x6d) [0x7f11544ec82d]

From: /ceph/teuthology-archive/pdonnell-2017-01-26_18:37:20-multimds:thrash-wip-multimds-tests-testing-basic-mira/752781/remote/mira025/log/ceph-mds.b.log.gz

This failure happened before any thrashing.

History

#2 Updated by Sage Weil 10 months ago

  • Priority changed from Normal to Urgent

#3 Updated by Greg Farnum 10 months ago

  • Project changed from fs to Ceph
  • Subject changed from msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")
  • Category changed from multi-MDS to msgr
  • Assignee set to Haomai Wang

Pinging Haomai. Sounds like the MDS is triggering cases we didn't hit in the OSD?

#4 Updated by Haomai Wang 10 months ago

client <-> MDS with stateful_server policy, then mds crashed because of old msg seq.

is this teuthology job can reproduce? we need to improve to debug_ms to verify the problem.

ping @Patrick Donnelly

#5 Updated by Patrick Donnelly 10 months ago

I has shown up many times in the multimds thrasher runs. I'll raise debug_ms for future runs.

#6 Updated by Patrick Donnelly 10 months ago

Here you go: /ceph/teuthology-archive/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi149/log/ceph-mds.f.log.gz

#7 Updated by Haomai Wang 10 months ago

Needs client log too.. http://qa-proxy.ceph.com/teuthology/pdonnell-2017-02-06_19:24:21-multimds:thrash-master-testing-basic-smithi/791580/remote/smithi165/log/ shows empty...

BTW I search the whole repo that "ms_die_on_old_message" doesn't set to true in qa. why it will trigger assert? Do you enable this in this branch?

#8 Updated by Patrick Donnelly 10 months ago

Client logs are missing because it's the kernel client. I will need to rerun the test suite to see if I can coax a failure with the ceph-fuse client (it has happened before with ceph-fuse).

I did not enable that config which indeed makes this more strange.

#9 Updated by Haomai Wang 10 months ago

OH, I guess kernel client handle reconnect seq inconsistent with async msgr. if fuse client help, it would be great.

#10 Updated by Greg Farnum 10 months ago

It looks like teuthology sets "ms die on old message = true" in its ceph.conf.template file.

Haomai, we don't expect the kernel client to have any problems with the AsyncMessenger, right?

#11 Updated by Haomai Wang 10 months ago

yes..

#12 Updated by Patrick Donnelly 9 months ago

Haomai, here's a test run:

http://pulpito.ceph.com/pdonnell-2017-02-17_15:35:17-multimds:thrash-master-testing-basic-smithi/826081/

and here's the kernel log with "echo file net/ceph/messenger.c +p >/sys/kernel/debug/dynamic_debug/control" executed on the client:

teuthology:/home/pdonnell/826081/kern.log.gz

#14 Updated by John Spray 9 months ago

  • Target version set to v12.0.0

#15 Updated by Haomai Wang 9 months ago

Ah, I checked Patrick Donnelly's log before. But I'm really have no experience on ceph kernel codes, I don't have any idea why kernel msgr doesn't react proper reconnect feature.

Anyone who can help on this case? @yanzheng?

#16 Updated by Haomai Wang 9 months ago

BTW it looks this crash only happen when multimds? kernel client + multimds is a clue.

#17 Updated by John Spray 8 months ago

#18 Updated by Haomai Wang 8 months ago

I read a little at kernel/net/ceph/messenger.cc it looks not respect on reconnect seq actually. maybe we could disable ms_die_on_old_message test in qa with kernel client

#19 Updated by Haomai Wang 8 months ago

I suspect simple msgr also has this problem, maybe we can try this

#20 Updated by John Spray 8 months ago

  • Subject changed from async msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature") to kclient: FAILED assert(0 == "old msgs despite reconnect_seq feature")
  • Assignee changed from Haomai Wang to Zheng Yan

Thanks for the input Haomai.

Kernel git log has this commit claiming to implement reconnect_seq:

commit 3a23083bda56850a1dc0e1c6d270b1f5dc789f07
Author: Sage Weil <sage@inktank.com>
Date:   Mon Mar 25 08:47:40 2013 -0700

    libceph: implement RECONNECT_SEQ feature

    This is an old protocol extension that allows the client and server to
    avoid resending old messages after a reconnect (following a socket error).
    Instead, the exchange their sequence numbers during the handshake.  This
    avoids sending a bunch of useless data over the socket.

    It has been supported in the server code since v0.22 (Sep 2010).

    Signed-off-by: Sage Weil <sage@inktank.com>
    Reviewed-by: Alex Elder <elder@inktank.com>

However, clearly it isn't working correctly.

I notice that the multimds tests have a ``ms inject socket failures`` in their thrashing config, whereas the general kcephfs tests do not.

I've created a jcsp/wip-18690-qa branch with the .yaml files tweaked to disable "die on old message" for the multimds tests, and enable msgr connection failures in the kcephfs tests, so that hopefully we can temporarily ignore this for multimds testing, and at the same time reproduce it in the kcephfs suite to get the underlying cause fixed.

#21 Updated by John Spray 8 months ago

  • Project changed from Ceph to Linux kernel client
  • Category changed from msgr to libceph

OK, here is the issue reproduced on the kcephfs suite, with the qa/ modification in jcsp/wip-18690-qa

http://pulpito.ceph.com/jspray-2017-04-12_23:35:53-kcephfs:thrash-master-testing-basic-smithi/

Also available in: Atom PDF