Bug #47723: corrupted ceph_msg_connect message - Linux kernel client - Ceph

Actions

Copy link

Bug #47723

closed

corrupted ceph_msg_connect message

Added by Patrick Donnelly over 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Ilya Dryomov

Category:

libceph

Target version:

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Crash signature (v1):

Crash signature (v2):

Description

2020-09-29T21:42:02.184+0000 7fb6424e8700  5 mds.beacon.c set_want_state: up:replay -> up:reconnect
...
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c  existing session 0x55dc7480d600 for client.4556 v1:172.21.15.116:0/2729232545 existing con 0, new/authorizing con 0x55dc75597800
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c parse_caps: parsing auth_cap_str='allow'
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c ms_handle_accept v1:172.21.15.116:0/2729232545 con 0x55dc75597800 session 0x55dc7480d600
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c  session connection 0 -> 0x55dc75597800
...
2020-09-29T21:42:02.594+0000 7fb649cf7700  1 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] <== client.4556 v1:172.21.15.116:0/2729232545 2 ==== client_caps(flush ino 0x10000001a9b 1 seq 0 tid 7157 caps=pAsLsXsFsc dirty=Fxw wanted=Fc follows 1 size 2312/0 mtime 2005-06-03T16:32:07.000000+0000 tws 1 xattrs(v=18446618842528883768 l=0)) v10 ==== 236+0+0 (unknown 3718027454 0 0) 0x55dc747e1b00 con 0x55dc75597800
...
2020-09-29T21:42:02.598+0000 7fb64c4fc700  0 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] >> v1:172.21.15.116:0/2729232545 conn(0x55dc75597800 legacy=0x55dc753f5800 unknown :6829 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
...
2020-09-29T21:42:02.600+0000 7fb64c4fc700  0 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] >> v1:172.21.15.116:0/2729232545 conn(0x55dc75597800 legacy=0x55dc753f5800 unknown :6829 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
...
2020-09-29T21:42:05.960+0000 7fb64d4fe700 10 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] accept_conn 0x55dc75597800 v1:172.21.15.116:0/2729232545
2020-09-29T21:42:05.960+0000 7fb64d4fe700 10 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] >> v1:172.21.15.116:0/2729232545 conn(0x55dc75597800 legacy=0x55dc753f5800 unknown :6829 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 62 remaining bytes 0
2020-09-29T21:42:05.960+0000 7fb649cf7700 10 mds.c  existing session 0x55dc7480d600 for client.4556 v1:172.21.15.116:0/2729232545 existing con 0x55dc75597800, new/authorizing con 0x55dc75597800
2020-09-29T21:42:05.960+0000 7fb649cf7700 10 mds.c parse_caps: parsing auth_cap_str='allow'
2020-09-29T21:42:05.960+0000 7fb649cf7700 10 mds.c ms_handle_accept v1:172.21.15.116:0/2729232545 con 0x55dc75597800 session 0x55dc7480d600
...
2020-09-29T21:42:15.176+0000 7fb649cf7700  1 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] <== client.4556 v1:172.21.15.116:0/2729232545 2 ==== client_session(request_renewcaps seq 11) ==== 28+0+0 (unknown 2291222804 0 0) 0x55dc74878fc0 con 0x55dc75597800
...
2020-09-29T21:42:51.263+0000 7fb6484f4700  7 mds.0.server reconnect_tick: last seen 2.29396 seconds ago, extending reconnect interval

From: /ceph/teuthology-archive/pdonnell-2020-09-29_05:26:41-multimds-wip-pdonnell-testing-20200929.022151-distro-basic-smithi/5480231/remote/smithi045/log/ceph-mds.c.log.gz

The problem seems to be that the induced socket failures prevent the reconnect from completing. The final client_reconnect message is never received.

Symptom:

2020-09-29T21:48:04.946 ERROR:tasks.mds_thrash.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200929.022151/qa/tasks/mds_thrash.py", line 124, in _run
    self.do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200929.022151/qa/tasks/mds_thrash.py", line 310, in do_thrash
    status = self.wait_for_stable(rank, gid)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200929.022151/qa/tasks/mds_thrash.py", line 215, in wait_for_stable
    raise RuntimeError('timeout waiting for cluster to stabilize')
RuntimeError: timeout waiting for cluster to stabilize
2020-09-29T21:48:07.907 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.fs.[cephfs] failed

From: /ceph/teuthology-archive/pdonnell-2020-09-29_05:26:41-multimds-wip-pdonnell-testing-20200929.022151-distro-basic-smithi/5480231/teuthology.log

Related issues 1 (1 open — 0 closed)

Actions

Copy link

Updated by Patrick Donnelly over 3 years ago

Status changed from New to In Progress
Assignee set to Jeff Layton

I believe Jeff is looking into this...

Actions

Copy link

Updated by Jeff Layton over 3 years ago

This is not something I've seen before. What exactly happened in this test?

Looking at /ceph/teuthology-archive/pdonnell-2020-09-29_05:26:41-multimds-wip-pdonnell-testing-20200929.022151-distro-basic-smithi/5480231/console_logs/smithi116.log:

[  719.219017] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  720.607706] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  721.458074] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  722.694038] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  726.941978] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  727.264780] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  731.704199] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  732.198214] libceph: mds0 (1)172.21.15.45:6828 socket error on write
[  733.243453] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  734.265253] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state OPEN)
[  735.216433] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state CONNECTING)
[  736.240421] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state CONNECTING)
[  738.224428] libceph: mds0 (1)172.21.15.45:6828 socket error on write
[  742.256582] libceph: mds0 (1)172.21.15.45:6828 socket closed (con state CONNECTING)
[  750.576662] libceph: mds0 (1)172.21.15.45:6828 socket error on write
[  752.821328] ceph: mds0 reconnect start
[  752.846264] libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
[  754.672598] libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
[  754.681534] libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

That version mismatch looks suspicious since the numbers are the same. The code seems to pop that message in response to a CEPH_MSGR_TAG_BADPROTOVER tag in the reply. I'll look to see if the MDS prints anything in this situation. I have to wonder if there may be some sort of encoding/decoding issue, or a message alignment problem.

Actions

Copy link

Updated by Jeff Layton over 3 years ago

I see this in the MDS log:

2020-09-29T21:42:02.952+0000 7fb64d4fe700 10 --1- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] >> v1:172.21.15.116:0/2729232545 conn(0x55dc75596c00 0x55dc753f6800 :6829 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 0

The version sent is hardcoded to a macro, so this looks like it may be an alignment or parsing issue.

Actions

Copy link

Updated by Patrick Donnelly over 3 years ago

Jeff Layton wrote:

I see this in the MDS log:

[...]

The version sent is hardcoded to a macro, so this looks like it may be an alignment or parsing issue.

This was with the RHEL8 stock kernel. While I'm not certain this does not happen with the testing kernel, maybe it's already been fixed by Ilya. He's had his head in the kernel messaging layer for the last few months. Let's ask him to have a look?

Actions

Copy link

Updated by Jeff Layton over 3 years ago

Assignee changed from Jeff Layton to Ilya Dryomov

Assigning to Ilya for now, though the problem may still be in cephfs code. Any thoughts on this?

Actions

Copy link

Updated by Jeff Layton over 3 years ago

Looks like the client's kernel was 4.18.0-193.el8.x86_64

Actions

Copy link

Updated by Ilya Dryomov over 3 years ago

Project changed from CephFS to Linux kernel client
Subject changed from mds: stuck in reconnect because kernel client does not send final reconnect message to corrupted ceph_msg_connect message
Category set to libceph

No, this doesn't ring a bell, and it doesn't look like a cephfs bug. Transferring to libceph.

Actions

Copy link

Updated by Ilya Dryomov over 3 years ago

After staring at this for while, I'm pretty sure that the actual ceph_msg_connect message is fine, but garbage bytes from somewhere else get written into the socket. From the MDS log (which has some messenger logging, but not enough to see the full picture), the byte sequence seems to be "00 00 00 04 00 00 55 05 00 00 01 00 xx xx xx xx xx xx xx xx 00 00 00 00", so this is not an encoding or an alignment issue. Looks like a stray pointer or a bad messenger state, I'll keep digging...

Actions

Copy link