Bug #47723: corrupted ceph_msg_connect message - Linux kernel client - Ceph

Bug #47723

closed

corrupted ceph_msg_connect message

Added by Patrick Donnelly over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: Ilya Dryomov
Category: libceph
Source: Q/A
Severity: 3 - minor

Description

2020-09-29T21:42:02.184+0000 7fb6424e8700  5 mds.beacon.c set_want_state: up:replay -> up:reconnect
...
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c  existing session 0x55dc7480d600 for client.4556 v1:172.21.15.116:0/2729232545 existing con 0, new/authorizing con 0x55dc75597800
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c parse_caps: parsing auth_cap_str='allow'
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c ms_handle_accept v1:172.21.15.116:0/2729232545 con 0x55dc75597800 session 0x55dc7480d600
2020-09-29T21:42:02.593+0000 7fb649cf7700 10 mds.c  session connection 0 -> 0x55dc75597800
...
2020-09-29T21:42:02.594+0000 7fb649cf7700  1 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] <== client.4556 v1:172.21.15.116:0/2729232545 2 ==== client_caps(flush ino 0x10000001a9b 1 seq 0 tid 7157 caps=pAsLsXsFsc dirty=Fxw wanted=Fc follows 1 size 2312/0 mtime 2005-06-03T16:32:07.000000+0000 tws 1 xattrs(v=18446618842528883768 l=0)) v10 ==== 236+0+0 (unknown 3718027454 0 0) 0x55dc747e1b00 con 0x55dc75597800
...
2020-09-29T21:42:02.598+0000 7fb64c4fc700  0 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] >> v1:172.21.15.116:0/2729232545 conn(0x55dc75597800 legacy=0x55dc753f5800 unknown :6829 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
...
2020-09-29T21:42:02.600+0000 7fb64c4fc700  0 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] >> v1:172.21.15.116:0/2729232545 conn(0x55dc75597800 legacy=0x55dc753f5800 unknown :6829 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
...
2020-09-29T21:42:05.960+0000 7fb64d4fe700 10 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] accept_conn 0x55dc75597800 v1:172.21.15.116:0/2729232545
2020-09-29T21:42:05.960+0000 7fb64d4fe700 10 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] >> v1:172.21.15.116:0/2729232545 conn(0x55dc75597800 legacy=0x55dc753f5800 unknown :6829 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 62 remaining bytes 0
2020-09-29T21:42:05.960+0000 7fb649cf7700 10 mds.c  existing session 0x55dc7480d600 for client.4556 v1:172.21.15.116:0/2729232545 existing con 0x55dc75597800, new/authorizing con 0x55dc75597800
2020-09-29T21:42:05.960+0000 7fb649cf7700 10 mds.c parse_caps: parsing auth_cap_str='allow'
2020-09-29T21:42:05.960+0000 7fb649cf7700 10 mds.c ms_handle_accept v1:172.21.15.116:0/2729232545 con 0x55dc75597800 session 0x55dc7480d600
...
2020-09-29T21:42:15.176+0000 7fb649cf7700  1 -- [v2:172.21.15.45:6827/4195693115,v1:172.21.15.45:6829/4195693115] <== client.4556 v1:172.21.15.116:0/2729232545 2 ==== client_session(request_renewcaps seq 11) ==== 28+0+0 (unknown 2291222804 0 0) 0x55dc74878fc0 con 0x55dc75597800
...
2020-09-29T21:42:51.263+0000 7fb6484f4700  7 mds.0.server reconnect_tick: last seen 2.29396 seconds ago, extending reconnect interval

From: /ceph/teuthology-archive/pdonnell-2020-09-29_05:26:41-multimds-wip-pdonnell-testing-20200929.022151-distro-basic-smithi/5480231/remote/smithi045/log/ceph-mds.c.log.gz

The injected socket failures (the "read_until injecting socket failure" lines above) appear to prevent the client's reconnect from completing: the MDS never receives the final client_reconnect message.
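To check the same pattern in another run, a small log scan can help. The following is only a sketch: the marker strings are taken from the excerpt above, the function name summarize_mds_log is made up for illustration, and the assumption that a received reconnect would show up as a line containing "client_reconnect" may not hold at every debug level.

```python
import re
from collections import Counter

# Assumption: these substrings match the excerpt above; other log levels or
# Ceph versions may format the lines differently.
INJECT_MARKER = "injecting socket failure"
RECONNECT_MARKER = "client_reconnect"
CONN_RE = re.compile(r"conn\((0x[0-9a-f]+)")


def summarize_mds_log(lines):
    """Count injected socket failures per connection and report whether any
    line mentioning client_reconnect was seen."""
    failures = Counter()
    saw_reconnect = False
    for line in lines:
        if INJECT_MARKER in line:
            match = CONN_RE.search(line)
            # Fall back to "unknown" if the conn(...) prefix is missing.
            failures[match.group(1) if match else "unknown"] += 1
        if RECONNECT_MARKER in line:
            saw_reconnect = True
    return {"failures": dict(failures), "saw_client_reconnect": saw_reconnect}
```

For a compressed log such as ceph-mds.c.log.gz, the lines can be read with gzip.open(path, "rt") and passed straight to the function. A result with several failures on one connection and saw_client_reconnect set to False would match the behaviour described here.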

Symptom:

2020-09-29T21:48:04.946 ERROR:tasks.mds_thrash.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200929.022151/qa/tasks/mds_thrash.py", line 124, in _run
    self.do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200929.022151/qa/tasks/mds_thrash.py", line 310, in do_thrash
    status = self.wait_for_stable(rank, gid)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200929.022151/qa/tasks/mds_thrash.py", line 215, in wait_for_stable
    raise RuntimeError('timeout waiting for cluster to stabilize')
RuntimeError: timeout waiting for cluster to stabilize
2020-09-29T21:48:07.907 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.fs.[cephfs] failed

From: /ceph/teuthology-archive/pdonnell-2020-09-29_05:26:41-multimds-wip-pdonnell-testing-20200929.022151-distro-basic-smithi/5480231/teuthology.log
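The traceback is what a bounded polling loop produces when the condition it waits for never becomes true, which is presumably the case here if the MDS keeps extending its reconnect interval. The sketch below illustrates that pattern only; it is not the code in qa/tasks/mds_thrash.py, and the is_stable callback, the timeout and interval defaults, and the injectable clock and sleep parameters are all hypothetical.

```python
import time


def wait_for_stable(is_stable, timeout=300.0, interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_stable() until it returns True or `timeout` seconds elapse.

    Hypothetical stand-in for the thrasher's wait; only the error message
    is taken from the traceback above.
    """
    deadline = clock() + timeout
    while True:
        if is_stable():
            return True
        if clock() >= deadline:
            raise RuntimeError('timeout waiting for cluster to stabilize')
        sleep(interval)
```

With a callback that never returns True, the loop ends in the same RuntimeError once the deadline passes.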

Related issues 1 (1 open — 0 closed)

Updated by Patrick Donnelly over 3 years ago

Updated by Jeff Layton over 3 years ago

Updated by Jeff Layton over 3 years ago

Updated by Patrick Donnelly over 3 years ago

Updated by Jeff Layton over 3 years ago

Updated by Jeff Layton over 3 years ago

Updated by Ilya Dryomov over 3 years ago

Updated by Ilya Dryomov over 3 years ago

Updated by Ilya Dryomov over 3 years ago

Updated by Ilya Dryomov over 3 years ago

Updated by Ilya Dryomov over 3 years ago

Updated by Patrick Donnelly over 3 years ago

Updated by Ilya Dryomov over 3 years ago

Updated by Ilya Dryomov over 3 years ago

Updated by Patrick Donnelly about 3 years ago

Updated by Ilya Dryomov about 3 years ago