Bug #37544

closed

mds: reconnect of client during thrashing fails

Added by Patrick Donnelly over 5 years ago. Updated over 5 years ago.

Status: Duplicate
Priority: High
Assignee: -
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-11-29 11:29:46.056 7f13ebdd7700  1 mds.a-s Updating MDS map to version 84 from mon.2
...
2018-11-29 11:29:46.056 7f13ebdd7700  1 mds.0.83 handle_mds_map state change up:replay --> up:reconnect
...
2018-11-29 11:29:46.060 7f13ee5dc700  0 -- 172.21.15.98:6817/607685025 >> 172.21.15.189:56294/184203095 conn(0x2cea480 legacy :6817 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2018-11-29 11:29:46.060 7f13ee5dc700  1 -- 172.21.15.98:6817/607685025 >> 172.21.15.189:56294/184203095 conn(0x2cea480 legacy :6817 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2018-11-29 11:29:46.060 7f13ee5dc700  1 -- 172.21.15.98:6817/607685025 >> 172.21.15.189:56294/184203095 conn(0x2cea480 legacy :6817 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0). write connect message reply failed
2018-11-29 11:29:46.060 7f13ebdd7700  5 mds.a-s ms_handle_reset on 172.21.15.189:56294/184203095
2018-11-29 11:29:46.060 7f13ebdd7700  1 -- 172.21.15.98:6817/607685025 >> 172.21.15.189:56294/184203095 conn(0x2cea480 legacy :6817 s=STATE_CLOSED l=0).mark_down
2018-11-29 11:29:46.060 7f13ee5dc700  1 -- 172.21.15.98:6817/607685025 >> - conn(0x391b200 legacy :6817 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=33 172.21.15.189:41588/0
2018-11-29 11:29:46.060 7f13ee5dc700 10 mds.a-s  existing session 0x3785c00 for client.4418 172.21.15.189:56294/184203095 existing con 0, new/authorizing con 0x391b200
2018-11-29 11:29:46.060 7f13ee5dc700 10 mds.a-s ms_handle_authentication: parsing auth_cap_str='allow'
2018-11-29 11:29:46.060 7f13ee5dc700  0 -- 172.21.15.98:6817/607685025 >> 172.21.15.189:56294/184203095 conn(0x391b200 legacy :6817 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2018-11-29 11:29:46.060 7f13ee5dc700 10 mds.a-s  existing session 0x3785c00 for client.4418 172.21.15.189:56294/184203095 existing con 0, new/authorizing con 0x391b200
2018-11-29 11:29:46.060 7f13ee5dc700 10 mds.a-s ms_handle_authentication: parsing auth_cap_str='allow'
2018-11-29 11:29:46.060 7f13ebdd7700 10 mds.a-s ms_handle_accept 172.21.15.189:56294/184203095 con 0x391b200 session 0x3785c00
2018-11-29 11:29:46.060 7f13ebdd7700 10 mds.a-s  session connection 0 -> 0x391b200
...
2018-11-29 11:30:32.627 7f13e95d2700  1 mds.0.server reconnect gives up on client.4418 172.21.15.189:56294/184203095

From: /ceph/teuthology-archive/pdonnell-2018-11-29_06:44:45-fs-wip-pdonnell-testing-20181129.042324-distro-basic-smithi/3291698/remote/smithi098/log/ceph-mds.a-s.log.gz

Seems to happen reliably: /ceph/teuthology-archive/pdonnell-2018-12-07_01:03:31-fs-wip-pdonnell-testing-20181129.042324-distro-basic-smithi/3312640/teuthology.log

The injected socket failure (the "injecting socket failure" line in the log above) is probably related.
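To check that suspicion against a full MDS log, a minimal sketch such as the following could correlate the two events. It assumes the line layout seen in the excerpt above (timestamp, thread id, debug level, message). The names `summarize`, `SAMPLE_LOG`, and the regular expressions are hypothetical helpers written for this illustration and are not part of Ceph or teuthology.

```python
import re
from datetime import datetime

# Three lines copied from the excerpt above; a real run would read the full log.
SAMPLE_LOG = """\
2018-11-29 11:29:46.056 7f13ebdd7700  1 mds.0.83 handle_mds_map state change up:replay --> up:reconnect
2018-11-29 11:29:46.060 7f13ee5dc700  0 -- 172.21.15.98:6817/607685025 >> 172.21.15.189:56294/184203095 conn(0x2cea480 legacy :6817 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2018-11-29 11:30:32.627 7f13e95d2700  1 mds.0.server reconnect gives up on client.4418 172.21.15.189:56294/184203095
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
# Assumed layout: "<date> <time> <thread id> <debug level> <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+\d+ (.*)$")
# Peer address as printed by the messenger after ">>"
PEER_RE = re.compile(r">> (\d+\.\d+\.\d+\.\d+:\d+/\d+) ")
GIVE_UP_RE = re.compile(r"reconnect gives up on (client\.\d+) (\S+)")


def summarize(log_text):
    """Return (client, addr, seconds_waited, saw_injected_failure) per give-up line."""
    reconnect_start = None   # time the MDS entered up:reconnect
    injected_peers = set()   # peers whose connection had a failure injected
    gave_up = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." elisions and anything in another format
        ts = datetime.strptime(m.group(1), TS_FORMAT)
        msg = m.group(2)
        if "--> up:reconnect" in msg:
            reconnect_start = ts
        elif "injecting socket failure" in msg:
            peer = PEER_RE.search(msg)
            if peer:
                injected_peers.add(peer.group(1))
        else:
            g = GIVE_UP_RE.search(msg)
            if g:
                client, addr = g.groups()
                waited = None
                if reconnect_start is not None:
                    waited = (ts - reconnect_start).total_seconds()
                gave_up.append((client, addr, waited, addr in injected_peers))
    return gave_up


if __name__ == "__main__":
    for client, addr, waited, injected in summarize(SAMPLE_LOG):
        print(f"{client} {addr}: gave up after {waited} s, injected failure seen: {injected}")
```

On the three sample lines, the MDS gives up on client.4418 about 46.6 seconds after entering up:reconnect, and the same peer address had a socket failure injected earlier. That is consistent with the suspicion above but does not by itself prove causation.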


Related issues 1 (0 open, 1 closed)

Is duplicate of CephFS - Bug #36507: client: connection failure during reconnect causes client to hang | Duplicate | Patrick Donnelly

#1

Updated by Zheng Yan over 5 years ago

  • Assignee deleted (Zheng Yan)

2018-11-29 11:29:46.060 7f13ee5dc700  0 -- 172.21.15.98:6817/607685025 >> 172.21.15.189:56294/184203095 conn(0x391b200 legacy :6817 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION

2018-11-29 11:30:32.627 7f13e95d2700  1 mds.0.server reconnect gives up on client.4418 172.21.15.189:56294/184203095

Duplicate of http://tracker.ceph.com/issues/36507

#2

Updated by Patrick Donnelly over 5 years ago

  • Has duplicate Bug #36507: client: connection failure during reconnect causes client to hang added

#3

Updated by Patrick Donnelly over 5 years ago

  • Has duplicate deleted (Bug #36507: client: connection failure during reconnect causes client to hang)

#4

Updated by Patrick Donnelly over 5 years ago

  • Is duplicate of Bug #36507: client: connection failure during reconnect causes client to hang added

#5

Updated by Patrick Donnelly over 5 years ago

  • Status changed from New to Duplicate