Bug #36666

closed

msg: rejoin message queued but not sent

Added by Patrick Donnelly over 5 years ago. Updated about 5 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

mds.0 gets stuck in rejoin because mds.1 never receives the message mds.0 sent. This results in the following failure:

2018-10-28T07:39:10.015 ERROR:tasks.mds_thrash.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20181026.235100/qa/tasks/mds_thrash.py", line 229, in _run
    self.do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20181026.235100/qa/tasks/mds_thrash.py", line 416, in do_thrash
    status = self.wait_for_stable(rank, gid)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20181026.235100/qa/tasks/mds_thrash.py", line 320, in wait_for_stable
    raise RuntimeError('timeout waiting for cluster to stabilize')
RuntimeError: timeout waiting for cluster to stabilize

From: /ceph/teuthology-archive/pdonnell-2018-10-27_02:43:49-multimds-wip-pdonnell-testing-20181026.235100-distro-basic-smithi/3190852/teuthology.log

mds.0:

2018-10-28 07:33:05.267 7fdfc76a5700 10 mds.0.340  resolve set is 0
2018-10-28 07:33:05.267 7fdfc96a9700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.71:6815/146388207 conn(0x557085370800 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 136 remaining bytes 0
2018-10-28 07:33:05.267 7fdfc9eaa700 10 NetHandler reconnect reconnect: Operation already in progress
2018-10-28 07:33:05.267 7fdfc76a5700  7 mds.0.cache set_recovery_set 1,2,3,4,5,6
2018-10-28 07:33:05.267 7fdfc76a5700  1 mds.0.340  recovery set is 1,2,3,4,5,6
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTING_RE l=0).process nonblock connect inprogress
2018-10-28 07:33:05.267 7fdfc96a9700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.71:6815/146388207 conn(0x557085370800 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0).handle_my_addr_write connect sent my addr 172.21.15.71:6817/2985871366
2018-10-28 07:33:05.267 7fdfc76a5700 10 mds.0.cache send_slave_resolves
2018-10-28 07:33:05.267 7fdfc76a5700 10 mds.0.cache send_subtree_resolves
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6817/2520693749 conn(0x557085370c00 legacy :-1 s=STATE_CONNECTING_RE l=0).process connect successfully, ready to send banner
2018-10-28 07:33:05.267 7fdfc76a5700 10 mds.0.cache  claim 0x1 []                                                                                                                                                   
2018-10-28 07:33:05.267 7fdfc96a9700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.71:6815/146388207 conn(0x557085370800 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).send_connect_message connect_msg.authorizer_len=174 protocol=2
2018-10-28 07:33:05.267 7fdfc76a5700 10 mds.0.cache  claim 0x100 []
2018-10-28 07:33:05.267 7fdfc96a9700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.71:6815/146388207 conn(0x557085370800 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).send_connect_message connect sending gseq=11 cseq=0 proto=33
2018-10-28 07:33:05.267 7fdfc76a5700 10 mds.0.cache sending subtee resolve to mds.1      
!!! pruning out messages not relating to connection with mds.1                                                                                                                           
2018-10-28 07:33:05.267 7fdfc76a5700  1 -- 172.21.15.71:6817/2985871366 _send_to--> mds 172.21.15.91:6818/1554952229 -- mds_resolve(2+0 subtrees +0 slave requests) v1 -- ?+0 0x557084a08c80
2018-10-28 07:33:05.267 7fdfc76a5700  1 -- 172.21.15.71:6817/2985871366 --> 172.21.15.91:6818/1554952229 -- mds_resolve(2+0 subtrees +0 slave requests) v1 -- 0x557084a08c80 con 0
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTING_RE l=0).process connect successfully, ready to send banner
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 9 remaining bytes 0
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING pgs=0 cs=1 l=0).handle_client_banner_write connect write banner done: 172.21.15.91:6818/1554952229
2018-10-28 07:33:05.267 7fdfc9eaa700  4 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).handle_write                             
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=1 l=0).write_event
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 0 remaining bytes 0
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 136 remaining bytes 0
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=1 l=0).handle_my_addr_write connect sent my addr 172.21.15.71:6817/2985871366
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).send_connect_message connect_msg.authorizer_len=174 protocol=2
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).send_connect_message connect sending gseq=10 cseq=1 proto=33
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 207 remaining bytes 0
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth reply.authorizer_len=32
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_auth connect got auth challenge
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).send_connect_message connect_msg.authorizer_len=174 protocol=2
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).send_connect_message connect sending gseq=10 cseq=1 proto=33
2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 207 remaining bytes 0
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).wait_connect_reply_auth reply.authorizer_len=36
2018-10-28 07:33:05.268 7fdfc9eaa700  0 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).session_reset started
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).discard_out_queue started
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).send_connect_message connect_msg.authorizer_len=174 protocol=2
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).send_connect_message connect sending gseq=10 cseq=0 proto=33
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 207 remaining bytes 0
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).wait_connect_reply_auth reply.authorizer_len=36
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 got CEPH_MSGR_TAG_SEQ, reading acked_seq and writing in_seq
2018-10-28 07:33:05.268 7fdfc9eaa700  2 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_ack_seq got newly_acked_seq 0 vs out_seq 0
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).discard_requeued_up_to 0
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 8 remaining bytes 0
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_in_seq_write send in_seq done
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=40 cs=1 l=0).client_ready connect success 1, lossy = 0, features 4611087854031142911
2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=40 cs=1 l=0).client_ready setting up session_security with auth 0x557083d03b30
2018-10-28 07:39:47.594 7fdfc9eaa700  1 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 36
2018-10-28 07:39:47.594 7fdfc9eaa700  1 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2018-10-28 07:39:47.594 7fdfc9eaa700  1 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=OPENED pgs=40 cs=1 l=0).handle_message read tag failed
2018-10-28 07:39:47.594 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=OPENED pgs=40 cs=1 l=0).fault with nothing to send, going to standby

From: /ceph/teuthology-archive/pdonnell-2018-10-27_02:43:49-multimds-wip-pdonnell-testing-20181026.235100-distro-basic-smithi/3190852/remote/smithi071/log/ceph-mds.k-s.log.gz

mds.1:

2018-10-28 07:33:05.268 7f6de9a1a700  1 mds.l-s Updating MDS map to version 341 from mon.1
2018-10-28 07:33:05.268 7f6deca20700 10  Processor -- accept listen_fd=18
2018-10-28 07:33:05.268 7f6deca20700 10  Processor -- accept accepted incoming on sd 33
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> - conn(0x555a5966b000 legacy :-1 s=STATE_NONE l=0).accept sd=33 on 172.21.15.71:39284/0
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.l-s      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.l-s  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.l-s map says I am 172.21.15.91:6818/1554952229 mds.1.313 state up:active
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.l-s handle_mds_map: handling map as rank 1
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.1.313  resolve set is 0
2018-10-28 07:33:05.268 7f6deca20700  1 -- 172.21.15.91:6818/1554952229 >> - conn(0x555a5966b000 legacy :6818 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=33 172.21.15.71:39284/0
2018-10-28 07:33:05.268 7f6de9a1a700  7 mds.1.cache set_recovery_set 0,2,3,4,5,6
2018-10-28 07:33:05.268 7f6de9a1a700  1 mds.1.313  recovery set is 0,2,3,4,5,6
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.1.cache send_slave_resolves
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.1.snapclient sync
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.1.snapclient refresh want 1
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.1.cache send_resolves waiting for snapclient cache to sync
!!! Pruning out messages except those from mds.0
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> - conn(0x555a5966ac00 legacy :6818 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is 172.21.15.71:6817/2985871366
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 2, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 33, their proto 33
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 174
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: challenging authorizer
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 58 remaining bytes 0
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 2, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 33, their proto 33
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 174
2018-10-28 07:33:05.268 7f6deca20700 10 mds.l-s  new session 0x555a59668e00 for mds.7343 172.21.15.71:6817/2985871366 con 0x555a5966ac00
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept setting up session_security.
2018-10-28 07:33:05.268 7f6deca20700  0 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 62 remaining bytes 0
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept of host_type 2, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept my proto 33, their proto 33
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 174
2018-10-28 07:33:05.268 7f6deca20700 10 mds.l-s  new session 0x555a59679500 for mds.7343 172.21.15.71:6817/2985871366 con 0x555a5966ac00
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept setting up session_security.
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept new session
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=10 cs=1 l=0).open accept success, connect_seq = 1 in_seq=0, sending READY
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=10 cs=1 l=0).open accept features 4611087854031142911 authorizer_protocol 2
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 70 remaining bytes 0
2018-10-28 07:33:05.268 7f6de9a1a700 10 mds.l-s ms_handle_accept 172.21.15.71:6817/2985871366 con 0x555a5966ac00 session 0x555a59679500
2018-10-28 07:33:05.268 7f6deca20700  2 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_HANDLED_CONNECT_MSG pgs=10 cs=1 l=0).handle_seq accept get newly_acked_seq 0
2018-10-28 07:33:05.268 7f6deca20700 10 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_HANDLED_CONNECT_MSG pgs=10 cs=1 l=0).discard_requeued_up_to 0

From: /ceph/teuthology-archive/pdonnell-2018-10-27_02:43:49-multimds-wip-pdonnell-testing-20181026.235100-distro-basic-smithi/3190852/remote/smithi091/log/ceph-mds.l-s.log.gz


Related issues (1 open, 1 closed)

Related to Messengers - Bug #36540: msg: messages are queued but not sent (New)
Is duplicate of Messengers - Bug #36612: msg/async: connection stall (Resolved, 2018-10-28)

Actions #1

Updated by Patrick Donnelly over 5 years ago

  • Related to Bug #36540: msg: messages are queued but not sent added
Actions #2

Updated by Zheng Yan over 5 years ago

Socket failure injection caused mds.0 to send a connect message with connect_seq == 1:

2018-10-28 07:33:05.267 7fdfc9eaa700  0 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
...

2018-10-28 07:33:05.267 7fdfc9eaa700  1 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 33
...

2018-10-28 07:33:05.267 7fdfc9eaa700  1 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
...
2018-10-28 07:33:05.267 7fdfc9eaa700  1 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0).handle_server_banner_and_identify read banner and identify addresses failed
...

2018-10-28 07:33:05.267 7fdfc9eaa700  0 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0).fault initiating reconnect
...

2018-10-28 07:33:05.267 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=STATE_CONNECTING_RE l=0).process nonblock connect inprogress

mds.1 received the reconnect message. It sent RESETSESSION back because connect_seq is not 0.

2018-10-28 07:33:05.268 7f6deca20700  0 -- 172.21.15.91:6818/1554952229 >> 172.21.15.71:6817/2985871366 conn(0x555a5966ac00 legacy :6818 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION

Relevant code:

CtPtr ProtocolV1::handle_connect_message_2() {
....

  else if (!replacing && connect_msg.connect_seq > 0) {
    // we reset, and they are opening a new session
    ldout(cct, 0) << __func__ << " accept we reset (peer sent cseq " 
                  << connect_msg.connect_seq << "), sending RESETSESSION" 
                  << dendl;
    return send_connect_message_reply(CEPH_MSGR_TAG_RESETSESSION, reply,
                                      authorizer_reply);
  } else {
    // new session
    ldout(cct, 10) << __func__ << " accept new session" << dendl;
    existing = nullptr;
    return open(reply, authorizer_reply);
  }
}

mds.0 received the RESETSESSION. It discarded the out_q.

2018-10-28 07:33:05.268 7fdfc9eaa700 10 -- 172.21.15.71:6817/2985871366 >> 172.21.15.91:6818/1554952229 conn(0x5570849d7c00 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).session_reset started

Actions #3

Updated by Zheng Yan over 5 years ago

The events that trigger this issue are clear, but I don't have a clear idea how to fix it.

Actions #4

Updated by Sage Weil over 5 years ago

  • Status changed from New to Duplicate

dup of #36612 which is now fixed!

Actions #5

Updated by Sage Weil over 5 years ago

  • Is duplicate of Bug #36612: msg/async: connection stall added
Actions #6

Updated by Sage Weil over 5 years ago

  • Project changed from Ceph to RADOS
Actions #7

Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers