Project

General

Profile

Bug #15849

msg/async: connection race hang

Added by Sage Weil over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
msgr
Target version:
-
Start date:
05/11/2016
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

osd.301 and osd.344 are tryign to talk to each other.

osd.301 log includes (this is debug ms = 10)

2016-05-11 18:50:03.223034 7fd4c3e60700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0).accept sd=424
2016-05-11 18:50:03.457931 7fd4b7ff0700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._connect csq=0
2016-05-11 18:50:03.457972 7fd4b7ff0700 10 -- 128.142.162.80:7064/1743276 get_connection osd.344 128.142.162.145:6966/1660664 new 0x7fd4fd640000
2016-05-11 18:50:03.457995 7fd4b7ff0700 20 osd.301 36721 share_map_peer 0x7fd4fd640000 don't know epoch, doing nothing
2016-05-11 18:50:03.458028 7fd4b7ff0700  1 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7fd4f99a2600 pg_query(4.795b epoch 36721) v3
2016-05-11 18:50:03.755307 7fd4c3e60700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=1809 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0).handle_write started.
2016-05-11 18:50:03.882483 7fd4c365f700  1 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=424 128.142.162.145:38612/0
2016-05-11 18:50:03.882506 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection write banner and addr done: :/0
2016-05-11 18:50:03.999769 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection accept peer addr is 128.142.162.145:6966/1660664
2016-05-11 18:50:03.999776 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept of host_type 4, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0
2016-05-11 18:50:03.999783 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept my proto 10, their proto 10
2016-05-11 18:50:03.999831 7fd4c365f700 10 osd.301 36721  new session 0x7fd4f9cbabe0 con=0x7fd4f9bd5800 addr=128.142.162.145:6966/1660664
2016-05-11 18:50:03.999853 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept setting up session_security.
2016-05-11 18:50:03.999862 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept existing 0x7fd4fd640000.gseq 0 <= 3819, looks ok
2016-05-11 18:50:03.999868 7fd4c365f700  0 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_RE
2016-05-11 18:50:03.999874 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connection race, existing 0x7fd4fd640000.cseq 0 == 0, sending WAIT

note that the outgoing connection doesn't make any progress..

on osd.344, we see

2016-05-11 18:50:00.238331 7f53e7c56700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._connect csq=0
2016-05-11 18:50:00.238380 7f53e7c56700 10 -- 128.142.162.145:6966/1660664 get_connection osd.301 128.142.162.80:7064/1743276 new 0x7f541b1c2800
2016-05-11 18:50:00.238399 7f53e7c56700 20 osd.344 36721 share_map_peer 0x7f541b1c2800 don't know epoch, doing nothing
2016-05-11 18:50:00.238423 7f53e7c56700  1 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7f542cd64ee0 pg_notify(4.795b(6) epoch 36721) v5
2016-05-11 18:50:00.270710 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0).handle_write started.
2016-05-11 18:50:03.907699 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_BANNER pgs=0 cs=0 l=0)._process_connection get banner, ready to send banner
2016-05-11 18:50:03.907735 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=0)._process_connection connect write banner done: 128.142.162.80:7064/1743276
2016-05-11 18:50:03.907751 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect sent my addr 128.142.162.145:6966/1660664
2016-05-11 18:50:03.907797 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect_msg.authorizer_len=174 protocol=2
2016-05-11 18:50:03.907808 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect sending gseq=3819 cseq=0 proto=10
2016-05-11 18:50:03.907864 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).handle_write started.
2016-05-11 18:50:04.038005 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0)._process_connection reply.authorizer_len=36
2016-05-11 18:50:04.038053 7f53f41ab700  3 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got WAIT (connection race)
2016-05-11 18:50:04.038063 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_WAIT pgs=0 cs=0 l=0).handle_write started.

and then we wait indefinitely, since the 'winning' connection doesn't complete.


Related issues

Copied to Ceph - Backport #16380: jewel: msg/async: connection race hang Resolved

History

#1 Updated by Sage Weil over 2 years ago

  • Description updated (diff)
  • Category set to msgr
  • Status changed from New to Verified
  • Priority changed from Normal to Urgent

#2 Updated by Haomai Wang over 2 years ago

it looks something wrong with nonblock connect logic, I would like to see the whole log file so I can find out the reconnect error message log.(https://github.com/ceph/ceph/blob/master/src/msg/async/net_handler.cc#L140)

#4 Updated by Samuel Just over 2 years ago

  • Status changed from Verified to Need Review
  • Assignee set to Haomai Wang

#5 Updated by Haomai Wang over 2 years ago

  • Status changed from Need Review to Pending Backport

pending backport

#6 Updated by Haomai Wang over 2 years ago

  • Backport set to Jewel

#7 Updated by Nathan Cutler over 2 years ago

  • Backport changed from Jewel to jewel

#8 Updated by Nathan Cutler over 2 years ago

#9 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF