Actions
Bug #15849
closedmsg/async: connection race hang
% Done:
0%
Source:
other
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
osd.301 and osd.344 are tryign to talk to each other.
osd.301 log includes (this is debug ms = 10)
2016-05-11 18:50:03.223034 7fd4c3e60700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0).accept sd=424 2016-05-11 18:50:03.457931 7fd4b7ff0700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._connect csq=0 2016-05-11 18:50:03.457972 7fd4b7ff0700 10 -- 128.142.162.80:7064/1743276 get_connection osd.344 128.142.162.145:6966/1660664 new 0x7fd4fd640000 2016-05-11 18:50:03.457995 7fd4b7ff0700 20 osd.301 36721 share_map_peer 0x7fd4fd640000 don't know epoch, doing nothing 2016-05-11 18:50:03.458028 7fd4b7ff0700 1 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7fd4f99a2600 pg_query(4.795b epoch 36721) v3 2016-05-11 18:50:03.755307 7fd4c3e60700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=1809 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0).handle_write started. 2016-05-11 18:50:03.882483 7fd4c365f700 1 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=424 128.142.162.145:38612/0 2016-05-11 18:50:03.882506 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection write banner and addr done: :/0 2016-05-11 18:50:03.999769 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection accept peer addr is 128.142.162.145:6966/1660664 2016-05-11 18:50:03.999776 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept of host_type 4, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0 2016-05-11 18:50:03.999783 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept my proto 10, their proto 10 2016-05-11 18:50:03.999831 7fd4c365f700 10 osd.301 36721 new session 0x7fd4f9cbabe0 con=0x7fd4f9bd5800 addr=128.142.162.145:6966/1660664 2016-05-11 18:50:03.999853 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept setting up session_security. 2016-05-11 18:50:03.999862 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept existing 0x7fd4fd640000.gseq 0 <= 3819, looks ok 2016-05-11 18:50:03.999868 7fd4c365f700 0 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_RE 2016-05-11 18:50:03.999874 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connection race, existing 0x7fd4fd640000.cseq 0 == 0, sending WAIT
note that the outgoing connection doesn't make any progress..
on osd.344, we see
2016-05-11 18:50:00.238331 7f53e7c56700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._connect csq=0 2016-05-11 18:50:00.238380 7f53e7c56700 10 -- 128.142.162.145:6966/1660664 get_connection osd.301 128.142.162.80:7064/1743276 new 0x7f541b1c2800 2016-05-11 18:50:00.238399 7f53e7c56700 20 osd.344 36721 share_map_peer 0x7f541b1c2800 don't know epoch, doing nothing 2016-05-11 18:50:00.238423 7f53e7c56700 1 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7f542cd64ee0 pg_notify(4.795b(6) epoch 36721) v5 2016-05-11 18:50:00.270710 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0).handle_write started. 2016-05-11 18:50:03.907699 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_BANNER pgs=0 cs=0 l=0)._process_connection get banner, ready to send banner 2016-05-11 18:50:03.907735 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=0)._process_connection connect write banner done: 128.142.162.80:7064/1743276 2016-05-11 18:50:03.907751 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect sent my addr 128.142.162.145:6966/1660664 2016-05-11 18:50:03.907797 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect_msg.authorizer_len=174 protocol=2 2016-05-11 18:50:03.907808 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect sending gseq=3819 cseq=0 proto=10 2016-05-11 18:50:03.907864 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).handle_write started. 2016-05-11 18:50:04.038005 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0)._process_connection reply.authorizer_len=36 2016-05-11 18:50:04.038053 7f53f41ab700 3 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got WAIT (connection race) 2016-05-11 18:50:04.038063 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_WAIT pgs=0 cs=0 l=0).handle_write started.
and then we wait indefinitely, since the 'winning' connection doesn't complete.
Actions