Project

General

Profile

Bug #15849

Updated by Sage Weil almost 8 years ago

osd.301 and osd.344 are tryign to talk to each other. 

 osd.301 log includes (this is debug ms = 10) 
 <pre> 
 2016-05-11 18:50:03.223034 7fd4c3e60700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0).accept sd=424 
 2016-05-11 18:50:03.457931 7fd4b7ff0700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._connect csq=0 
 2016-05-11 18:50:03.457972 7fd4b7ff0700 10 -- 128.142.162.80:7064/1743276 get_connection osd.344 128.142.162.145:6966/1660664 new 0x7fd4fd640000 
 2016-05-11 18:50:03.457995 7fd4b7ff0700 20 osd.301 36721 share_map_peer 0x7fd4fd640000 don't know epoch, doing nothing 
 2016-05-11 18:50:03.458028 7fd4b7ff0700    1 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7fd4f99a2600 pg_query(4.795b epoch 36721) v3 
 2016-05-11 18:50:03.755307 7fd4c3e60700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4fd640000 sd=1809 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0).handle_write started. 
 2016-05-11 18:50:03.882483 7fd4c365f700    1 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=424 128.142.162.145:38612/0 
 2016-05-11 18:50:03.882506 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection write banner and addr done: :/0 
 2016-05-11 18:50:03.999769 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> :/0 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection accept peer addr is 128.142.162.145:6966/1660664 
 2016-05-11 18:50:03.999776 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._process_connection accept of host_type 4, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0 
 2016-05-11 18:50:03.999783 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept my proto 10, their proto 10 
 2016-05-11 18:50:03.999831 7fd4c365f700 10 osd.301 36721    new session 0x7fd4f9cbabe0 con=0x7fd4f9bd5800 addr=128.142.162.145:6966/1660664 
 2016-05-11 18:50:03.999853 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept setting up session_security. 
 2016-05-11 18:50:03.999862 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept existing 0x7fd4fd640000.gseq 0 <= 3819, looks ok 
 2016-05-11 18:50:03.999868 7fd4c365f700    0 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_RE 
 2016-05-11 18:50:03.999874 7fd4c365f700 10 -- 128.142.162.80:7064/1743276 >> 128.142.162.145:6966/1660664 conn(0x7fd4f9bd5800 sd=424 :7064 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connection race, existing 0x7fd4fd640000.cseq 0 == 0, sending WAIT 
 </pre> 
 note that the outgoing connection doesn't make any progress.. 

 on osd.344, we see 
 <pre> 
 2016-05-11 18:50:00.238331 7f53e7c56700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=-1 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._connect csq=0 
 2016-05-11 18:50:00.238380 7f53e7c56700 10 -- 128.142.162.145:6966/1660664 get_connection osd.301 128.142.162.80:7064/1743276 new 0x7f541b1c2800 
 2016-05-11 18:50:00.238399 7f53e7c56700 20 osd.344 36721 share_map_peer 0x7f541b1c2800 don't know epoch, doing nothing 
 2016-05-11 18:50:00.238423 7f53e7c56700    1 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0). == tx == 0x7f542cd64ee0 pg_notify(4.795b(6) epoch 36721) v5 
 2016-05-11 18:50:00.270710 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0).handle_write started. 
 2016-05-11 18:50:03.907699 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_BANNER pgs=0 cs=0 l=0)._process_connection get banner, ready to send banner 
 2016-05-11 18:50:03.907735 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_IDENTIFY_PEER pgs=0 cs=0 l=0)._process_connection connect write banner done: 128.142.162.80:7064/1743276 
 2016-05-11 18:50:03.907751 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect sent my addr 128.142.162.145:6966/1660664 
 2016-05-11 18:50:03.907797 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect_msg.authorizer_len=174 protocol=2 
 2016-05-11 18:50:03.907808 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection connect sending gseq=3819 cseq=0 proto=10 
 2016-05-11 18:50:03.907864 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).handle_write started. 
 2016-05-11 18:50:04.038005 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0)._process_connection reply.authorizer_len=36 
 2016-05-11 18:50:04.038053 7f53f41ab700    3 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got WAIT (connection race) 
 2016-05-11 18:50:04.038063 7f53f41ab700 10 -- 128.142.162.145:6966/1660664 >> 128.142.162.80:7064/1743276 conn(0x7f541b1c2800 sd=1878 :-1 s=STATE_WAIT pgs=0 cs=0 l=0).handle_write started. 
 </pre> 

 and then we wait indefinitely, since the 'winning' connection doesn't complete. 

Back