Bug #21143

bad RESETSESSION between OSDs?

Added by Sage Weil almost 2 years ago. Updated 6 months ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
08/26/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous,mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

osd.5

2017-08-26 06:31:20.549756 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081866b000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got WAIT (connection race)
2017-08-26 06:31:20.549780 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081866b000 :-1 s=STATE_WAIT pgs=0 cs=0 l=0).process enter wait state, failing
2017-08-26 06:31:20.550007 7f08062dc700 10 osd.5 13  new session 0x7f08164c2a00 con=0x7f08186c4800 addr=172.21.15.10:6801/7193
2017-08-26 06:31:20.550061 7f08062dc700  0 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f08186c4800 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2017-08-26 06:31:20.550081 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f08186c4800 :6810 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2017-08-26 06:31:20.550262 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081866b000 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 69
2017-08-26 06:31:20.550283 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081866b000 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2017-08-26 06:31:20.550290 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081866b000 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
2017-08-26 06:31:22.802199 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081869b000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=1 l=0).read_bulk peer close file descriptor 62
2017-08-26 06:31:22.802218 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081869b000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=1 l=0).read_until read failed
2017-08-26 06:31:22.802224 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081869b000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=1 l=0)._process_connection read connect reply failed
2017-08-26 06:31:24.002095 7f07f79c6700 10 osd.5 14 send_incremental_map 13 -> 14 to 0x7f081869b000 172.21.15.10:6801/7193
2017-08-26 06:31:24.002110 7f07f79c6700  1 -- 172.21.15.59:6810/1214 --> 172.21.15.10:6801/7193 -- osd_map(14..14 src has 1..14) v3 -- 0x7f0815c3ef80 con 0
2017-08-26 06:31:24.002139 7f07f79c6700  1 -- 172.21.15.59:6810/1214 --> 172.21.15.10:6801/7193 -- pg_query(1.1 epoch 14) v4 -- 0x7f0818742240 con 0
2017-08-26 06:31:24.002922 7f08062dc700  0 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081869b000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2017-08-26 06:31:24.003300 7f08062dc700 10 osd.5 14  new session (outgoing) 0x7f08186f7400 con=0x7f081869b000 addr=172.21.15.10:6801/7193
2017-08-26 06:31:29.003937 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081869b000 :-1 s=STATE_OPEN pgs=15 cs=1 l=0).read_bulk peer close file descriptor 62
2017-08-26 06:31:29.003960 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081869b000 :-1 s=STATE_OPEN pgs=15 cs=1 l=0).read_until read failed
2017-08-26 06:31:29.003971 7f08062dc700  1 -- 172.21.15.59:6810/1214 >> 172.21.15.10:6801/7193 conn(0x7f081869b000 :-1 s=STATE_OPEN pgs=15 cs=1 l=0).process read tag failed

osd.0
[root@smithi010 sage]# grep 172.21.15.59:6810/1214 /var/log/ceph/ceph-osd.0.log 
2017-08-26 06:31:20.552223 7fa032922700 10 osd.0 13  new session 0x7fa042853e00 con=0x7fa044a50800 addr=172.21.15.59:6810/1214
2017-08-26 06:31:20.552290 7fa032922700  0 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a50800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_RE
2017-08-26 06:31:20.552615 7fa032922700  1 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a50800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_bulk peer close file descriptor 70
2017-08-26 06:31:20.552643 7fa032922700  1 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a50800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).read_until read failed
2017-08-26 06:31:20.552649 7fa032922700  1 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a50800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._process_connection read connect msg failed
2017-08-26 06:31:20.552830 7fa032922700  0 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a2f800 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-08-26 06:31:20.552854 7fa032922700  1 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a2f800 :-1 s=STATE_CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0)._try_send send error: (32) Broken pipe
2017-08-26 06:31:22.804525 7fa032121700 10 osd.0 13  new session 0x7fa042855000 con=0x7fa044a31000 addr=172.21.15.59:6810/1214
2017-08-26 06:31:22.804580 7fa032121700  0 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a31000 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_STANDBY
2017-08-26 06:31:22.804757 7fa032121700  0 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a31000 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).read_until injecting socket failure
2017-08-26 06:31:22.804897 7fa032121700  0 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a31000 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 1 vs existing csq=0 existing_state=STATE_STANDBY
2017-08-26 06:31:22.804917 7fa032121700  1 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a31000 :6801 s=STATE_CLOSED pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2017-08-26 06:31:22.804972 7fa032121700  1 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a2f800 :-1 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0)._try_send send error: (32) Broken pipe
2017-08-26 06:31:24.005324 7fa032922700 10 osd.0 14  new session 0x7fa042853800 con=0x7fa044a95000 addr=172.21.15.59:6810/1214
2017-08-26 06:31:24.005415 7fa032922700  0 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a95000 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2017-08-26 06:31:29.006400 7fa032922700  1 -- 172.21.15.10:6801/7193 >> 172.21.15.59:6810/1214 conn(0x7fa044a95000 :6801 s=STATE_OPEN pgs=15 cs=1 l=0).tick idle(5000859) more than 5000000 us, mark self fault.

The result is that the pg_query is lost and the PG is stuck in the creating state.
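The failure mode in the logs above can be sketched roughly as follows. This is a hypothetical Python model, not actual Ceph messenger code; the class, method, and message names are made up for illustration. The key point is that when the connecting side receives RESETSESSION, it discards its session state, including any messages queued while reconnecting, and nothing retransmits them:

```python
# Hypothetical model of the legacy-messenger RESETSESSION path.
# Not Ceph code: Connection, queue, and handle_connect_reply are
# invented names to illustrate how a queued message can be lost.

class Connection:
    def __init__(self):
        self.connect_seq = 0
        self.out_queue = []   # messages queued while (re)connecting

    def queue(self, msg):
        self.out_queue.append(msg)

    def handle_connect_reply(self, tag):
        # On RESETSESSION, the connecting side resets its session
        # state and drops everything queued on the old session.
        if tag == "RESETSESSION":
            self.connect_seq = 0
            dropped = self.out_queue
            self.out_queue = []
            return dropped
        return []

conn = Connection()
conn.connect_seq = 1                     # we believe the session is live
conn.queue("pg_query(1.1 epoch 14)")     # OSD queues the peering query

# The accepting side already lost its session (existing_state was
# STANDBY), so it answers our cseq=1 attempt with RESETSESSION.
lost = conn.handle_connect_reply("RESETSESSION")
```

In this model `lost` ends up holding the pg_query, which is never resent; that matches the observed symptom of the PG stuck in creating.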

/a/sage-2017-08-25_18:16:29-rados-wip-sage-testing-luminous-20170825a-distro-basic-smithi/1563246


Related issues

Related to Messengers - Bug #22570: out of order caused by letting old msg from down peer be processed to RESETSESSION Resolved 01/05/2018
Related to Messengers - Bug #36612: msg/async: connection stall Pending Backport 10/28/2018

History

#1 Updated by Haomai Wang almost 2 years ago

https://github.com/ceph/ceph/pull/16009

This PR gives a brief explanation of the reason. It's really rare, so I'm not fixing it immediately.

#2 Updated by Haomai Wang almost 2 years ago

  • Assignee set to Haomai Wang

#3 Updated by Yuri Weinstein almost 2 years ago

Haomai Wang wrote:

https://github.com/ceph/ceph/pull/16009

This PR gives a brief explanation of the reason. It's really rare, so I'm not fixing it immediately.

merged

#4 Updated by Kefu Chai almost 2 years ago

@yuri, this PR is not merged. Or did I misunderstand your comment here?

#5 Updated by Sage Weil over 1 year ago

  • Related to Bug #22570: out of order caused by letting old msg from down peer be processed to RESETSESSION added

#6 Updated by Sage Weil over 1 year ago

  • Priority changed from Urgent to High

#7 Updated by Sage Weil 10 months ago

2018-09-19 03:32:13.823 7fcf8a93c700  1 -- 172.21.15.155:6805/141848 --> 172.21.15.119:6802/92161 -- pg_query(1.67 epoch 866) v4 -- 0x55d736867c00 con 0
2018-09-19 03:32:13.823 7fcfa89c9700  1 -- 172.21.15.155:6805/141848 >> 172.21.15.119:6802/92161 conn(0x55d7332ed880 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 48
2018-09-19 03:32:13.823 7fcfa89c9700  1 -- 172.21.15.155:6805/141848 >> 172.21.15.119:6802/92161 conn(0x55d7332ed880 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2018-09-19 03:32:13.823 7fcfa89c9700  1 -- 172.21.15.155:6805/141848 >> 172.21.15.119:6802/92161 conn(0x55d7332ed880 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_1 read connect reply failed
2018-09-19 03:32:13.824 7fcfa89c9700  0 -- 172.21.15.155:6805/141848 >> 172.21.15.119:6802/92161 conn(0x55d7332ed880 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).fault initiating reconnect
2018-09-19 03:32:13.826 7fcfa89c9700  0 -- 172.21.15.155:6805/141848 >> 172.21.15.119:6802/92161 conn(0x55d7332ed880 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION

/a/sage-2018-09-19_02:04:04-rados-wip-sage3-testing-2018-09-18-1746-distro-basic-smithi/3039958

This leads to stalled peering.

#8 Updated by Sage Weil 10 months ago

  • Priority changed from High to Urgent

/a/sage-2018-09-22_02:47:58-rados-master-distro-basic-smithi/3053124

Seeing more of this!

#9 Updated by Neha Ojha 10 months ago

  • Backport set to luminous,mimic

#10 Updated by Sage Weil 6 months ago

  • Status changed from Verified to Duplicate

Dup of #36612, I think.

#11 Updated by Sage Weil 6 months ago

  • Related to Bug #36612: msg/async: connection stall added
