Bug #15443

closed

msg/async: stopped rx msgs, dup replace race with delay injection

Added by Sage Weil about 8 years ago. Updated about 5 years ago.

Status:
Rejected
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

last rx is

2016-04-09 14:26:51.357822 7f054c4fb700 20 osd.4 38 should_share_map osd.0 172.21.15.56:6801/2840 38
2016-04-09 14:26:51.358343 7f056fdf2700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058b5e3800 sd=71 :6805 s=STATE_OPEN pgs=28 cs=5 l=0). == tx == 0x7f058e20e000 osd_repop_reply(client.4119.0:10916 12.5 ondisk, result = 0) v1
2016-04-09 14:26:52.302707 7f0578603700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058b5e3800 sd=71 :6805 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=5 l=0). == rx == osd.0 seq 357 0x7f058ccc9600 osd_repop(client.4119.0:11326 12.5) v1

and later we see

2016-04-09 14:42:25.942130 7f0578e04700 10 osd.4 51  new session 0x7f058ca330c0 con=0x7f058d8d1800 addr=172.21.15.56:6801/2840
2016-04-09 14:42:25.944237 7f0578e04700  0 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058d8d1800 sd=59 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 6 vs existing csq=5 existing_state=STATE_OPEN
2016-04-09 14:42:25.946455 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058d8d1800 sd=71 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg stop myself to swap existing
2016-04-09 14:42:25.946513 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058d8d1800 sd=71 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0)._stop
2016-04-09 14:57:26.050343 7f0578e04700 10 osd.4 51  new session 0x7f058c3d0560 con=0x7f058c5ec000 addr=172.21.15.56:6801/2840
2016-04-09 14:57:26.052449 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058c5ec000 sd=143 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2016-04-09 14:57:26.056813 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058c5ec000 sd=143 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2016-04-09 14:57:26.061247 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058c5ec000 sd=143 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2016-04-09 14:57:26.065688 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058c5ec000 sd=143 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2016-04-09 14:57:26.070311 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058c5ec000 sd=143 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2016-04-09 14:57:26.074918 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058c5ec000 sd=143 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG
2016-04-09 14:57:26.079469 7f0578e04700  1 -- 172.21.15.13:6805/10365 >> 172.21.15.56:6801/2840 conn(0x7f058c5ec000 sd=143 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg existing racing replace or mark_down happened while replacing. existing_state=STATE_ACCEPTING_WAIT_CONNECT_MSG

/a/sage-2016-04-09_05:29:12-rados-wip-sage-testing-distro-basic-smithi/117990

this is with https://github.com/ceph/ceph/pull/8471
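The replace path visible in the logs (accept connect_seq 6 vs existing csq=5 while the existing connection is STATE_OPEN, followed by repeated "existing racing replace" retries while the existing connection sits in STATE_ACCEPTING_WAIT_CONNECT_MSG) can be condensed into a rough sketch. All names below are paraphrased; this is not the actual AsyncConnection::handle_connect_msg code, only an illustration of the decision it makes.

```cpp
#include <cstdint>

// Illustrative condensation of the accept-side decision in
// AsyncConnection::handle_connect_msg. States and actions are paraphrased,
// not the real Ceph types.
enum class State { Open, AcceptingWaitConnectMsg, Closed };
enum class Action { OpenNew, Replace, WaitRetry };

// has_existing: is there already a connection registered for this peer?
// existing_state / existing_cs: state and connect_seq of that connection.
// peer_cs: connect_seq carried by the incoming connect message.
Action handle_connect(bool has_existing, State existing_state,
                      uint32_t existing_cs, uint32_t peer_cs) {
  if (!has_existing)
    return Action::OpenNew;      // fresh session
  if (existing_state == State::AcceptingWaitConnectMsg)
    return Action::WaitRetry;    // "existing racing replace or mark_down
                                 //  happened while replacing": peer must retry
  if (peer_cs > existing_cs)
    return Action::Replace;      // "accept connect_seq 6 vs existing csq=5"
                                 // -> "stop myself to swap existing"
  return Action::WaitRetry;      // stale connect_seq: tell peer to retry
}
```

In the failure above, the replace at 14:42 apparently never completes: every later connect attempt finds the existing connection still in STATE_ACCEPTING_WAIT_CONNECT_MSG and loops on the retry branch.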

#1

Updated by Haomai Wang about 8 years ago

From the limited logs, this looks like a simple messenger connecting to an async messenger. On the first accept, the async messenger replies to the peer to retry, but no more messages come. From the later connect requests, I'm not sure of the actual reason...

A potential fix is to create a timeout event for the replacing process, though that may not be what we want. I suggest we raise debug ms to 10/10 to see more...
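The timeout event suggested above could be shaped as a guard armed when a replace begins and checked from a periodic timer tick. Every name here is hypothetical; nothing like this exists in the Ceph tree, it only illustrates the shape of the idea.

```cpp
#include <chrono>

// Hypothetical safeguard for a stalled replace: arm a deadline when the
// replace starts, and tear the connection down if it is still mid-replace
// when the deadline passes.
struct ReplaceGuard {
  using clock = std::chrono::steady_clock;
  clock::time_point deadline;
  bool replacing = false;

  void start_replace(std::chrono::milliseconds timeout) {
    replacing = true;
    deadline = clock::now() + timeout;
  }
  void finish_replace() { replacing = false; }

  // Called from a periodic timer; true means the replace stalled and the
  // connection should be failed instead of looping on retries forever.
  bool timed_out(clock::time_point now) const {
    return replacing && now >= deadline;
  }
};
```

This would bound the window in which later connect attempts keep hitting the "racing replace" retry path.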

#2

Updated by Haomai Wang about 8 years ago

http://pulpito.ceph.com/haomai-2016-04-11_05:13:23-rados-wip-haomai-testing---basic-multi
http://pulpito.ceph.com/haomai-2016-04-11_05:35:09-rados-wip-haomai-testing---basic-multi

After merging https://github.com/ceph/ceph/pull/8520, I scheduled two jobs to verify, although this failure may not be related to those specific jobs and it's not easy to reproduce.

#3

Updated by Haomai Wang about 8 years ago

  • Category set to msgr
  • Status changed from New to Need More Info
  • Assignee set to Haomai Wang
  • Priority changed from Urgent to High

I think we can leave this and verify the other bug (the delay entry deadlock) first. Then we can watch for a reproducing job if one appears.

#4

Updated by Haomai Wang almost 8 years ago

  • Status changed from Need More Info to Rejected

#5

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)