Project

General

Profile

Actions

Bug #37710

closed

msg/simple: writer stall

Added by Sage Weil over 5 years ago. Updated over 4 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
SimpleMessenger
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

on one end,


2018-12-19 06:16:02.368 7f1253ae5700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).writer: state = open policy.server=0
2018-12-19 06:16:02.368 7f1253ae5700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).write_ack 395
2018-12-19 06:16:02.368 7f1253ae5700  1 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).do_sendmsg error (32) Broken pipe
2018-12-19 06:16:02.368 7f1253ae5700  2 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).writer couldn't write ack, (32) Broken pipe
2018-12-19 06:16:02.368 7f1253ae5700  2 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).fault (32) Broken pipe
2018-12-19 06:16:02.368 7f1253ae5700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).requeue_sent PGlog(1.57 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e192/191) v5 for resend seq 400 (400)
2018-12-19 06:16:02.368 7f1253ae5700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).requeue_sent PGlog(1.21 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e192/191) v5 for resend seq 399 (399)
2018-12-19 06:16:02.368 7f1253ae5700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).requeue_sent PGlog(1.7 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e192/191) v5 for resend seq 398 (398)
2018-12-19 06:16:02.368 7f1253ae5700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).requeue_sent PGlog(1.4 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e192/191) v5 for resend seq 397 (397)
2018-12-19 06:16:02.368 7f1253ae5700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).requeue_sent PGlog(1.4a log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e192/191) v5 for resend seq 396 (396)
2018-12-19 06:16:02.368 7f1253ae5700  0 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :6813 s=2 pgs=9 cs=5 l=0 c=0xf6b2700).fault, initiating reconnect
...
2018-12-19 06:16:02.372 7f1253ae5700  3 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0x103c4800 sd=101 :53422 s=1 pgs=9 cs=6 l=0 c=0xf6b2700).connect got WAIT (connection race)

on the other end (osd.7),

2018-12-19 06:16:02.360 7f3d789e4700 10 -- 172.21.15.65:6814/12885 connect_to 172.21.15.73:6813/13511 existing 0xbdea480
2018-12-19 06:16:02.360 7f3d789e4700  1 -- 172.21.15.65:6814/12885 --> 172.21.15.73:6813/13511 -- pg_info((query:192 sent:192 1.21( empty local-lis/les=191/192 n=0 ec=45/15 lis/c 191/189 les/c/f 192/190/0 191/191/191))=([0,0] intervals=) epoch 192) v5 -- 0x13793840 con 0
2018-12-19 06:16:02.360 7f3d789e4700 20 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).prepare_send_message m pg_info((query:192 sent:192 1.21( empty local-lis/les=191/192 n=0 ec=45/15 lis/c 191/189 les/c/f 192/190/0 191/191/191))=([0,0
] intervals=) epoch 192) v5
2018-12-19 06:16:02.360 7f3d789e4700 20 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).prepare_send_message encoding features 4611087854031142911 0x13793840 pg_info((query:192 sent:192 1.21( empty local-lis/les=191/192 n=0 ec=45/15 lis/
c 191/189 les/c/f 192/190/0 191/191/191))=([0,0] intervals=) epoch 192) v5
2018-12-19 06:16:02.360 7f3d789e4700  5 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).send_message clear encoded buffer previous 4611087854031142911 != 4611087854031142911
2018-12-19 06:16:02.360 7f3d789e4700 15 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).send_message inline write is denied, reschedule m=0x13793840
2018-12-19 06:16:02.360 7f3d97e8f700  4 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).handle_write
2018-12-19 06:16:02.360 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).write_event
2018-12-19 06:16:02.360 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 0 remaining bytes 0

then we see the accept and we send wait:
2018-12-19 06:16:02.368 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> - conn(0x12421600 legacy :6814 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is 172.21.15.73:6813/13511
...
2018-12-19 06:16:02.368 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0x12421600 legacy :6814 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept connection race, existing 0xbdea480.cseq 6 == 6, sending WAIT

but the original connection doesn't work...
2018-12-19 06:16:03.000 7f3d6eabc700 10 -- 172.21.15.65:6814/12885 connect_to 172.21.15.73:6813/13511 existing 0xbdea480
2018-12-19 06:16:03.000 7f3d6f6c8700 10 -- 172.21.15.65:6814/12885 connect_to 172.21.15.73:6813/13511 existing 0xbdea480
2018-12-19 06:16:04.100 7f3d719d6700 10 -- 172.21.15.65:6814/12885 connect_to 172.21.15.73:6813/13511 existing 0xbdea480
2018-12-19 06:16:04.100 7f3d719d6700 10 osd.7 193 send_incremental_map 192 -> 193 to 0xbdea480 172.21.15.73:6813/13511
2018-12-19 06:16:04.100 7f3d719d6700  1 -- 172.21.15.65:6814/12885 --> 172.21.15.73:6813/13511 -- osd_map(193..193 src has 1..193) v4 -- 0x12ffc000 con 0
2018-12-19 06:16:04.100 7f3d719d6700 15 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).send_message inline write is denied, reschedule m=0x12ffc000
2018-12-19 06:16:04.100 7f3d719d6700  1 -- 172.21.15.65:6814/12885 --> 172.21.15.73:6813/13511 -- pg_notify((query:193 sent:193 1.57( empty local-lis/les=162/163 n=0 ec=74/15 lis/c 162/162 les/c/f 163/163/0 191/191/191))=([162,190] intervals=([162,190] acting 3,5,7)) epoch 193) v6 -- 0x172fb0e0 con 0
2018-12-19 06:16:04.100 7f3d97e8f700  4 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).handle_write
2018-12-19 06:16:04.100 7f3d719d6700 20 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).prepare_send_message m pg_notify((query:193 sent:193 1.57( empty local-lis/les=162/163 n=0 ec=74/15 lis/c 162/162 les/c/f 163/163/0 191/191/191))=([162,190] intervals=([162,190] acting 3,5,7)) epoch 193) v6
2018-12-19 06:16:04.100 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).write_event
2018-12-19 06:16:04.100 7f3d719d6700 20 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).prepare_send_message encoding features 4611087854031142911 0x172fb0e0 pg_notify((query:193 sent:193 1.57( empty local-lis/les=162/163 n=0 ec=74/15 lis/c 162/162 les/c/f 163/163/0 191/191/191))=([162,190] intervals=([162,190] acting 3,5,7)) epoch 193) v6
2018-12-19 06:16:04.100 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 0 remaining bytes 0
2018-12-19 06:16:04.100 7f3d719d6700  5 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).send_message clear encoded buffer previous 4611087854031142911 != 4611087854031142911
2018-12-19 06:16:04.100 7f3d719d6700 15 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).send_message inline write is denied, reschedule m=0x172fb0e0
2018-12-19 06:16:04.100 7f3d97e8f700  4 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0).handle_write
2018-12-19 06:16:04.100 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=24 cs=6 l=0).write_event
2018-12-19 06:16:04.100 7f3d97e8f700 10 -- 172.21.15.65:6814/12885 >> 172.21.15.73:6813/13511 conn(0xbdea480 legacy :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 0 remaining bytes 0

which repeats for each message

/a/sage-2018-12-19_03:42:25-rados-wip-sage2-testing-2018-12-18-1335-distro-basic-smithi/3376345

Actions #1

Updated by Sage Weil over 5 years ago

this looks like some sort of deadlock. the thread on osd.3 processing the reconnection:

2018-12-19 06:16:02.364 7f12529d4700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept peer addr is 172.21.15.65:6814/12885
2018-12-19 06:16:02.364 7f12529d4700 20 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept got peer connect_seq 6 global_seq 9
2018-12-19 06:16:02.364 7f12529d4700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept of host_type 4, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0
2018-12-19 06:16:02.364 7f12529d4700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept my proto 10, their proto 10
2018-12-19 06:16:02.364 7f12529d4700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept: challenging authorizer 32 bytes
2018-12-19 06:16:02.364 7f12529d4700 20 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept got peer connect_seq 6 global_seq 9
2018-12-19 06:16:02.364 7f12529d4700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept of host_type 4, policy.lossy=0 policy.server=0 policy.standby=1 policy.resetcheck=0
2018-12-19 06:16:02.364 7f12529d4700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept my proto 10, their proto 10
2018-12-19 06:16:02.364 7f12529d4700 10 osd.3 192 ms_handle_authentication new session 0x14851400 con 0x11d9ac40 entity osd.7 addr 172.21.15.65:6814/12885
2018-12-19 06:16:02.364 7f12529d4700 10 osd.3 192 ms_handle_authentication session 0x14851400 osd.7 has caps osdcap[grant(*)] 'allow *'
2018-12-19 06:16:02.364 7f12529d4700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6814/12885 pipe(0xf502000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0x11d9ac40).accept:  setting up session_security.

and then nothing from the thread or connection id again. we should normally see something like
2018-12-19 06:11:15.124 7f12535e0700 10 -- 172.21.15.73:6813/13511 >> 172.21.15.65:6806/12883 pipe(0xf62a000 sd=99 :6813 s=0 pgs=0 cs=0 l=0 c=0xf6b5d40).accept existing 0xfbea800.gseq 24 <= 32, looks ok
but here we don't see it

possibilities include

msgr->lock.Lock();
pipe_lock.Lock();
existing->pipe_lock.Lock(true);
Actions #2

Updated by Sage Weil about 5 years ago

  • Subject changed from msg/async: writer stall to msg/simple: writer stall
Actions #3

Updated by Sage Weil about 5 years ago

  • Priority changed from Urgent to High
Actions #4

Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers
Actions #5

Updated by Greg Farnum about 5 years ago

  • Category set to SimpleMessenger
Actions #6

Updated by Sage Weil over 4 years ago

  • Status changed from 12 to Won't Fix

simplemessenger removed from code base

Actions

Also available in: Atom PDF