- we are not detecting resets
- down osd is talking to a surviving osd, repeatedly marked down
- at some point, we get a socket reset with different timing adn teh acceptor (surviving osd) decides to reconnect...
-59> 2013-07-14 16:55:04.020976 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> :/0 pipe(0x3757280 sd=116 :0 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept
-58> 2013-07-14 16:55:04.021031 7fb2484dd700 1 -- 10.214.132.39:6809/30786 >> :/0 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept sd=116 10.214.132.39:57539/0
-57> 2013-07-14 16:55:04.021104 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept peer addr is 10.214.132.39:6801/30785
-56> 2013-07-14 16:55:04.021177 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept got peer connect_seq 3876 global_seq 3242
-55> 2013-07-14 16:55:04.021193 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept of host_type 4, policy.lossy=0
-54> 2013-07-14 16:55:04.021201 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept my proto 10, their proto 10
-53> 2013-07-14 16:55:04.021274 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept: setting up session_security.
-52> 2013-07-14 16:55:04.021290 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept new session
-51> 2013-07-14 16:55:04.021298 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept success, connect_seq = 3877, sending READY
-50> 2013-07-14 16:55:04.021306 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept features 34359738367
-49> 2013-07-14 16:55:04.021319 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).register_pipe
-48> 2013-07-14 16:55:04.021366 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).discard_requeued_up_to 0
-47> 2013-07-14 16:55:04.021389 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept starting writer, state open
-46> 2013-07-14 16:55:04.021427 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept done
-45> 2013-07-14 16:55:04.021458 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).reader reading tag...
-44> 2013-07-14 16:55:04.021475 7fb2484dd700 2 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).reader couldn't read tag, Success
-43> 2013-07-14 16:55:04.021491 7fb2484dd700 2 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).fault 0: Success
-42> 2013-07-14 16:55:04.021535 7fb2484dd700 0 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).fault, initiating reconnect
despite this code in fault():
if (policy.standby && !is_queued()) {
ldout(msgr->cct,0) << "fault with nothing to send, going to standby" << dendl;
state = STATE_STANDBY;
return;
}
but i can't figure out why. queue would be empty, keepalive=false, and policy.standby should be true.
adding more debug output and reproducing again. i got 2 hits in ~50 attempts in sage-bug-5626-a. this was ubuntu@teuthology:/a/sage-bug-5626-a/66942