Bug #5626
msg/Pipe.cc: 1012: FAILED assert(m)
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
This is in the reconnect resend-seq code that I just changed.
The job was:
ubuntu@teuthology:/a/teuthology-2013-07-14_01:00:14-rados-next-testing-basic/66668$ cat orig.config.yaml

kernel:
  kdb: true
  sha1: 365b57b1317524bb0cdd15859a224ba1ab58d1d7
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: df45b167cfe262c46367e812c79e65698804ef5d
  install:
    ceph:
      sha1: df45b167cfe262c46367e812c79e65698804ef5d
  s3tests:
    branch: next
  workunit:
    sha1: df45b167cfe262c46367e812c79e65698804ef5d
roles:
- - mon.0
  - mon.1
  - mon.2
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - had wrong client addr
- workunit:
    clients:
      all:
      - cephtool
      - mon/pool_ops.sh
History
#1 Updated by Sage Weil over 10 years ago
- we are not detecting resets
- down osd is talking to a surviving osd, repeatedly marked down
- at some point, we get a socket reset with different timing and the acceptor (surviving osd) decides to reconnect...
-59> 2013-07-14 16:55:04.020976 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> :/0 pipe(0x3757280 sd=116 :0 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept
-58> 2013-07-14 16:55:04.021031 7fb2484dd700 1 -- 10.214.132.39:6809/30786 >> :/0 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept sd=116 10.214.132.39:57539/0
-57> 2013-07-14 16:55:04.021104 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept peer addr is 10.214.132.39:6801/30785
-56> 2013-07-14 16:55:04.021177 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept got peer connect_seq 3876 global_seq 3242
-55> 2013-07-14 16:55:04.021193 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept of host_type 4, policy.lossy=0
-54> 2013-07-14 16:55:04.021201 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept my proto 10, their proto 10
-53> 2013-07-14 16:55:04.021274 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept: setting up session_security.
-52> 2013-07-14 16:55:04.021290 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept new session
-51> 2013-07-14 16:55:04.021298 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept success, connect_seq = 3877, sending READY
-50> 2013-07-14 16:55:04.021306 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept features 34359738367
-49> 2013-07-14 16:55:04.021319 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).register_pipe
-48> 2013-07-14 16:55:04.021366 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).discard_requeued_up_to 0
-47> 2013-07-14 16:55:04.021389 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept starting writer, state open
-46> 2013-07-14 16:55:04.021427 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept done
-45> 2013-07-14 16:55:04.021458 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).reader reading tag...
-44> 2013-07-14 16:55:04.021475 7fb2484dd700 2 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).reader couldn't read tag, Success
-43> 2013-07-14 16:55:04.021491 7fb2484dd700 2 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).fault 0: Success
-42> 2013-07-14 16:55:04.021535 7fb2484dd700 0 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).fault, initiating reconnect
This happens despite this code in fault():

if (policy.standby && !is_queued()) {
  ldout(msgr->cct, 0) << "fault with nothing to send, going to standby" << dendl;
  state = STATE_STANDBY;
  return;
}
But I can't figure out why: the queue should be empty, keepalive should be false, and policy.standby should be true.
Adding more debug output and reproducing again. I got 2 hits in ~50 attempts in sage-bug-5626-a; this was ubuntu@teuthology:/a/sage-bug-5626-a/66942.
#2 Updated by Sage Weil over 10 years ago
- Status changed from 12 to In Progress
#3 Updated by Sage Weil over 10 years ago
- Status changed from In Progress to Fix Under Review
#4 Updated by Sage Weil over 10 years ago
- Status changed from Fix Under Review to Resolved