Actions
Bug #4824
closedmsgr: crash in submit_message
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2013-04-25 13:37:56.821109 7f38faffd700 1 -- 10.214.131.27:6789/0 dq <== mon.0 10.214.131.26:6789/0 29217 ==== route(log(last 47) v1 tid 18608) v2 ==== 256+0+0 (2364127403 0 0) 0x7f38ec032f90 con 0x263dac0 2013-04-25 13:37:56.821124 7f38faffd700 20 mon.a@1(peon) e1 have connection 2013-04-25 13:37:56.821129 7f38faffd700 20 mon.a@1(peon) e1 ms_dispatch existing session MonSession: mon.0 10.214.131.26:6789/0 is openallow * for mon.0 10.214.131.26:6789/0 2013-04-25 13:37:56.821137 7f38faffd700 20 mon.a@1(peon) e1 caps allow * 2013-04-25 13:37:56.821139 7f38faffd700 10 mon.a@1(peon) e1 handle_route log(last 47) v1 to unknown.0 :/0 2013-04-25 13:37:56.821145 7f38faffd700 1 -- 10.214.131.27:6789/0 --> 10.214.131.27:6790/0 -- log(last 47) v1 -- ?+0 0x7f38ec0331d0 con 0x7f38e62de4d0 2013-04-25 13:37:56.821153 7f38faffd700 20 -- 10.214.131.27:6789/0 submit_message log(last 47) v1 remote, 10.214.131.27:6790/0, new pipe. 2013-04-25 13:37:56.821158 7f38faffd700 10 -- 10.214.131.27:6789/0 connect_rank to 10.214.131.27:6790/0, creating pipe and registering 2013-04-25 13:37:56.821221 7f38faffd700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=-1 :0 s=1 pgs=0 cs=0 l=0).register_pipe 2013-04-25 13:37:56.821287 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=-1 :0 s=1 pgs=0 cs=0 l=0).writer: state = connecting policy.server=0 2013-04-25 13:37:56.821309 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=-1 :0 s=1 pgs=0 cs=0 l=0).connect 0 2013-04-25 13:37:56.821342 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :0 s=1 pgs=0 cs=0 l=0).connecting to 10.214.131.27:6790/0 2013-04-25 13:37:56.821737 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect read peer addr 10.214.131.27:6790/0 on socket 6 2013-04-25 13:37:56.821763 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect peer addr for me is 10.214.131.27:39107/0 2013-04-25 13:37:56.821817 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect sent my addr 10.214.131.27:6789/0 2013-04-25 13:37:56.821838 7f38f8bef700 10 mon.a@1(peon) e1 ms_get_authorizer for mon 2013-04-25 13:37:56.821934 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect.authorizer_len=158 protocol=2 2013-04-25 13:37:56.821956 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect sending gseq=443 cseq=0 proto=10 2013-04-25 13:37:56.821991 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect wrote (self +) cseq, waiting for reply 2013-04-25 13:37:56.822308 7f38f9af6700 2 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=2 pgs=426 cs=5 l=0).reader couldn't read tag, Transport endpoint is not connected 2013-04-25 13:37:56.822347 7f38f9af6700 2 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=2 pgs=426 cs=5 l=0).fault 107: Transport endpoint is not connected 2013-04-25 13:37:56.822416 7f38f9af6700 0 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=2 pgs=426 cs=5 l=0).fault with nothing to send, going to standby 2013-04-25 13:37:56.822432 7f38f9af6700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=3 pgs=426 cs=5 l=0).reader sleeping during reconnect|standby 2013-04-25 13:37:56.822488 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect got reply tag 13 connect_seq 1 global_seq 438 proto 10 flags 0 2013-04-25 13:37:56.822506 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).reply.authorizer_len=36 2013-04-25 13:37:56.822541 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).got CEPH_MSGR_TAG_SEQ, reading acked_seq and writing in_seq 2013-04-25 13:37:56.822534 7f38f97f4700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=3 pgs=426 cs=5 l=0).writer: state = standby policy.server=0 2013-04-25 13:37:56.822557 7f38f97f4700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=3 pgs=426 cs=5 l=0).writer sleeping 2013-04-25 13:37:56.822573 7f38f8bef700 15 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).reader got ack seq 0 2013-04-25 13:37:56.822604 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).connect success 1, lossy = 0, features 4294967295 2013-04-25 13:37:56.822623 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).connect starting reader 2013-04-25 13:37:56.822681 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer: state = open policy.server=0 2013-04-25 13:37:56.822697 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer encoding 1 0x7f38ec0331d0 log(last 47) v1 2013-04-25 13:37:56.822741 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer signed seq # 1): sig = 2207261951511203922 2013-04-25 13:37:56.822764 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer sending 1 0x7f38ec0331d0 2013-04-25 13:37:56.822772 7f38f8cf0700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).reader reading tag... 2013-04-25 13:37:56.822802 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer: state = open policy.server=0 2013-04-25 13:37:56.822816 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer sleeping 2013-04-25 13:37:56.823030 7f38faffd700 -1 msg/Pipe.cc: In function 'void Pipe::register_pipe()' thread 7f38faffd700 time 2013-04-25 13:37:56.821234 msg/Pipe.cc: 1086: FAILED assert(existing == __null) ceph version 0.60-657-g488d728 (488d72897493017ae1884cb780846d6386a70fad) 1: (Pipe::register_pipe()+0x2f0) [0x6bea50] 2: (SimpleMessenger::connect_rank(entity_addr_t const&, int, Connection*, Message*)+0x193) [0x639513] 3: (SimpleMessenger::submit_message(Message*, Connection*, entity_addr_t const&, int, bool)+0xcb3) [0x63abd3] 4: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0xaa) [0x63b0ba] 5: (Monitor::handle_route(MRoute*)+0x435) [0x4b87f5] 6: (Monitor::_ms_dispatch(Message*)+0xbbb) [0x4c20db] 7: (Monitor::ms_dispatch(Message*)+0x32) [0x4d8eb2] 8: (DispatchQueue::entry()+0x769) [0x6ae319] 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x63c60d] 10: (()+0x7e9a) [0x7f3903d51e9a] 11: (clone()+0x6d) [0x7f390256eccd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
this route is to a twice-forwarded message.. wip-mon-fwd fixes that, but why it is crashing is still unclear. doing more debugging. the con should already have a pipe, so it is not clear why it is reopening a new Pipe to the peer mon. there have been no faults on it.
this currently reproduced about 50% of the time on the wip-4948-b branch with teh following job
flab:teuthology 01:27 PM $ cat e.yaml kernel: kdb: true branch: testing nuke-on-error: true overrides: ceph: conf: global: ms inject socket failures: 5000 mon: debug ms: 20 debug mon: 20 debug paxos: 20 fs: ext4 log-whitelist: - slow request branch: wip-4748-b s3tests: branch: next workunit: branch: next roles: - - mon.a - mon.c - osd.0 - osd.1 - osd.2 - - mon.b - mds.a - osd.3 - osd.4 - osd.5 - client.0 tasks: - chef: null - clock.check: null - install: null - ceph: log-whitelist: - wrongly marked me down - objects unfound and apparently lost - thrashosds: chance_pgnum_grow: 1 chance_pgpnum_fix: 1 timeout: 1200 - rados: clients: - client.0 objects: 50 op_weights: delete: 50 read: 100 rollback: 50 snap_create: 50 snap_remove: 50 write: 100 ops: 4000
(previously commit 488d72897493017ae1884cb780846d6386a70fad)
Updated by Sage Weil about 11 years ago
- Status changed from In Progress to Resolved
Actions