Project

General

Profile

Actions

Bug #4824

closed

msgr: crash in submit_message

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-04-25 13:37:56.821109 7f38faffd700  1 -- 10.214.131.27:6789/0 dq <== mon.0 10.214.131.26:6789/0 29217 ==== route(log(last 47) v1 tid 18608) v2 ==== 256+0+0 (2364127403 0 0) 0x7f38ec032f90 con 0x263dac0
2013-04-25 13:37:56.821124 7f38faffd700 20 mon.a@1(peon) e1 have connection
2013-04-25 13:37:56.821129 7f38faffd700 20 mon.a@1(peon) e1 ms_dispatch existing session MonSession: mon.0 10.214.131.26:6789/0 is openallow * for mon.0 10.214.131.26:6789/0
2013-04-25 13:37:56.821137 7f38faffd700 20 mon.a@1(peon) e1  caps allow *
2013-04-25 13:37:56.821139 7f38faffd700 10 mon.a@1(peon) e1 handle_route log(last 47) v1 to unknown.0 :/0
2013-04-25 13:37:56.821145 7f38faffd700  1 -- 10.214.131.27:6789/0 --> 10.214.131.27:6790/0 -- log(last 47) v1 -- ?+0 0x7f38ec0331d0 con 0x7f38e62de4d0
2013-04-25 13:37:56.821153 7f38faffd700 20 -- 10.214.131.27:6789/0 submit_message log(last 47) v1 remote, 10.214.131.27:6790/0, new pipe.
2013-04-25 13:37:56.821158 7f38faffd700 10 -- 10.214.131.27:6789/0 connect_rank to 10.214.131.27:6790/0, creating pipe and registering
2013-04-25 13:37:56.821221 7f38faffd700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=-1 :0 s=1 pgs=0 cs=0 l=0).register_pipe
2013-04-25 13:37:56.821287 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=-1 :0 s=1 pgs=0 cs=0 l=0).writer: state = connecting policy.server=0
2013-04-25 13:37:56.821309 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=-1 :0 s=1 pgs=0 cs=0 l=0).connect 0
2013-04-25 13:37:56.821342 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :0 s=1 pgs=0 cs=0 l=0).connecting to 10.214.131.27:6790/0
2013-04-25 13:37:56.821737 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect read peer addr 10.214.131.27:6790/0 on socket 6
2013-04-25 13:37:56.821763 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect peer addr for me is 10.214.131.27:39107/0
2013-04-25 13:37:56.821817 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect sent my addr 10.214.131.27:6789/0
2013-04-25 13:37:56.821838 7f38f8bef700 10 mon.a@1(peon) e1 ms_get_authorizer for mon
2013-04-25 13:37:56.821934 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect.authorizer_len=158 protocol=2
2013-04-25 13:37:56.821956 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect sending gseq=443 cseq=0 proto=10
2013-04-25 13:37:56.821991 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect wrote (self +) cseq, waiting for reply
2013-04-25 13:37:56.822308 7f38f9af6700  2 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=2 pgs=426 cs=5 l=0).reader couldn't read tag, Transport endpoint is not connected
2013-04-25 13:37:56.822347 7f38f9af6700  2 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=2 pgs=426 cs=5 l=0).fault 107: Transport endpoint is not connected
2013-04-25 13:37:56.822416 7f38f9af6700  0 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=2 pgs=426 cs=5 l=0).fault with nothing to send, going to standby
2013-04-25 13:37:56.822432 7f38f9af6700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=3 pgs=426 cs=5 l=0).reader sleeping during reconnect|standby
2013-04-25 13:37:56.822488 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).connect got reply tag 13 connect_seq 1 global_seq 438 proto 10 flags 0
2013-04-25 13:37:56.822506 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).reply.authorizer_len=36
2013-04-25 13:37:56.822541 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).got CEPH_MSGR_TAG_SEQ, reading acked_seq and writing in_seq
2013-04-25 13:37:56.822534 7f38f97f4700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=3 pgs=426 cs=5 l=0).writer: state = standby policy.server=0
2013-04-25 13:37:56.822557 7f38f97f4700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e80038c0 sd=22 :6789 s=3 pgs=426 cs=5 l=0).writer sleeping
2013-04-25 13:37:56.822573 7f38f8bef700 15 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=1 pgs=0 cs=0 l=0).reader got ack seq 0
2013-04-25 13:37:56.822604 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).connect success 1, lossy = 0, features 4294967295
2013-04-25 13:37:56.822623 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).connect starting reader
2013-04-25 13:37:56.822681 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer: state = open policy.server=0
2013-04-25 13:37:56.822697 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer encoding 1 0x7f38ec0331d0 log(last 47) v1
2013-04-25 13:37:56.822741 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer signed seq # 1): sig = 2207261951511203922
2013-04-25 13:37:56.822764 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer sending 1 0x7f38ec0331d0
2013-04-25 13:37:56.822772 7f38f8cf0700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).reader reading tag...
2013-04-25 13:37:56.822802 7f38f8bef700 10 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer: state = open policy.server=0
2013-04-25 13:37:56.822816 7f38f8bef700 20 -- 10.214.131.27:6789/0 >> 10.214.131.27:6790/0 pipe(0x7f38e51794c0 sd=6 :39107 s=2 pgs=438 cs=1 l=0).writer sleeping
2013-04-25 13:37:56.823030 7f38faffd700 -1 msg/Pipe.cc: In function 'void Pipe::register_pipe()' thread 7f38faffd700 time 2013-04-25 13:37:56.821234
msg/Pipe.cc: 1086: FAILED assert(existing == __null)

 ceph version 0.60-657-g488d728 (488d72897493017ae1884cb780846d6386a70fad)
 1: (Pipe::register_pipe()+0x2f0) [0x6bea50]
 2: (SimpleMessenger::connect_rank(entity_addr_t const&, int, Connection*, Message*)+0x193) [0x639513]
 3: (SimpleMessenger::submit_message(Message*, Connection*, entity_addr_t const&, int, bool)+0xcb3) [0x63abd3]
 4: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0xaa) [0x63b0ba]
 5: (Monitor::handle_route(MRoute*)+0x435) [0x4b87f5]
 6: (Monitor::_ms_dispatch(Message*)+0xbbb) [0x4c20db]
 7: (Monitor::ms_dispatch(Message*)+0x32) [0x4d8eb2]
 8: (DispatchQueue::entry()+0x769) [0x6ae319]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x63c60d]
 10: (()+0x7e9a) [0x7f3903d51e9a]
 11: (clone()+0x6d) [0x7f390256eccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


this route is to a twice-forwarded message.. wip-mon-fwd fixes that, but why it is crashing is still unclear. doing more debugging. the con should already have a pipe, so it is not clear why it is reopening a new Pipe to the peer mon. there have been no faults on it.

this currently reproduced about 50% of the time on the wip-4948-b branch with teh following job

flab:teuthology 01:27 PM $ cat e.yaml 
kernel:
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug ms: 20
        debug mon: 20
        debug paxos: 20
    fs: ext4
    log-whitelist:
    - slow request
    branch: wip-4748-b
  s3tests:
    branch: next
  workunit:
    branch: next
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000


(previously commit 488d72897493017ae1884cb780846d6386a70fad)

Actions #1

Updated by Sage Weil almost 11 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF