Bug #5626

msg/Pipe.cc: 1012: FAILED assert(m)

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Source: Q/A
Regression: No
Severity: 3 - minor

Description

This is in the reconnect resend-seq code that I just changed.
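
For context, here is a minimal sketch of the kind of resend-by-seq bookkeeping involved. This is a hypothetical illustration with made-up names (requeued, out_seq, resend_after_reconnect), not the actual Pipe.cc code: on reconnect the peer reports the last seq it received, everything up to that seq is discarded, and the seq counters dictate how many messages must still be queued for resend. An assert(m) of this shape fails when the counters and the queue contents disagree.

  // Hypothetical sketch, not the actual Pipe.cc code.
  #include <cassert>
  #include <cstdint>
  #include <deque>

  struct Message {
    uint64_t seq;
  };

  std::deque<Message*> requeued;  // sent-but-unacked messages queued for resend
  uint64_t out_seq = 0;           // seq of the last message handed to the socket

  void resend_after_reconnect(uint64_t acked_seq) {
    // drop everything the peer says it already received
    while (!requeued.empty() && requeued.front()->seq <= acked_seq) {
      delete requeued.front();
      requeued.pop_front();
    }
    // the counters say (out_seq - acked_seq) messages must remain
    for (uint64_t s = acked_seq + 1; s <= out_seq; ++s) {
      Message *m = requeued.empty() ? nullptr : requeued.front();
      assert(m);  // fails if the seq bookkeeping and the queue contents
                  // diverged across the reconnect
      requeued.pop_front();
      // ... rewrite m to the socket ...
    }
  }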

The job was:

ubuntu@teuthology:/a/teuthology-2013-07-14_01:00:14-rados-next-testing-basic/66668$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 365b57b1317524bb0cdd15859a224ba1ab58d1d7
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: df45b167cfe262c46367e812c79e65698804ef5d
  install:
    ceph:
      sha1: df45b167cfe262c46367e812c79e65698804ef5d
  s3tests:
    branch: next
  workunit:
    sha1: df45b167cfe262c46367e812c79e65698804ef5d
roles:
- - mon.0
  - mon.1
  - mon.2
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - had wrong client addr
- workunit:
    clients:
      all:
      - cephtool
      - mon/pool_ops.sh
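
The relevant knob here is ms inject socket failures: 500, which makes the messenger deliberately kill sockets so the fault/reconnect paths get exercised constantly. Roughly, the mechanism amounts to the following (a paraphrase for illustration, not the verbatim SimpleMessenger code):

  // Paraphrased illustration of socket-failure injection: about one in
  // every N socket operations, forcibly shut the socket down so both
  // ends take their fault/reconnect paths.
  #include <cstdlib>
  #include <sys/socket.h>

  void maybe_inject_failure(int sd, unsigned inject_every_n) {
    if (inject_every_n && (std::rand() % inject_every_n == 0)) {
      ::shutdown(sd, SHUT_RDWR);  // both directions, so the peer sees it too
    }
  }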

History

#1 Updated by Sage Weil over 10 years ago

- we are not detecting resets
- down osd is talking to a surviving osd, repeatedly marked down
- at some point, we get a socket reset with different timing and the acceptor (surviving osd) decides to reconnect...

   -59> 2013-07-14 16:55:04.020976 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> :/0 pipe(0x3757280 sd=116 :0 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept
   -58> 2013-07-14 16:55:04.021031 7fb2484dd700  1 -- 10.214.132.39:6809/30786 >> :/0 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept sd=116 10.214.132.39:57539/0
   -57> 2013-07-14 16:55:04.021104 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept peer addr is 10.214.132.39:6801/30785
   -56> 2013-07-14 16:55:04.021177 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept got peer connect_seq 3876 global_seq 3242
   -55> 2013-07-14 16:55:04.021193 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept of host_type 4, policy.lossy=0
   -54> 2013-07-14 16:55:04.021201 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept my proto 10, their proto 10
   -53> 2013-07-14 16:55:04.021274 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept:  setting up session_security.
   -52> 2013-07-14 16:55:04.021290 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=0 pgs=0 cs=0 l=0 c=0x37a3dc0).accept new session
   -51> 2013-07-14 16:55:04.021298 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept success, connect_seq = 3877, sending READY
   -50> 2013-07-14 16:55:04.021306 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept features 34359738367
   -49> 2013-07-14 16:55:04.021319 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).register_pipe
   -48> 2013-07-14 16:55:04.021366 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).discard_requeued_up_to 0
   -47> 2013-07-14 16:55:04.021389 7fb2484dd700 10 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept starting writer, state open
   -46> 2013-07-14 16:55:04.021427 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).accept done
   -45> 2013-07-14 16:55:04.021458 7fb2484dd700 20 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).reader reading tag...
   -44> 2013-07-14 16:55:04.021475 7fb2484dd700  2 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).reader couldn't read tag, Success
   -43> 2013-07-14 16:55:04.021491 7fb2484dd700  2 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).fault 0: Success
   -42> 2013-07-14 16:55:04.021535 7fb2484dd700  0 -- 10.214.132.39:6809/30786 >> 10.214.132.39:6801/30785 pipe(0x3757280 sd=116 :6809 s=2 pgs=3242 cs=3877 l=0 c=0x37a3dc0).fault, initiating reconnect

This reconnect happens despite this code in fault():

  if (policy.standby && !is_queued()) {
    ldout(msgr->cct,0) << "fault with nothing to send, going to standby" << dendl;
    state = STATE_STANDBY;
    return;
  }

But I can't figure out why: the queue should be empty, keepalive should be false, and policy.standby should be true.
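
For reference, the check gating the standby branch amounts to something like this (paraphrased, not the verbatim Pipe code):

  // Paraphrase of the is_queued() check: the standby branch above is
  // skipped whenever either condition holds.
  #include <deque>
  #include <map>

  struct Message;
  std::map<int, std::deque<Message*>> out_q;  // pending messages, by priority
  bool send_keepalive = false;                // keepalive waiting to go out

  bool is_queued() {
    return !out_q.empty()   // includes anything requeued by fault handling itself
        || send_keepalive;
  }

One hedged possibility: if fault handling requeues sent-but-unacked messages before reaching the standby check, out_q is no longer empty even though the caller queued nothing new, and the standby branch is skipped.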

Adding more debug output and reproducing again. I got 2 hits in ~50 attempts in sage-bug-5626-a; this was ubuntu@teuthology:/a/sage-bug-5626-a/66942

#2 Updated by Sage Weil over 10 years ago

  • Status changed from 12 to In Progress

#3 Updated by Sage Weil over 10 years ago

  • Status changed from In Progress to Fix Under Review

#4 Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Resolved
