Bug #4279

mon: received but didn't forward osd boot message

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Joao Eduardo Luis
Category:
Monitor
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a cluster of 6 osds, we get osd boot messages from ~4 osds, but only some of them get forwarded to the leader. This reproduced 2 times out of 200 runs of this job:

nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      osd:
        debug filestore: 20
        debug ms: 1
        debug osd: 20
        debug journal: 20
    fs: xfs
    log-whitelist:
    - slow request
    branch: wip-pglog
roles:
- - mon.a
  - mon.b
  - osd.0
  - osd.1
  - osd.2
- - mon.c
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock: null
- install:
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000

History

#1 Updated by Sage Weil about 11 years ago

  • Priority changed from High to Urgent

I suspect this also happened with run ubuntu@teuthology:/a/samuelj-2013-03-08_16:03:20-regression-wip_omap_snaps-testing-basic/18696

#2 Updated by Joao Eduardo Luis about 11 years ago

These logs appear to be missing. Either that or I'm at a complete loss as to where to find them. teuthology's archive directory doesn't really have anything besides teuthology's own logs and files, and the remote servers where the test ran appear to have been reused/cleaned up. Any ideas where I can find the logs?

#3 Updated by Sage Weil about 11 years ago

Yeah :( I think we need to wait for it to happen again. I tried to reproduce it explicitly with a targeted job and got nothing after 700 iterations.

#4 Updated by Sage Weil about 11 years ago

  • Assignee changed from Joao Eduardo Luis to Sage Weil

#5 Updated by Sage Weil about 11 years ago

  • Assignee changed from Sage Weil to Joao Eduardo Luis

teuthology-2013-03-10_01:00:05-regression-master-testing-gcov/20546
- blogbench workload hung, waiting for osds to be up during ceph_health

#6 Updated by Joao Eduardo Luis about 11 years ago

Those logs appear to have gone AWOL as well, so I've been trying to reproduce this for approx. 24 hours now without any joy.

#7 Updated by Sage Weil about 11 years ago

I'm going to crank up mon logs across all the qa runs so that next time this happens we will have it.

The trick is that the job hangs, so we'll have to be careful about nuking hung runs.
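
For reference, a minimal sketch of what such an override could look like, in the same yaml form as the job in the description (the debug mon / debug paxos / debug ms knobs are the usual suspects; the exact levels used for the qa runs aren't recorded in this ticket):

overrides:
  ceph:
    conf:
      mon:
        debug mon: 20
        debug paxos: 20
        debug ms: 1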

#8 Updated by Sage Weil about 11 years ago

  • Status changed from 12 to In Progress

#9 Updated by Joao Eduardo Luis about 11 years ago

  • Status changed from In Progress to Need More Info

Managed to get a bunch of runs in which one of the monitors didn't forward one osd_boot message.

The thing is, none of those runs hung; they always ended up going about business as usual. So I haven't been able to reproduce this bug yet.

On another note, something caught my eye. All of the runs that ended up not forwarding an osd_boot message followed exactly the same pattern:

- osd.X sends osd_boot to mon.peon1
- mon.peon1 forwards osd_boot to mon.leader
- osd.X sends osd_boot to mon.peon2 two seconds after sending the first osd_boot message
- mon.peon2 drops osd_boot as the osdmap has been updated in the meantime and osd.X is up -- message considered a duplicate

Furthermore, in all the runs where this happened, the osd_boot message dropped by mon.peon2 is always the second osd_boot ever received. That might have something to do with teuthology being really predictable. Nothing appears to break, or be broken, aside from the fact that this is the only osd_boot message sent twice to different monitors in such a short span of time.
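
To make the drop decision above concrete, here is a rough, self-contained C++ sketch of the check a peon effectively applies before relaying a boot message (illustrative only, not the actual OSDMonitor code; the OSDMap and should_forward_boot names are made up for the example): a boot message is only worth forwarding to the leader if the current osdmap doesn't already show that osd as up.

#include <iostream>
#include <set>

// Toy stand-in for a monitor's view of the osdmap.
struct OSDMap {
  int epoch = 0;
  std::set<int> up_osds;                      // osds currently marked up
  bool is_up(int osd) const { return up_osds.count(osd) > 0; }
};

// Returns true if the peon should forward the osd_boot to the leader,
// false if it should drop it as a duplicate (the map already shows the
// osd as up, e.g. because another peon's copy was processed first).
bool should_forward_boot(const OSDMap& osdmap, int booting_osd) {
  return !osdmap.is_up(booting_osd);
}

int main() {
  OSDMap map;                                 // epoch N: osd.3 not up yet
  std::cout << should_forward_boot(map, 3) << "\n";   // 1 -> forward

  map.epoch++;                                // epoch N+1: leader marked osd.3 up
  map.up_osds.insert(3);
  std::cout << should_forward_boot(map, 3) << "\n";   // 0 -> drop as duplicate
  return 0;
}

In the pattern above, mon.peon1's copy reaches the leader first and produces a new map with osd.X up, so by the time mon.peon2 sees the second osd_boot the check fails and the message is dropped as a duplicate.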

#10 Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to Resolved

7aec13f749035b9bef5e398c1ac3d56ceec8eb81 and two follow-on commits.
