Bug #3629

test_mon_workloadgen.cc: 766: FAILED assert(m->fsid == monc.get_fsid())

Added by Sage Weil over 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Joao Eduardo Luis
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2012-12-15T17:11:24.139 INFO:teuthology.task.workunit.client.0.err:     0> 2012-12-15 17:10:52.163988 7fcdb57fa700 -1 test/mon/test_mon_workloadgen.cc: In function 'void OSDStub::handle_osd_map(MOSDMap*)' thread 7fcdb57fa700 time 2012-12-15 17:10:52.159800
2012-12-15T17:11:24.139 INFO:teuthology.task.workunit.client.0.err:test/mon/test_mon_workloadgen.cc: 766: FAILED assert(m->fsid == monc.get_fsid())
2012-12-15T17:11:24.139 INFO:teuthology.task.workunit.client.0.err:
2012-12-15T17:11:24.139 INFO:teuthology.task.workunit.client.0.err: ceph version 0.55-303-g1ec70aa (1ec70aa0dde820f1fd6fdedba2369b841bf6ca7f)
2012-12-15T17:11:24.140 INFO:teuthology.task.workunit.client.0.err: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x7d803d]
2012-12-15T17:11:24.140 INFO:teuthology.task.workunit.client.0.err: 2: (OSDStub::handle_osd_map(MOSDMap*)+0x191) [0x690db1]
2012-12-15T17:11:24.140 INFO:teuthology.task.workunit.client.0.err: 3: (OSDStub::ms_dispatch(Message*)+0x185) [0x6924ff]
2012-12-15T17:11:24.140 INFO:teuthology.task.workunit.client.0.err: 4: (Messenger::ms_deliver_dispatch(Message*)+0x9b) [0x84adcd]
2012-12-15T17:11:24.141 INFO:teuthology.task.workunit.client.0.err: 5: (DispatchQueue::entry()+0x549) [0x84a535]
2012-12-15T17:11:24.141 INFO:teuthology.task.workunit.client.0.err: 6: (DispatchQueue::DispatchThread::entry()+0x1c) [0x7c2330]
2012-12-15T17:11:24.141 INFO:teuthology.task.workunit.client.0.err: 7: (Thread::_entry_func(void*)+0x23) [0x7ca80d]
2012-12-15T17:11:24.141 INFO:teuthology.task.workunit.client.0.err: 8: (()+0x7e9a) [0x7fce06cf1e9a]
2012-12-15T17:11:24.141 INFO:teuthology.task.workunit.client.0.err: 9: (clone()+0x6d) [0x7fce057174bd]
2012-12-15T17:11:24.142 INFO:teuthology.task.workunit.client.0.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

job was
ubuntu@teuthology:/a/sage-2012-12-15_15:52:41-regression-next-testing-basic/15952$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: ec18aeecd4de479601363849d489668d8f12410c
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 1ec70aa0dde820f1fd6fdedba2369b841bf6ca7f
  s3tests:
    branch: next
  workunit:
    sha1: 1ec70aa0dde820f1fd6fdedba2369b841bf6ca7f
roles:
- - mon.a
  - mon.b
  - mon.c
  - osd.0
  - osd.1
  - mds.0
  - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
- ceph-fuse: null
- workunit:
    clients:
      all:
      - mon/workloadgen.sh
    env:
      DURATION: '600'
      LOADGEN_NUM_OSDS: '5'
      TEST_CEPH_CONF: /tmp/cephtest/ceph.conf
      VERBOSE: '1'

Associated revisions

Revision b30ab517 (diff)
Added by Joao Eduardo Luis about 11 years ago

test: mon: workloadgen: assert if monmap's fsid is zero after authenticate

Fixes: #3629

Signed-off-by: Joao Eduardo Luis <>

Revision 3610e72e (diff)
Added by Joao Eduardo Luis about 11 years ago

mon: OSDMonitor: only share osdmap with up OSDs

Try to share the map with a randomly picked OSD; if the picked OSD is
not 'up', then try to find the nearest 'up' OSD in the map by doing a
backward and a forward linear search on the map -- this would be O(n) in
the worst case scenario, as we only do a single iteration starting on the
picked position, incrementing and decrementing two different iterators
until we find an appropriate OSD or we exhaust the map.

Fixes: #3629
Backport: bobtail

Signed-off-by: Joao Eduardo Luis <>
Reviewed-by: Sage Weil <>
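
The search described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `UpMap` type (OSD id mapped to an 'up' flag), the function name `nearest_up_osd`, the -1 "not found" return value, and the choice to check the backward neighbour first are all assumptions made for the example.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical model of the map: OSD id -> whether that OSD is 'up'.
using UpMap = std::map<int, bool>;

// Starting at the randomly picked position, walk one iterator backward and
// one forward until an 'up' OSD is found or both ends are exhausted.
// Every entry is visited at most once, so the worst case is O(n).
// Returns the OSD id, or -1 if no OSD in the map is 'up'.
int nearest_up_osd(const UpMap& osds, UpMap::const_iterator picked) {
    if (picked == osds.end())
        return -1;
    if (picked->second)
        return picked->first;

    UpMap::const_iterator fwd = std::next(picked);  // next entry to check going forward
    UpMap::const_iterator bwd = picked;             // last entry checked going backward

    while (fwd != osds.end() || bwd != osds.begin()) {
        if (bwd != osds.begin()) {
            --bwd;
            if (bwd->second)
                return bwd->first;
        }
        if (fwd != osds.end()) {
            if (fwd->second)
                return fwd->first;
            ++fwd;
        }
    }
    return -1;  // map exhausted without finding an 'up' OSD
}
```

With entries {0: down, 1: up, 2: down, 3: down, 4: up}, picking 2 yields OSD 1 and picking 3 yields OSD 4.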

Revision 95677fc5 (diff)
Added by Joao Eduardo Luis about 11 years ago

mon: OSDMonitor: only share osdmap with up OSDs

Try to share the map with a randomly picked OSD; if the picked OSD is
not 'up', then try to find the nearest 'up' OSD in the map by doing a
backward and a forward linear search on the map -- this would be O(n) in
the worst case scenario, as we only do a single iteration starting on the
picked position, incrementing and decrementing two different iterators
until we find an appropriate OSD or we exhaust the map.

Fixes: #3629
Backport: bobtail

Signed-off-by: Joao Eduardo Luis <>
Reviewed-by: Sage Weil <>
(cherry picked from commit 3610e72e4f9117af712f34a2e12c5e9537a5746f)

History

#1 Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from New to Fix Under Review

Pushed a fix to wip-3629.

After looking into what the OSD does in this case and going through the code, I realized that we may end up receiving MOSDMap messages before we receive the monmap from the monitors. The proposed fix now returns early iff the message's fsid differs from the monclient's monmap fsid AND the monmap's fsid is zero.
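
Roughly, the guard amounts to something like the sketch below. It is not the actual patch in wip-3629: `Fsid` is a hypothetical stand-in for the real fsid type, and `should_process_osd_map` is a name made up for the example. The original assert is kept for the case where the monmap's fsid is already known.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the cluster fsid; only equality and an
// "all zero" check are modelled here.
struct Fsid {
    std::array<std::uint8_t, 16> bytes{};

    bool is_zero() const {
        for (std::uint8_t b : bytes)
            if (b != 0)
                return false;
        return true;
    }
    bool operator==(const Fsid& o) const { return bytes == o.bytes; }
    bool operator!=(const Fsid& o) const { return !(*this == o); }
};

// Returns false when the osdmap should be ignored for now, i.e. its fsid
// differs from the monmap's AND the monmap's fsid is still zero (no monmap
// received yet). Any other mismatch still trips the original assert.
bool should_process_osd_map(const Fsid& msg_fsid, const Fsid& monmap_fsid) {
    if (msg_fsid != monmap_fsid && monmap_fsid.is_zero())
        return false;
    assert(msg_fsid == monmap_fsid);
    return true;
}
```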

Sage, does this make sense to you?

#2 Updated by Joao Eduardo Luis over 11 years ago

  • Category set to Monitor

I've gone through the logs again and again, as well as through the code. The logs only show the last couple hundred log messages, and it appears as if some intermediate messages are missing.

As far as I can tell, the osd stub that triggered the assert didn't receive a monmap from the monitors, and appears to have received an osdmap without sending a MOSDBoot message. The monitor that sent the osdmap appears to have done so shortly after an election, soon after dropping a couple of auth messages. The lack of debug on the monitor's side makes it impossible to match the dropped auth messages to the osd stub. Going by the approximate timestamps, it's fair to assume the stub's auth message was one of the dropped ones, though that match is far from exact.

Still, given that the monclient will block waiting for the authentication to finish before continuing to boot, and given there's no indication of an osd_boot message in the logs, there is not much to act on.

However, I pushed a new patch to wip-3629 that forces the stub to wait for a monmap after authenticating, and asserts if the monmap's fsid is zero. I hate to jump to conclusions when information is scarce, but unlike the remaining stubs in the log, there is no indication that this stub ever received the monmap, so waiting for the monmap seems appropriate.

Suggestions are welcome.

#3 Updated by Ian Colle about 11 years ago

  • Priority changed from High to Normal

#4 Updated by Joao Eduardo Luis about 11 years ago

  • Status changed from Fix Under Review to Resolved
