Bug #2086

msgr: msg/SimpleMessenger.h: 203: FAILED assert(!i->second->is_on_list())

Added by Sage Weil about 12 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2012-02-19T21:14:36.223 INFO:teuthology.task.ceph.mds.a.err:2012-02-19 21:14:36.223372 7fd5f0f04700 mds.0.1 *** got signal Terminated ***
2012-02-19T21:14:36.276 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia553617-5
2012-02-19T21:14:36.329 INFO:teuthology.task.rados.rados.0.out:Writing sepia553617-6 from 504238 to 1107482 tid 2 ranges are [0~54,504238~603244,1576790~527448]
2012-02-19T21:14:36.396 INFO:teuthology.task.ceph.mon.a.err:2012-02-19 21:14:36.393929 7f9fbb04e700 mon.a@0(leader) e1 *** Got Signal Terminated ***
2012-02-19T21:14:36.408 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia553617-3
2012-02-19T21:14:36.439 INFO:teuthology.task.rados.rados.0.out:Writing sepia553617-6 from 1576790 to 2104238 tid 3 ranges are [0~54,504238~603244,1576790~527448]
2012-02-19T21:14:36.440 INFO:teuthology.task.rados.rados.0.out:7: Writing initial 7
2012-02-19T21:14:36.440 INFO:teuthology.task.rados.rados.0.out:waiting_on = 4
2012-02-19T21:14:36.440 INFO:teuthology.task.rados.rados.0.out:Writing sepia553617-7 from 0 to 54 tid 1 ranges are [0~54,696620~575049,1903404~770015,3073654~649760]
2012-02-19T21:14:36.577 INFO:teuthology.task.rados.rados.0.out:Writing sepia553617-7 from 696620 to 1271669 tid 2 ranges are [0~54,696620~575049,1903404~770015,3073654~649760]
2012-02-19T21:14:36.706 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia553617-6
2012-02-19T21:14:36.707 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia553617-6
2012-02-19T21:14:36.732 INFO:teuthology.task.rados.rados.0.out:Writing sepia553617-7 from 1903404 to 2673419 tid 3 ranges are [0~54,696620~575049,1903404~770015,3073654~649760]
2012-02-19T21:14:36.850 INFO:teuthology.task.rados.rados.0.out:Writing sepia553617-7 from 3073654 to 3723414 tid 4 ranges are [0~54,696620~575049,1903404~770015,3073654~649760]
2012-02-19T21:14:36.851 INFO:teuthology.task.rados.rados.0.out:8: Writing initial 8
2012-02-19T21:14:36.851 INFO:teuthology.task.rados.rados.0.out:waiting_on = 4
2012-02-19T21:14:36.851 INFO:teuthology.task.rados.rados.0.out:Writing sepia553617-8 from 0 to 54 tid 1 ranges are [0~54,792874~698277,2167936~688787,3273856~319018]
2012-02-19T21:14:36.866 INFO:teuthology.task.ceph.mon.a.err:msg/SimpleMessenger.h: In function 'virtual SimpleMessenger::Pipe::~Pipe()' thread 7f9fbeeeb780 time 2012-02-19 21:14:36.862302
2012-02-19T21:14:36.866 INFO:teuthology.task.ceph.mon.a.err:msg/SimpleMessenger.h: 203: FAILED assert(!i->second->is_on_list())
2012-02-19T21:14:36.867 INFO:teuthology.task.ceph.mon.a.err: ceph version 0.41-401-g76e88d1 (commit:76e88d10a0e1e08bccec2a6e6393ab72d97e6cdb)
2012-02-19T21:14:36.867 INFO:teuthology.task.ceph.mon.a.err: 1: (SimpleMessenger::Pipe::~Pipe()+0x199) [0x4669b9]
2012-02-19T21:14:36.867 INFO:teuthology.task.ceph.mon.a.err: 2: (SimpleMessenger::~SimpleMessenger()+0x31) [0x5520d1]
2012-02-19T21:14:36.867 INFO:teuthology.task.ceph.mon.a.err: 3: (main()+0x3026) [0x461486]
2012-02-19T21:14:36.867 INFO:teuthology.task.ceph.mon.a.err: 4: (__libc_start_main()+0xfe) [0x7f9fbd28ad8e]
2012-02-19T21:14:36.868 INFO:teuthology.task.ceph.mon.a.err: 5: /tmp/cephtest/binary/usr/local/bin/ceph-mon() [0x45e1f9]
2012-02-19T21:14:36.868 INFO:teuthology.task.ceph.mon.a.err: ceph version 0.41-401-g76e88d1 (commit:76e88d10a0e1e08bccec2a6e6393ab72d97e6cdb)
2012-02-19T21:14:36.868 INFO:teuthology.task.ceph.mon.a.err: 1: (SimpleMessenger::Pipe::~Pipe()+0x199) [0x4669b9]
2012-02-19T21:14:36.868 INFO:teuthology.task.ceph.mon.a.err: 2: (SimpleMessenger::~SimpleMessenger()+0x31) [0x5520d1]
2012-02-19T21:14:36.868 INFO:teuthology.task.ceph.mon.a.err: 3: (main()+0x3026) [0x461486]
2012-02-19T21:14:36.869 INFO:teuthology.task.ceph.mon.a.err: 4: (__libc_start_main()+0xfe) [0x7f9fbd28ad8e]
2012-02-19T21:14:36.869 INFO:teuthology.task.ceph.mon.a.err: 5: /tmp/cephtest/binary/usr/local/bin/ceph-mon() [0x45e1f9]
2012-02-19T21:14:36.869 INFO:teuthology.task.ceph.mon.a.err:terminate called after throwing an instance of 'ceph::FailedAssertion'

Associated revisions

Revision 2437ce02 (diff)
Added by Greg Farnum about 12 years ago

msgr: discard the local_pipe's queue on shutdown.

To facilitate this, we do two things:
1) actually identify the number of special code values we pass around
2) use that to prevent trying to put() those non-pointer values in
Pipe::discard_queue().
Then we just call local_pipe.discard_queue() in wait() like happens
(indirectly, via reaping) with all the normal Pipes in rank_pipe.

But this does make me think that we may be approaching the point
where it's appropriate to create a subclass LocalPipe (against a
RemotePipe like our current Pipe implementation is mostly intended
to be).

Should fix #2086.

Signed-off-by: Greg Farnum <>
Reviewed-by: Sage Weil <>
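
A minimal sketch of the idea in the commit message above, assuming a much simplified model rather than the actual Ceph code. ToyMessage, ToyPipe, the CODE_* values, and run_shutdown_demo() are hypothetical names made up for illustration. The queue holds either real refcounted messages or small integer codes cast to pointers. Knowing the number of codes (CODE_NUM here) lets discard_queue() skip put() on the non-pointer entries, so the local pipe can be drained before it is destroyed.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a refcounted message.
struct ToyMessage {
  static int live;               // number of ToyMessage objects currently alive
  int refs = 1;
  ToyMessage() { ++live; }
  ~ToyMessage() { --live; }
  void put() { if (--refs == 0) delete this; }
};
int ToyMessage::live = 0;

// Hypothetical special codes queued in place of real message pointers.
// CODE_NUM marks the end of the code range.
enum : std::uintptr_t { CODE_CONNECT = 1, CODE_ACCEPT, CODE_RESET, CODE_NUM };

struct ToyPipe {
  std::deque<ToyMessage*> queue;

  void queue_message(ToyMessage* m) { queue.push_back(m); }

  void queue_code(std::uintptr_t code) {
    queue.push_back(reinterpret_cast<ToyMessage*>(code));
  }

  // True if the entry is one of the small code values, not a real pointer.
  static bool is_code(const ToyMessage* m) {
    return reinterpret_cast<std::uintptr_t>(m) < CODE_NUM;
  }

  // Drop everything that is still queued; only real messages get put().
  void discard_queue() {
    for (ToyMessage* m : queue) {
      if (!is_code(m))
        m->put();
    }
    queue.clear();
  }

  // Analogous in spirit to the failed assert: nothing may remain queued.
  ~ToyPipe() { assert(queue.empty()); }
};

// Queue a mix of messages and codes, discard them as a shutdown path would,
// and return how many messages are still alive afterwards.
int run_shutdown_demo() {
  {
    ToyPipe local_pipe;
    local_pipe.queue_message(new ToyMessage);
    local_pipe.queue_code(CODE_CONNECT);
    local_pipe.queue_message(new ToyMessage);
    local_pipe.discard_queue();  // without this call the destructor assert fires
  }
  return ToyMessage::live;
}
```

In this toy model, leaving out the discard_queue() call reproduces the shape of the failure: the destructor runs with entries still queued and the assert trips.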

History

#1 Updated by Greg Farnum about 12 years ago

Are we sure this run included commit:ebbfdefa120ae93b95780c67027ec9efd4b7b5cd?

#2 Updated by Sage Weil about 12 years ago

It did. Probably a race with another thread in connect() or accept() reregistering a new Pipe; most likely connect().

#3 Updated by Greg Farnum about 12 years ago

Guards against something like that shouldn't be too complicated to set up... actually, I thought they were in place at one point...

#4 Updated by Greg Farnum about 12 years ago

Okay, looks like the local_pipe doesn't get its message queue cleared...I'm checking the others and looking at how it should be done.

#5 Updated by Greg Farnum about 12 years ago

  • Status changed from New to In Progress
  • Assignee set to Greg Farnum

#6 Updated by Greg Farnum about 12 years ago

  • Status changed from In Progress to 4

wip-2086 should fix this.

Ran a simple test:

./vstart.sh -n -d
./rados -p data bench 10 write
./init-ceph stop

Everything went fine.

#7 Updated by Greg Farnum about 12 years ago

To be clear, I didn't try to reproduce the actual failure condition that was triggering the assert. That should be possible once all our Messenger testing stuff is done, but it isn't yet, and I don't think coming up with another test is worth the effort for a shutdown bug right now. :)

#8 Updated by Sage Weil about 12 years ago

  • Status changed from 4 to 7

#9 Updated by Greg Farnum about 12 years ago

Sage suggested I could just add a local dispatch to the shutdown or wait functions to test this properly...I did, and it's working. :)
Just needs somebody to review and merge!

#10 Updated by Sage Weil about 12 years ago

  • Status changed from 7 to Resolved

merged!

#11 Updated by Greg Farnum almost 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)
  • Target version deleted (v0.43)
