Bug #1093 (closed)

msgr: race condition with replaced pipe's connection_state

Added by Josh Durgin almost 13 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When a non-lossy connection is replaced, the messenger sets its connection_state to NULL while holding the pipe_lock. However, this variable is read without the pipe_lock held in Pipe::read_message and Pipe::write_message.
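A minimal sketch of the race, assuming simplified stand-in types (the real SimpleMessenger::Pipe uses Ceph's own refcounting rather than std::shared_ptr; names here are illustrative only): the replacing thread clears connection_state under pipe_lock while the writer thread dereferences it with no lock held.

    #include <memory>
    #include <mutex>

    struct Connection {
      bool has_feature(int f) const { return f >= 0; }
    };

    struct Pipe {
      std::mutex pipe_lock;
      std::shared_ptr<Connection> connection_state;

      // Replacing thread: clears the connection while holding pipe_lock.
      void replace() {
        std::lock_guard<std::mutex> l(pipe_lock);
        connection_state.reset();
      }

      // Writer thread: reads connection_state without taking pipe_lock, so it
      // can observe a cleared pointer mid-write and crash, as in the backtrace
      // below.
      void write_message() {
        if (connection_state->has_feature(0)) {  // racy unlocked read
          // ... encode and send the message ...
        }
      }
    };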

I hit this while running 10 osds on one disk:

 1: (ceph::BackTrace::BackTrace(int)+0x32) [0x8c024a]
 2: ./cosd() [0x97b69e]
 3: (()+0xef60) [0x7f205711ff60]
 4: (Connection::has_feature(int) const+0x18) [0x701f26]
 5: (SimpleMessenger::Pipe::write_message(Message*)+0x283) [0x6fbbef]
 6: (SimpleMessenger::Pipe::writer()+0x852) [0x6f966e]
 7: (SimpleMessenger::Pipe::Writer::entry()+0x21) [0x6e3e4d]
 8: (Thread::_entry_func(void*)+0x28) [0x701231]
 9: (()+0x68ba) [0x7f20571178ba]
 10: (clone()+0x6d) [0x7f2055dac02d]

Relevant logs and core files are at vit:/home/joshd/weekend_run/osd.4 and vit:/home/joshd/weekend_run/core.2556.1305416401.

Actions #1

Updated by Greg Farnum almost 13 years ago

Wow, that's unexpected. If you look at the source, you'll notice that the connection_state is referred to in Pipe::writer() right before it calls into write_message. There's a dout in the way, so maybe they got stuck on that lock or something, but if 10 OSDs died on it, I wonder if there might be another cause? I ask because I wouldn't expect these things to block until they get to do_sendmsg.

Anyway, I think the proper answer is just to switch to getting the connection state via the message, which has its own reference. Assuming the ordering on these things is right, do_sendmsg will fail because the sd is invalid and will back right out of ::sendmsg.
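A rough sketch of that direction, with simplified stand-in types (Message, ConnectionRef, and get_connection here are illustrative, not necessarily the exact Ceph API of the time): the writer takes the connection from the message, which pins its own reference, rather than from the pipe's connection_state.

    #include <memory>

    struct Connection {
      bool has_feature(int f) const { return f >= 0; }
    };
    using ConnectionRef = std::shared_ptr<Connection>;

    struct Message {
      ConnectionRef con;                      // the message holds its own reference
      ConnectionRef get_connection() const { return con; }
    };

    // Writer path: use the message's connection rather than the pipe's
    // connection_state, so a concurrent replace() cannot null it out from
    // under us.
    void write_message(Message *m) {
      ConnectionRef con = m->get_connection();
      if (con && con->has_feature(0)) {
        // ... encode with the feature-dependent wire format ...
      }
    }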

Actions #2

Updated by Josh Durgin almost 13 years ago

I was unclear: only one of the OSDs died due to this race. Running 10 on one disk just made this kind of race more likely to be hit.

Actions #3

Updated by Samuel Just almost 13 years ago

  • Target version set to v0.28
Actions #4

Updated by Sage Weil almost 13 years ago

  • Status changed from New to Resolved

commit:73b99163aba7db77aa122eab99780c3d66f0aa91

Actions #5

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)
  • Target version deleted (v0.28)