Bug #1093 (closed)

msgr: race condition with replaced pipe's connection_state

Added by Josh Durgin almost 13 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When a non-lossy connection is replaced, the messenger sets its connection_state to NULL while holding the pipe_lock. However, this variable is read without the pipe_lock held in Pipe::read_message and Pipe::write_message.
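A minimal sketch of the race, assuming simplified stand-in types (the real SimpleMessenger::Pipe uses Ceph's own refcounting rather than std::shared_ptr; names here are illustrative only): the replacing thread clears connection_state under pipe_lock while the writer thread dereferences it with no lock held.

    #include <memory>
    #include <mutex>

    struct Connection {
      bool has_feature(int f) const { return f >= 0; }
    };

    struct Pipe {
      std::mutex pipe_lock;
      std::shared_ptr<Connection> connection_state;

      // Replacing thread: clears the connection while holding pipe_lock.
      void replace() {
        std::lock_guard<std::mutex> l(pipe_lock);
        connection_state.reset();
      }

      // Writer thread: reads connection_state without taking pipe_lock, so it
      // can observe a cleared pointer mid-write and crash, as in the backtrace
      // below.
      void write_message() {
        if (connection_state->has_feature(0)) {  // racy unlocked read
          // ... encode and send the message ...
        }
      }
    };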

I hit this while running 10 osds on one disk:

 1: (ceph::BackTrace::BackTrace(int)+0x32) [0x8c024a]
 2: ./cosd() [0x97b69e]
 3: (()+0xef60) [0x7f205711ff60]
 4: (Connection::has_feature(int) const+0x18) [0x701f26]
 5: (SimpleMessenger::Pipe::write_message(Message*)+0x283) [0x6fbbef]
 6: (SimpleMessenger::Pipe::writer()+0x852) [0x6f966e]
 7: (SimpleMessenger::Pipe::Writer::entry()+0x21) [0x6e3e4d]
 8: (Thread::_entry_func(void*)+0x28) [0x701231]
 9: (()+0x68ba) [0x7f20571178ba]
 10: (clone()+0x6d) [0x7f2055dac02d]

Relevant logs and core files are at vit:/home/joshd/weekend_run/osd.4 and vit:/home/joshd/weekend_run/core.2556.1305416401.

Actions #1

Updated by Greg Farnum almost 13 years ago

Wow, that's unexpected. If you look at the source, you'll notice that the connection_state is referred to in Pipe::writer() right before it calls into write_message. There's a dout in the way, so maybe they got stuck on that lock or something, but if 10 OSDs died on it, I wonder if there might be another cause? I ask because I wouldn't expect these things to block until they get to do_sendmsg.

Anyway, I think the proper answer is just to switch to getting the connection state via the message, which has its own reference. Assuming the ordering on these things is right, do_sendmsg will fail because the sd is invalid and will back right out of ::sendmsg.
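A rough sketch of that direction, with simplified stand-in types (Message, ConnectionRef, and get_connection here are illustrative, not necessarily the exact Ceph API of the time): the writer takes the connection from the message, which pins its own reference, rather than from the pipe's connection_state.

    #include <memory>

    struct Connection {
      bool has_feature(int f) const { return f >= 0; }
    };
    using ConnectionRef = std::shared_ptr<Connection>;

    struct Message {
      ConnectionRef con;                      // the message holds its own reference
      ConnectionRef get_connection() const { return con; }
    };

    // Writer path: use the message's connection rather than the pipe's
    // connection_state, so a concurrent replace() cannot null it out from
    // under us.
    void write_message(Message *m) {
      ConnectionRef con = m->get_connection();
      if (con && con->has_feature(0)) {
        // ... encode with the feature-dependent wire format ...
      }
    }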

Actions #2

Updated by Josh Durgin almost 13 years ago

I was unclear: only one of the OSDs died due to this race. Running 10 on one disk just made this kind of race more likely to be hit.

Actions #3

Updated by Samuel Just almost 13 years ago

  • Target version set to v0.28
Actions #4

Updated by Sage Weil almost 13 years ago

  • Status changed from New to Resolved

commit:73b99163aba7db77aa122eab99780c3d66f0aa91

Actions #5

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)
  • Target version deleted (v0.28)