Bug #1093
Closed
msgr: race condition with replaced pipe's connection_state
Description
When a non-lossy connection is replaced, the messenger sets its connection_state to NULL while holding the pipe_lock. However, this variable is read when the pipe_lock is not held in Pipe::read_message and Pipe::write_message.
I hit this while running 10 OSDs on one disk:
1: (ceph::BackTrace::BackTrace(int)+0x32) [0x8c024a]
2: ./cosd() [0x97b69e]
3: (()+0xef60) [0x7f205711ff60]
4: (Connection::has_feature(int) const+0x18) [0x701f26]
5: (SimpleMessenger::Pipe::write_message(Message*)+0x283) [0x6fbbef]
6: (SimpleMessenger::Pipe::writer()+0x852) [0x6f966e]
7: (SimpleMessenger::Pipe::Writer::entry()+0x21) [0x6e3e4d]
8: (Thread::_entry_func(void*)+0x28) [0x701231]
9: (()+0x68ba) [0x7f20571178ba]
10: (clone()+0x6d) [0x7f2055dac02d]
The relevant log and core file are at vit:/home/joshd/weekend_run/osd.4 and vit:/home/joshd/weekend_run/core.2556.1305416401
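For illustration, the shape of the race described above can be modeled in a few lines. This is a minimal sketch and not the SimpleMessenger code: `PipeModel`, `replace()`, `has_feature_unlocked()` and `has_feature_snapshot()` are hypothetical names, and `std::shared_ptr` stands in for whatever pointer type the messenger really uses. It contrasts a read made without pipe_lock with one that copies the pointer while holding the lock.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the messenger's Connection; only the
// feature check matters for this sketch.
struct Connection {
    uint64_t features = 0;
    bool has_feature(uint64_t f) const { return (features & f) != 0; }
};

// Hypothetical, heavily simplified model of a pipe.
struct PipeModel {
    std::mutex pipe_lock;
    std::shared_ptr<Connection> connection_state;  // meant to be guarded by pipe_lock

    // What the replacing thread does: clear the pointer under the lock.
    void replace() {
        std::lock_guard<std::mutex> l(pipe_lock);
        connection_state.reset();
    }

    // Racy shape: reads connection_state without pipe_lock. If replace()
    // runs first (or concurrently), this dereferences a null pointer.
    bool has_feature_unlocked(uint64_t f) {
        return connection_state->has_feature(f);
    }

    // Safer shape: copy the pointer under the lock, then use the copy.
    // The copy keeps the Connection alive even if replace() runs afterwards.
    bool has_feature_snapshot(uint64_t f) {
        std::shared_ptr<Connection> con;
        {
            std::lock_guard<std::mutex> l(pipe_lock);
            con = connection_state;
        }
        return con && con->has_feature(f);
    }
};
```

In this model, calling `has_feature_unlocked()` after `replace()` crashes in the same place as frame 4 of the backtrace, while `has_feature_snapshot()` simply returns false.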
Updated by Greg Farnum almost 13 years ago
Wow, that's unexpected. If you look at the source you'll notice that the connection_state is referred to in Pipe::writer() right before it calls into write_message. There's a dout in the way, so maybe they got stuck on that lock or something, but if 10 OSDs died on it I wonder if there might be another cause? I ask because I wouldn't expect these things to block until they get to do_sendmsg.
Anyway, I think the proper answer is just to switch to getting the connection state via the message, which has its own reference. Assuming the ordering on these things is right then do_sendmsg will fail because the sd is invalid and will back right out of ::sendmsg.
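A rough sketch of that idea, under stated assumptions: the types below (`PeerConnection`, `QueuedMessage`, `ReplaceablePipe`) and the helper `message_peer_has_feature()` are hypothetical simplifications, and `std::shared_ptr` is used in place of the messenger's own reference counting. The point is only that a message holding its own reference can answer the feature check after the pipe's pointer has been cleared.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for the connection object.
struct PeerConnection {
    uint64_t features = 0;
    bool has_feature(uint64_t f) const { return (features & f) != 0; }
};

// Hypothetical message type: it owns a reference to its connection,
// independent of whatever the pipe currently points at.
struct QueuedMessage {
    std::shared_ptr<PeerConnection> connection;
    const std::shared_ptr<PeerConnection>& get_connection() const {
        return connection;
    }
};

// Hypothetical pipe whose pointer can be cleared on replacement.
struct ReplaceablePipe {
    std::shared_ptr<PeerConnection> connection_state;
    void replace() { connection_state.reset(); }
};

// The write path asks the message, not the pipe, for the connection,
// so a concurrent replace() on the pipe cannot null it out from under us.
bool message_peer_has_feature(const QueuedMessage& m, uint64_t f) {
    const std::shared_ptr<PeerConnection>& con = m.get_connection();
    return con && con->has_feature(f);
}
```

With this shape the feature check no longer depends on pipe_lock at all; whether the subsequent send succeeds is a separate question, handled by the socket error path.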
Updated by Josh Durgin almost 13 years ago
I was unclear: only one of the OSDs died due to this race. Running 10 on one disk just made this kind of race more likely to be hit.
Updated by Sage Weil almost 13 years ago
- Status changed from New to Resolved
commit:73b99163aba7db77aa122eab99780c3d66f0aa91
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to Messengers
- Category deleted (msgr)
- Target version deleted (v0.28)