Bug #1803: msgr: behave better when ending TCP connections - Messengers - Ceph

Bug #1803

closed

msgr: behave better when ending TCP connections

Added by Greg Farnum over 12 years ago. Updated about 5 years ago.

Status: Won't Fix

Priority: Normal

Description

TV tells me that unless we confirm that each side of the connection has called ::shutdown() on the socket, we aren't ending our TCP connections properly. Obviously it can work out okay even so, but we want to be good citizens, and fixing this will likely reduce the edge cases where we need to call mark_disposable() on pipes.
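To make the description concrete, here is a minimal Python sketch of an orderly close over a loopback TCP connection (Python's `socket` module wraps the same `shutdown()`/`close()` calls). Each side half-closes its write direction with `shutdown(SHUT_WR)`, keeps reading until `recv()` returns `b""` (the peer's FIN), and only then calls `close()`. The helper names and message contents are illustrative assumptions, not messenger code.

```python
import socket


def make_tcp_pair():
    """Return the (client, server) ends of a loopback TCP connection."""
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def drain_until_eof(sock):
    """Read until recv() returns b"", i.e. until the peer's FIN arrives."""
    data = b""
    while chunk := sock.recv(4096):
        data += chunk
    return data


def orderly_close_demo():
    client, server = make_tcp_pair()
    for s in (client, server):
        s.settimeout(5.0)  # fail loudly instead of hanging if a FIN never arrives

    # Client is done sending: half-close its write side (this sends a FIN).
    client.sendall(b"last request")
    client.shutdown(socket.SHUT_WR)

    # Server sees the request followed by EOF, so it knows the client is done.
    request = drain_until_eof(server)

    # Server can still send on its own write side, then half-closes as well.
    server.sendall(b"last reply")
    server.shutdown(socket.SHUT_WR)

    # Client's read side was never shut, so it still receives the reply and EOF.
    reply = drain_until_eof(client)

    # Both FINs have been exchanged and both sides drained; nothing is left unread.
    client.close()
    server.close()
    return request, reply


if __name__ == "__main__":
    print(orderly_close_demo())
```

Run as a script, this prints `(b'last request', b'last reply')`: neither side closes until it has seen the other's FIN.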

Updated by Josh Durgin over 12 years ago

Priority changed from Normal to High

This actually caused a deadlock with ffsb on the kernel client: ffsb ended up with 1006 connections in the CLOSING state, and the osd had 1006 in FIN_WAIT2. This made the osd hit its open file descriptor limit of 1024. (The other osd crashed for a different reason.)
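One way to see why stuck connections exhaust descriptors: `shutdown()` only changes the connection's TCP state, while the descriptor stays allocated until `close()`. A socket that has been shut down but never closed therefore still counts toward the process's `RLIMIT_NOFILE`. The following minimal Python sketch (Linux assumed; helper names are hypothetical) checks this with `os.fstat()`.

```python
import os
import socket


def fd_is_open(fd):
    """True if the kernel still has this descriptor allocated to the process."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def descriptor_outlives_shutdown():
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()

    fd = client.fileno()
    client.shutdown(socket.SHUT_RDWR)  # TCP state changes, descriptor does not
    open_after_shutdown = fd_is_open(fd)

    client.close()                     # only close() hands the descriptor back
    open_after_close = fd_is_open(fd)

    server.close()
    return open_after_shutdown, open_after_close


if __name__ == "__main__":
    print(descriptor_outlives_shutdown())
```

It returns `(True, False)`: until `close()` runs, each such connection occupies one of the 1024 slots.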

Updated by Greg Farnum over 12 years ago

Assignee set to Greg Farnum

I'm going to see if I can handle this in userspace today; fixing it in the kernel client will be another ticket.

Updated by Greg Farnum over 12 years ago

Status changed from New to In Progress

From the little I've read in Unix Network Programming, it looks like we're just doing this wrong: we call shutdown(SHUT_RDWR) and then try to read, which never works. And we don't call close() until we get our successful read (or after a timeout, when we call mark_disposable()).
So presumably just fixing that will take care of it.
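A small experiment shows why shutting down both directions and then reading cannot tell us that the peer has finished. In this Python sketch (run on Linux; other platforms may behave differently), `recv()` after `shutdown(SHUT_RDWR)` returns `b""` immediately because our own read side is shut, even though the peer has not shut down or closed anything. The peer sees EOF too, because `SHUT_RDWR` also sent our FIN.

```python
import socket


def read_after_shutdown_rdwr():
    listener = socket.create_server(("127.0.0.1", 0))
    ours = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    listener.close()
    ours.settimeout(5.0)
    peer.settimeout(5.0)

    # The problematic pattern: shut down both directions, then read and hope
    # the result says something about the peer.
    ours.shutdown(socket.SHUT_RDWR)
    our_read = ours.recv(4096)  # b"" at once: caused by our own SHUT_RD

    # The peer has not shut down anything; it only sees our FIN.
    peer_read = peer.recv(4096)

    ours.close()
    peer.close()
    return our_read, peer_read


if __name__ == "__main__":
    print(read_after_shutdown_rdwr())
```

Both reads return `b""`, so the EOF on our side is self-inflicted and says nothing about whether the peer is done.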

Updated by Greg Farnum over 12 years ago

And I've flipped back and forth umpteen times today about what's going on. At this point I can conclude that nobody on our end knows, but probably one of close() or shutdown() is actually discarding the buffer (most likely close()). So the proper fix will involve reworking the messenger so that it first calls shutdown(SHUT_WR), and then calls shutdown(SHUT_RD) only after receiving an EOF from the other side.
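The half-close sequence proposed above might look roughly like this hypothetical `graceful_close()` helper, written in Python for brevity. It sends our FIN with `shutdown(SHUT_WR)`, drains the socket until the peer's EOF (or a timeout, standing in for the messenger's timeout handling), and then calls `close()`. Using `close()` after EOF instead of a separate `shutdown(SHUT_RD)` is an assumption of this sketch: once the read side is finished, `close()` ends it and also frees the descriptor.

```python
import socket
import threading


def graceful_close(sock, timeout=5.0):
    """Hypothetical helper: half-close, wait for the peer's FIN, then close().

    Returns True if the peer's EOF was seen, False if we gave up (timeout or
    connection error). The descriptor is released in every case.
    """
    sock.shutdown(socket.SHUT_WR)  # send our FIN; our read side stays open
    sock.settimeout(timeout)
    saw_eof = False
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:          # b"" means the peer's FIN arrived
                saw_eof = True
                break
            # Late data from the peer is read and discarded so it cannot
            # sit unread in the receive buffer when we close.
    except OSError:                # includes socket.timeout
        pass
    finally:
        sock.close()
    return saw_eof


def _tcp_pair():
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def demo_peer_closes():
    client, server = _tcp_pair()

    def peer():
        # Well-behaved peer: read until our FIN, send a last message, close.
        while server.recv(4096):
            pass
        server.sendall(b"trailing data")
        server.close()

    t = threading.Thread(target=peer)
    t.start()
    result = graceful_close(client, timeout=5.0)
    t.join()
    return result


def demo_peer_never_closes():
    client, server = _tcp_pair()
    result = graceful_close(client, timeout=0.2)  # peer stays silent
    server.close()
    return result
```

`demo_peer_closes()` returns `True`. With a peer that never finishes, `graceful_close()` gives up after the timeout but still frees the descriptor, so `demo_peer_never_closes()` returns `False`.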

Updated by Greg Farnum over 12 years ago

Priority changed from High to Normal

Updated by Greg Farnum over 12 years ago

Status changed from In Progress to New

Updated by Ian Colle about 11 years ago

Assignee deleted (Greg Farnum)

Updated by Loïc Dachary over 9 years ago

Status changed from New to Resolved

I'm not sure when this problem was fixed, but it seems unlikely that it went unnoticed for the past three years.

Updated by Greg Farnum over 9 years ago

Status changed from Resolved to New

This has been greatly improved by the addition of our socket timeouts and related changes, but I don't think it's properly resolved yet. It will get a great deal easier once the messenger no longer dedicates threads to each socket (the current thread<->socket relationship).
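As a rough illustration of why dropping the thread-per-socket design helps, this Python sketch (an assumption about one possible event-driven design, not Ceph's implementation) uses a single thread and the standard `selectors` module to half-close many connections at once and close each one as its peer's EOF arrives. Connections whose peers never finish are closed after a timeout, so no descriptor is leaked either way. `close_all_on_eof()` and `close_all_demo()` are hypothetical names.

```python
import selectors
import socket


def close_all_on_eof(conns, timeout=5.0):
    """One thread half-closes every connection, then closes each one as soon as
    its peer's EOF arrives. Returns how many were closed after a clean EOF."""
    sel = selectors.DefaultSelector()
    for conn in conns:
        conn.setblocking(False)
        conn.shutdown(socket.SHUT_WR)           # send our FIN everywhere
        sel.register(conn, selectors.EVENT_READ)

    clean = 0
    while sel.get_map():
        events = sel.select(timeout)
        if not events:                          # nothing progressed: stop waiting
            break
        for key, _ in events:
            conn = key.fileobj
            try:
                data = conn.recv(4096)
            except BlockingIOError:
                continue                        # spurious wakeup, try again later
            except OSError:
                data = None                     # reset or other error
            if data:                            # late payload: keep draining
                continue
            sel.unregister(conn)
            conn.close()
            if data == b"":                     # the peer's FIN, a clean finish
                clean += 1

    # Peers that never finished: release their descriptors anyway.
    for key in list(sel.get_map().values()):
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return clean


def close_all_demo(n=3, stubborn=0, timeout=5.0):
    """Open n loopback connections; all but `stubborn` peers close promptly."""
    listener = socket.create_server(("127.0.0.1", 0))
    ours, peers = [], []
    for _ in range(n):
        ours.append(socket.create_connection(listener.getsockname()))
        peers.append(listener.accept()[0])
    listener.close()

    for p in peers[stubborn:]:
        p.close()                               # well-behaved peers send FIN
    result = close_all_on_eof(ours, timeout)
    for p in peers[:stubborn]:
        p.close()
    return result
```

`close_all_demo()` returns 3, and `close_all_demo(stubborn=1, timeout=0.2)` returns 2 once the silent connection times out; one thread handles every connection's shutdown without blocking on any single peer.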
