libceph: many "socket closed" messages
While trying to reproduce a null pointer problem in the client
messenger code I was running xfstests #049 over RBD devices.
In the process, I'd get easily 10-15 messages like this on the
console of the RBD client for each iteration of test 049:
[ 884.475733] libceph: osd1 10.214.131.33:6800 socket closed
I think this frequent socket closing was in fact what led to
the conditions under which the other problems I was chasing occurred.
But stepping back from that, I don't believe there's any reason
those sockets should be closing, certainly not at that frequency.
These channels use TCP connections, so there should be no data
loss. I think the point of these ceph connections is to have
the connection survive loss of the TCP connection, but that
by no means suggests we should be suffering TCP connection losses
frequently. Only some sort of hardware event should lead to that,
it seems to me.
I believe these messages indicate that the other end of the socket
connection is closing the socket. And it's possible that the
server side of these connections is doing this for a reason. If
that's the case I question a bit of that design. And if it's not
the case then we should find out what's going on.
#1 Updated by Greg Farnum over 11 years ago
The sockets have a default timeout of 15 minutes, after which they will close — the idea being that if the socket is actually in use it won't close, but this shuts it down after a reasonable period of time to prevent unending resource use buildups.
If the test doesn't take that long, then there probably is a problem that we want to fix.
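The idle-timeout policy described above can be illustrated with a minimal sketch. This is purely hypothetical pseudocode to show the intended behavior (the timer resets on every use, and the socket is only reaped after a full idle period); the names and structure are illustrative and do not reflect Ceph's actual messenger implementation.

```python
# Illustrative sketch of an idle-timeout policy for a connection;
# names here are hypothetical, not Ceph's actual code.

IDLE_TIMEOUT = 15 * 60  # seconds; the 15-minute default mentioned above


class IdleConnection:
    def __init__(self, now):
        self.last_used = now
        self.closed = False

    def mark_used(self, now):
        # Any traffic on the socket resets the idle clock,
        # so a connection that is actually in use never times out.
        self.last_used = now

    def reap_if_idle(self, now):
        # Close only if nothing has touched the socket for a full
        # IDLE_TIMEOUT, preventing unbounded resource buildup.
        if not self.closed and now - self.last_used >= IDLE_TIMEOUT:
            self.closed = True
        return self.closed


conn = IdleConnection(now=0)
conn.mark_used(now=600)                 # activity at the 10-minute mark
print(conn.reap_if_idle(now=890))       # 4m50s idle: still open
print(conn.reap_if_idle(now=600 + IDLE_TIMEOUT))  # 15m idle: reaped
```

Under this model, a test that keeps its sockets busy should never see a timeout-driven close, which is why closures every few seconds during an active xfstests run would point to a different cause.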
#5 Updated by Oliver Francke almost 11 years ago
While looking into these annoying messages I became aware of this ticket... we are currently running:
Linux fcmsnode1 3.6.0 #1 SMP Tue Oct 2 09:54:37 CEST 2012 x86_64 GNU/Linux
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
ceph-osd-13.log shows repeatedly:
--- 8-< ---
2012-10-11 10:51:25.878280 7f25c8c83700 0 -- 10.10.10.22:6812/1604 >> 10.10.10.22:0/715655332 pipe(0x6a21800 sd=76 pgs=0 cs=0 l=0).accept peer addr is really 10.10.10.22:0/715655332 (socket is 10.10.10.22:48610/0)
2012-10-11 10:51:25.879084 7f25d08dc700 0 osd.13 1353 pg[6.5( v 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] watch: ctx->obc=0x6381000 cookie=1 oi.version=2301953 ctx->at_version=1353'2567563
2012-10-11 10:51:25.879133 7f25d08dc700 0 osd.13 1353 pg[6.5( v 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] watch: oi.user_version=2301951
It would be great if you could come back to this, even if it means raising the debug level in a production environment.
Thanks in advance,