Bug #730

connection resets from kclient

Added by Sage Weil about 13 years ago. Updated about 13 years ago.

Status: Closed
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

On ladder0 I see lots of these in the kernel log:

Jan 20 15:42:10 ladder0 kernel: [698515.268048] libceph:  tid 7931211 timed out on osd1, will reset osd
Jan 20 15:46:40 ladder0 kernel: [698785.444071] libceph:  tid 7940515 timed out on osd0, will reset osd
Jan 20 15:46:40 ladder0 kernel: [698785.451582] libceph: read_partial_message bad seq 7894 expected 2
Jan 20 15:46:40 ladder0 kernel: [698785.457825] libceph: osd0 10.14.0.104:6800 bad crc
Jan 20 15:46:40 ladder0 kernel: [698785.822164] libceph: skipping osd0 10.14.0.104:6800 seq 1 expected 2

Not sure if this is a problem on the client or the OSD end.
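To see whether the timeouts cluster on one OSD, the kernel log can be tallied per OSD. This is a rough sketch only: the regex is inferred from the two "timed out" lines quoted above, and `count_timeouts` is a hypothetical helper name, not part of any Ceph tooling.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this report (an assumption, not a
# documented libceph format): "libceph:  tid <N> timed out on osd<M>".
TIMEOUT_RE = re.compile(r"libceph:\s+tid (\d+) timed out on (osd\d+)")


def count_timeouts(lines):
    """Count 'timed out' messages per OSD name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Sample lines copied from the report above.
sample = [
    "Jan 20 15:42:10 ladder0 kernel: [698515.268048] libceph:  tid 7931211 timed out on osd1, will reset osd",
    "Jan 20 15:46:40 ladder0 kernel: [698785.444071] libceph:  tid 7940515 timed out on osd0, will reset osd",
    "Jan 20 15:46:40 ladder0 kernel: [698785.457825] libceph: osd0 10.14.0.104:6800 bad crc",
]

for osd, n in sorted(count_timeouts(sample).items()):
    print(osd, n)
```

In practice the input would be the full kernel log rather than three sample lines.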

History

#1 Updated by Sage Weil about 13 years ago

On the OSD side, I see:


2011-01-21 15:50:39.711611 7fd68a3fa910 tcp_read_wait got poll flag ERR or HUP or RDHUP or NVAL 8193
2011-01-21 15:50:39.711631 7fd68a3fa910 -- 10.14.0.104:6800/16827 >> 10.14.0.16:0/4039501355 pipe(0x2aed500 sd=24 pgs=5386 cs=1 l=1).reader couldn't read tag, Transport endpoint is not connected
2011-01-21 15:50:39.711650 7fd68a3fa910 -- 10.14.0.104:6800/16827 >> 10.14.0.16:0/4039501355 pipe(0x2aed500 sd=24 pgs=5386 cs=1 l=1).fault 107: Transport endpoint is not connected
2011-01-21 15:50:39.711669 7fd68a3fa910 -- 10.14.0.104:6800/16827 >> 10.14.0.16:0/4039501355 pipe(0x2aed500 sd=24 pgs=5386 cs=1 l=1).fault on lossy channel, failing
2011-01-21 15:50:39.711682 7fd68a3fa910 -- 10.14.0.104:6800/16827 >> 10.14.0.16:0/4039501355 pipe(0x2aed500 sd=24 pgs=5386 cs=1 l=1).fail
2011-01-21 15:50:39.711692 7fd68a3fa910 -- 10.14.0.104:6800/16827 >> 10.14.0.16:0/4039501355 pipe(0x2aed500 sd=24 pgs=5386 cs=1 l=1).stop
2011-01-21 15:50:39.711704 7fd68a3fa910 -- 10.14.0.104:6800/16827 >> 10.14.0.16:0/4039501355 pipe(0x2aed500 sd=24 pgs=5386 cs=1 l=1).discard_queue
2011-01-21 15:50:39.711720 7fd68a3fa910 -- 10.14.0.104:6800/16827 >> 10.14.0.16:0/4039501355 pipe(0x2aed500 sd=24 pgs=5386 cs=1 l=1). dequeued pipe 

The thing is, the client side isn't noticing the disconnect. So either the osd_client reset handler is broken, or the poll() weirdness on the client side is broken.

#2 Updated by Sage Weil about 13 years ago

Sage Weil wrote:

On the OSD side, I see:
[...]
The thing is, the client side isn't noticing the disconnect. So either the osd_client reset handler is broken, or the poll() weirdness on the client side is broken.

That poll flag value, 8193 (0x2001), is POLLIN | POLLRDHUP.
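The decoding of 8193 can be checked with a short sketch. The bit values are the Linux poll(2) constants as defined on x86, hard-coded here so the snippet does not depend on the platform's `select` module; `decode_poll_flags` is a hypothetical helper name.

```python
# Linux poll(2) event bits (x86 values, hard-coded as an assumption so the
# sketch also runs where select.POLLRDHUP is not defined).
POLL_FLAGS = {
    0x0001: "POLLIN",
    0x0002: "POLLPRI",
    0x0004: "POLLOUT",
    0x0008: "POLLERR",
    0x0010: "POLLHUP",
    0x0020: "POLLNVAL",
    0x2000: "POLLRDHUP",
}


def decode_poll_flags(revents):
    """Return the names of the known poll bits set in revents, lowest bit first."""
    return [name for bit, name in sorted(POLL_FLAGS.items()) if revents & bit]


print(decode_poll_flags(8193))  # ['POLLIN', 'POLLRDHUP']
```

So the OSD log line fired on RDHUP (peer closed its end) with readable data pending, not on ERR, HUP, or NVAL.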

#3 Updated by Sage Weil about 13 years ago

  • Target version changed from v0.24.2 to v0.24.3

#4 Updated by Sage Weil about 13 years ago

  • Project changed from Ceph to Linux kernel client
  • Target version deleted (v0.24.3)

#5 Updated by Sage Weil about 13 years ago

  • Target version set to v2.6.38

I'm hoping this is caused by the bad error handling in try_read() and try_write(). I need to do some more testing before sending the fix upstream.

That doesn't (necessarily) explain the CRC errors, though...

#6 Updated by Sage Weil about 13 years ago

  • Status changed from New to Closed

Looking closer, this appears (now at least) to be due to slow btrfs commits on the OSD (e.g. 30-50 seconds), which make the requests time out normally. The connection resets I'm seeing on the server side (now at least) are due to the kclient timing out and resetting.

I'm not sure if that's also what I was originally seeing or not, but now at least things are working properly wrt the messenger and kclient. Moving to a non-ancient kernel on the playground osds to mitigate the hellish commits...
