Project

General

Profile

Actions

Bug #19312

closed

ceph_test_rados_api_watch_notify --gtest_filter=LibRadosWatchNotify.Watch2Timeout fails due to reconnection

Added by Kefu Chai about 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

see the failed tests at http://pulpito.ceph.com/kchai-2017-03-17_04:29:25-rados-wip-kefu-testing---basic-mira/

with the override of "ms tcp read timeout = 5", the client disconnects after idling for 5 seconds, and reconnects to the osd, but it sends the watch request which resets the timeout for disconnecting the watch.

see the client side log:

2017-03-20 15:17:50.407660 7fd29b7fe700  1 -- 127.0.0.1:0/3018425931 >> 127.0.0.1:6789/0 conn(0x7fd290030710 :-1 s=STATE_CLOSED pgs=157 cs=1 l=1).mark_down
2017-03-20 15:17:50.407711 7fd29b7fe700  1 -- 127.0.0.1:0/3018425931 --> 127.0.0.1:6790/0 -- mon_subscribe({mgrmap=0+,monmap=3+,osdmap=761}) v2 -- 0x7fd29001b460 con 0
2017-03-20 15:17:50.408281 7fd29b7fe700  1 -- 127.0.0.1:0/3018425931 <== mon.1 127.0.0.1:6790/0 4 ==== mgrmap(e 9) v1 ==== 63+0+0 (4036315079 0 0) 0x7fd29c003c20 con 0x7fd290037c80
2017-03-20 15:17:50.613675 7fd2b16f1700  1 -- 127.0.0.1:0/3018425931 >> 127.0.0.1:6812/34685 conn(0x7fd29000c140 :-1 s=STATE_OPEN pgs=782 cs=72 l=0).tick idle(5000024) more than 5000000 us, mark self fault.
2017-03-20 15:17:50.613736 7fd2b16f1700  0 -- 127.0.0.1:0/3018425931 >> 127.0.0.1:6812/34685 conn(0x7fd29000c140 :-1 s=STATE_OPEN pgs=782 cs=72 l=0).fault initiating reconnect
2017-03-20 15:17:51.284100 7fd2b0ef0700  1 -- 127.0.0.1:0/3018425931 >> 127.0.0.1:6804/34641 conn(0x7fd29003ae20 :-1 s=STATE_OPEN pgs=162 cs=1 l=1).tick idle(9776047) more than 5000000 us, mark self fault.
2017-03-20 15:17:51.284119 7fd2b0ef0700  1 -- 127.0.0.1:0/3018425931 >> 127.0.0.1:6804/34641 conn(0x7fd29003ae20 :-1 s=STATE_OPEN pgs=162 cs=1 l=1).fault on lossy channel, failing
2017-03-20 15:17:51.284477 7fd29b7fe700  1 -- 127.0.0.1:0/3018425931 >> 127.0.0.1:6804/34641 conn(0x7fd29003ae20 :-1 s=STATE_CLOSED pgs=162 cs=1 l=1).mark_down
2017-03-20 15:17:51.284738 7fd29b7fe700  1 -- 127.0.0.1:0/3018425931 --> 127.0.0.1:6804/34641 -- osd_op(unknown.0.0:20 193.7 193:ff9d772f:test-rados-api-gen8-41028-2::foo:head [watch reconnect cookie 94846125621760 gen 17] snapc 0=[] ondisk+write+known_if_redirected e760) v8 -- 0x7fd29001d3d0 con 0
2017-03-20 15:17:51.505017 7fd2b0ef0700  1 -- 127.0.0.1:0/3018425931 <== osd.1 127.0.0.1:6804/34641 1 ==== osd_op_reply(20 foo [watch reconnect cookie 94846125621760 gen 17] v760'19 uv1 ondisk = 0) v7 ==== 123+0+0 (1061418294 0 0) 0x7fd294001de0 con 0x7fd29004b850
G2017-03-20 15:17:55.396126 7fd29bfff700  1 -- 127.0.0.1:0/3018425931 --> 127.0.0.1:6804/34641 -- ping magic: 0 v1 -- 0x7fd2740017b0 con 0
2017-03-20 15:17:55.613861 7fd2b16f1700  1 -- 127.0.0.1:0/3018425931 >> 127.0.0.1:6812/34685 conn(0x7fd29000c140 :-1 s=STATE_OPEN pgs=788 cs=74 l=0).read_bulk peer close file descriptor 17

Actions #1

Updated by Kefu Chai about 7 years ago

reproduciable on master. i have not bisect'ed it.

Actions #2

Updated by Kefu Chai about 7 years ago

  • Status changed from New to In Progress
  • Assignee set to Kefu Chai
Actions #3

Updated by Kefu Chai about 7 years ago

  • Status changed from In Progress to Fix Under Review
Actions #4

Updated by Kefu Chai about 7 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF