Bug #8806

closed

libceph: must use new tid when watch is resent

Added by Ilya Dryomov almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

The following can happen:

- the kernel client sends a watch request
- the OSD processes it: the watch is set up, watch->connect() is called, and the side effects are considered complete
- before the watch request op is committed, the connection reset handler calls watch->disconnect()
- the kernel client re-sends the watch request
- the resent request queues behind the first one as a dup, because the first one hasn't committed yet
- the first watch request op commits
- the watch reply is sent to the kernel client, but the watch is still disconnected
- the kernel client proceeds until the next notify (e.g. resize) ;)
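
The sequence above can be replayed with a toy model. This is only a sketch under simplifying assumptions: `ToyOSD`, its methods and the `run` helper are hypothetical names invented for illustration, dup detection is reduced to "same tid still in flight", and nothing here mirrors the real OSD code.

```python
class ToyOSD:
    """Hypothetical, heavily simplified model of the behaviour described above."""

    def __init__(self):
        self.in_flight = {}           # tid -> number of dup requests queued behind it
        self.watch_connected = False
        self.replies = []             # tids for which a reply has been sent

    def handle_watch(self, tid):
        if tid in self.in_flight:
            # Dup op: wait behind the original, do not redo its side effects.
            self.in_flight[tid] += 1
            return "dup"
        self.in_flight[tid] = 0
        self.watch_connected = True   # stands in for watch->connect()
        return "applied"

    def connection_reset(self):
        self.watch_connected = False  # stands in for watch->disconnect()

    def commit(self, tid):
        waiters = self.in_flight.pop(tid)
        # One reply for the original op plus one per dup queued behind it.
        self.replies.extend([tid] * (1 + waiters))


def run(resend_with_new_tid):
    osd = ToyOSD()
    osd.handle_watch(tid=1)           # initial watch request
    osd.connection_reset()            # reset arrives before the op commits
    resend_tid = 2 if resend_with_new_tid else 1
    outcome = osd.handle_watch(tid=resend_tid)
    for tid in sorted(osd.in_flight):
        osd.commit(tid)
    return outcome, osd.watch_connected, osd.replies
```

In this model, `run(False)` ends with a reply delivered but the watch still disconnected, while `run(True)` re-applies the watch because the resent request is not treated as a dup.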


Files

ceph-osd.1.log (19.7 KB), Ilya Dryomov, 07/10/2014 10:13 AM
Actions #1

Updated by Ilya Dryomov almost 10 years ago

This results in hard-to-track failures because of an unfinished TODO

we give up TODO: we should return an error code

in Notify::do_timeout(), which effectively unblocks clients after a 10 second timeout instead of failing the op (e.g. rbd resize).
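
A minimal sketch of the difference between silently completing a timed-out notify and failing it. The function name, its parameters and the choice of -ETIMEDOUT as the error code are assumptions made for illustration; the TODO only says that an error code should be returned, not which one.

```python
import errno


def finish_notify(acked_watchers, expected_watchers, report_timeout):
    """Return the result a notifier would see once the notify is finished.

    Hypothetical helper: report_timeout=False models the behaviour described
    above (give up and complete as if nothing went wrong), report_timeout=True
    models returning an error code instead.
    """
    missing = set(expected_watchers) - set(acked_watchers)
    if not missing:
        return 0  # every watcher acked, nothing to report
    # Some watchers never acked before the timeout fired.
    return -errno.ETIMEDOUT if report_timeout else 0
```

With `report_timeout=False` a caller such as rbd resize cannot tell a complete notify from one that timed out, which is what makes the failures hard to track.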

Actions #2

Updated by Sage Weil almost 10 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High
Actions #3

Updated by Sage Weil almost 10 years ago

  • Priority changed from High to Urgent
Actions #4

Updated by Sage Weil over 9 years ago

the bug is with the kernel client: it needs to use a new tid when resending the watch. this was partially fixed on the userspace side with commit:5dd68b95b1d2ac0e4972609ca497d4cff28ef351 (for watch), and notify was fixed with commit:c3107009f66bc06b5e14c465142e14120f9a4412. In short:

- resent watch must use a new unique tid
- resent notify should not use a new tid (although it isn't strictly incorrect if it does)
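
The two rules can be sketched from the client side as follows. `ToyClient` and `Request` are hypothetical names used only for illustration and do not correspond to libceph or librados structures.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Request:
    kind: str  # "watch" or "notify"
    tid: int


class ToyClient:
    """Hypothetical client that hands out monotonically increasing tids."""

    def __init__(self):
        self._tids = itertools.count(1)

    def submit(self, kind):
        return Request(kind, next(self._tids))

    def resend(self, req):
        if req.kind == "watch":
            # A resent watch gets a fresh, unique tid so that it is not
            # mistaken for a dup of the original request.
            return Request(req.kind, next(self._tids))
        # A resent notify keeps its original tid.
        return req
```

For example, after `w = client.submit("watch")` and `n = client.submit("notify")`, `client.resend(w).tid` differs from `w.tid`, while `client.resend(n).tid` equals `n.tid`.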

Actions #5

Updated by Sage Weil over 9 years ago

meanwhile, the MWatchNotify message now has a return value encoded at the end (s32) when header.version >= 2. See wip-watch-notify.
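
A rough sketch of how a receiver might pick up such a trailing value. Everything here is an assumption for illustration: the function name, treating the payload as an opaque byte string with a little-endian s32 appended, and passing the version threshold in as `min_version` rather than hard-coding it. It does not reproduce the actual MWatchNotify layout.

```python
import struct


def split_return_code(payload: bytes, header_version: int, min_version: int):
    """Split a trailing signed 32-bit return code off a message payload.

    Hypothetical helper: older senders (header_version < min_version) do not
    append the field, so the return code defaults to 0 for them.
    """
    if header_version < min_version:
        return payload, 0
    if len(payload) < 4:
        raise ValueError("payload too short to carry a return code")
    (return_code,) = struct.unpack("<i", payload[-4:])
    return payload[:-4], return_code
```

Gating on the header version keeps messages from older senders decodable, since they simply lack the extra four bytes.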

Actions #6

Updated by Sage Weil over 9 years ago

  • Project changed from Ceph to rbd
  • Subject changed from watch requests vs connection resets to libceph: must use new tid when watch is resent

the watch resend needs to use a new tid to avoid the dup op detection in the osd. this is how librbd avoids this problem.

Actions #7

Updated by Sage Weil over 9 years ago

  • Assignee set to Ilya Dryomov
Actions #8

Updated by Ilya Dryomov over 9 years ago

  • Status changed from 12 to Fix Under Review

wip-watch-tid-8806

Actions #9

Updated by Ilya Dryomov over 9 years ago

My tests confirmed that wip-watch-tid-8806 fixes this particular krbd bug. However, with thrashosds thrown into the mix, both librbd and krbd fail the fsx workload with -ETIMEDOUT from resize, snapshot create, etc. Haven't investigated yet if it's just the thrasher doing something that watch cannot handle or another bug.

Actions #10

Updated by Ilya Dryomov over 9 years ago

  • Project changed from rbd to Linux kernel client
Actions #11

Updated by Ilya Dryomov over 9 years ago

  • Status changed from Fix Under Review to Resolved