Bug #8806
libceph: must use new tid when watch is resent
Description
The following can happen:
- kernel client sends watch request
- it is processed, watch is set up, watch->connect() is called, side effects are considered to be complete
- before watch request op is committed, connection reset handler calls watch->disconnect()
- kernel client re-sends watch request
- it is queued behind the first one as a dup, because the first one hasn't committed yet
- the first watch request op commits
- watch reply is sent to the kernel client, but watch is still disconnected
- kernel client proceeds, until the next notify (e.g. resize) ;)
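To make the sequence above easier to follow, here is a minimal toy model of the race. It is a sketch, not Ceph code: `ToyOSD`, `handle_watch`, `connection_reset`, `commit` and `replay` are hypothetical names, and dup detection is simplified to "a request whose tid is still in flight is treated as a dup and has no side effects".

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Toy stand-in for an OSD handling watch ops (hypothetical, not Ceph code)."""
    watch_connected: bool = False
    in_flight: set = field(default_factory=set)  # tids applied but not yet committed

    def handle_watch(self, tid: int) -> str:
        if tid in self.in_flight:
            # Same tid as an uncommitted op: treated as a dup and queued
            # behind the original, so the watch is NOT reconnected.
            return "dup"
        self.in_flight.add(tid)
        self.watch_connected = True  # models watch->connect()
        return "applied"

    def connection_reset(self) -> None:
        self.watch_connected = False  # models watch->disconnect()

    def commit(self, tid: int) -> str:
        self.in_flight.discard(tid)
        return "reply"  # the watch reply goes out regardless of watch state


def replay(resend_with_new_tid: bool) -> bool:
    """Replay the sequence from the description; return the final watch state."""
    osd = ToyOSD()
    osd.handle_watch(1)                      # client sends watch, tid 1
    osd.connection_reset()                   # reset before the op commits
    resend_tid = 2 if resend_with_new_tid else 1
    osd.handle_watch(resend_tid)             # client re-sends the watch
    osd.commit(1)                            # first op commits, reply is sent
    return osd.watch_connected


if __name__ == "__main__":
    print("reused tid, watch connected:", replay(False))
    print("new tid, watch connected:", replay(True))
```

In this model, reusing the tid leaves the client holding a successful reply for a watch that is disconnected, while a fresh tid causes the resend to be processed as a new op.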
History
#1 Updated by Ilya Dryomov over 9 years ago
This results in hard-to-track failures because of an unfinished TODO ("we give up TODO: we should return an error code") in Notify::do_timeout(), which effectively unblocks clients after a 10-second timeout instead of failing the op (e.g. rbd resize).
#2 Updated by Sage Weil over 9 years ago
- Status changed from New to 12
- Priority changed from Normal to High
#3 Updated by Sage Weil over 9 years ago
- Priority changed from High to Urgent
#4 Updated by Sage Weil over 9 years ago
the bug is with the kernel client: it needs to use a new tid when resending the watch. this was partially fixed on the userspace side with commit:5dd68b95b1d2ac0e4972609ca497d4cff28ef351 (for watch), and notify was fixed with commit:c3107009f66bc06b5e14c465142e14120f9a4412. In short:
- resent watch must use a new unique tid
- resent notify should not use a new tid (although it isn't strictly incorrect if it does)
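The two rules above could be sketched on the client side roughly as follows. This is an illustration only, assuming a simple monotonically increasing tid counter; `ToyClient`, `submit` and `resend` are hypothetical names and do not correspond to libceph or librados functions.

```python
import itertools


class ToyClient:
    """Toy request tracker illustrating the resend rules (hypothetical names)."""

    def __init__(self) -> None:
        self._next_tid = itertools.count(1)
        self.requests = {}  # tid -> op kind ("watch" or "notify")

    def submit(self, kind: str) -> int:
        tid = next(self._next_tid)
        self.requests[tid] = kind
        return tid

    def resend(self, tid: int) -> int:
        """Return the tid the resent request goes out with."""
        kind = self.requests[tid]
        if kind == "watch":
            # A resent watch gets a new unique tid so that it is not
            # caught by dup op detection on the OSD.
            del self.requests[tid]
            return self.submit(kind)
        # A resent notify keeps its original tid.
        return tid


if __name__ == "__main__":
    client = ToyClient()
    watch_tid = client.submit("watch")
    notify_tid = client.submit("notify")
    print("watch:", watch_tid, "->", client.resend(watch_tid))
    print("notify:", notify_tid, "->", client.resend(notify_tid))
```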
#5 Updated by Sage Weil over 9 years ago
meanwhile, the MWatchNotify message now has a return value encoded at the end (s32) when header.version >= 2. See wip-watch-notify.
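A receiver that wants to pick up the new trailing field might do something like the sketch below. It is not the actual MWatchNotify decoder: the function name is hypothetical, the version threshold is passed in as a parameter rather than hard-coded, the field is assumed to be a little-endian signed 32-bit integer occupying the last four bytes of the payload, and older messages are assumed to mean "no error" (0).

```python
import struct


def decode_trailing_return_code(payload: bytes, header_version: int,
                                min_version: int) -> int:
    """Decode a trailing s32 return value if the message version carries one.

    Hypothetical helper; assumes the value is the final 4 bytes of the
    payload, little-endian.
    """
    if header_version < min_version:
        return 0  # assumption: older messages carry no return value
    if len(payload) < 4:
        raise ValueError("payload too short to hold a trailing s32")
    (return_code,) = struct.unpack_from("<i", payload, len(payload) - 4)
    return return_code


if __name__ == "__main__":
    # Arbitrary example body followed by a negative errno-style value.
    msg = b"body" + struct.pack("<i", -110)
    print(decode_trailing_return_code(msg, header_version=2, min_version=2))
```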
#6 Updated by Sage Weil over 9 years ago
- Project changed from Ceph to rbd
- Subject changed from watch requests vs connection resets to libceph: must use new tid when watch is resent
the watch resend needs to use a new tid to avoid the dup op detection in the osd. this is how librbd avoids this problem.
#7 Updated by Sage Weil over 9 years ago
- Assignee set to Ilya Dryomov
#8 Updated by Ilya Dryomov over 9 years ago
- Status changed from 12 to Fix Under Review
wip-watch-tid-8806
#9 Updated by Ilya Dryomov over 9 years ago
My tests confirmed that wip-watch-tid-8806 fixes this particular krbd bug. However, with thrashosds thrown into the mix, both librbd and krbd fail the fsx workload with -ETIMEDOUT from resize, snapshot create, etc. Haven't investigated yet if it's just the thrasher doing something that watch cannot handle or another bug.
#10 Updated by Ilya Dryomov over 9 years ago
- Project changed from rbd to Linux kernel client
#11 Updated by Ilya Dryomov over 9 years ago
- Status changed from Fix Under Review to Resolved