Bug #2796
osd: watch state not reestablished when registration op resent
Description
If the client doesn't get the watch ack and resends, the OSD will ignore the resent op as a dup, and the watch session state is not reestablished.
Associated revisions
objecter: always resend linger registrations
If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request. The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch. This in turn will break the watch (i.e., notifies won't
get delivered).
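
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyOsd`, `ReqId`, `handle_watch`, `drop_sessions` and `can_notify` are hypothetical names, not Ceph APIs, and the model assumes that the in-memory session state is lost when the connection resets while the record of applied ops survives.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical toy model; none of these names are taken from the Ceph source.
using ReqId = std::pair<uint64_t, uint64_t>;  // (client id, tid)

struct ToyOsd {
    std::set<ReqId> applied;           // survives reconnects: reqids already applied
    std::set<uint64_t> live_watchers;  // in-memory only: clients with a live watch session

    // Simulates a connection reset: session state is lost, the op record is not.
    void drop_sessions() { live_watchers.clear(); }

    // Returns true if the op performed a full registration,
    // false if it was treated as a dup.
    bool handle_watch(const ReqId& reqid) {
        if (applied.count(reqid) > 0) {
            return false;  // dup: session state is NOT set up again
        }
        applied.insert(reqid);
        live_watchers.insert(reqid.first);
        return true;
    }

    // A notify can only be delivered to a client with live session state.
    bool can_notify(uint64_t client) const {
        return live_watchers.count(client) > 0;
    }
};
```

In this model, sending `{1, 10}`, calling `drop_sessions()`, and resending `{1, 10}` leaves client 1 without a live watch, whereas resending under a new tid such as `{1, 11}` registers the session again.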
Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.
- track the tid of the registration op for each LingerOp
- mark registration ops as should_resend=false; cancel as needed
- when we send a new registration op, cancel the old one to ensure we
  ignore the reply. This is needed because we resend linger ops on any
  pg change, not just a primary change.
- drop the first_send arg to send_linger(), as we can now infer that
  from register_tid == 0.
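
The client-side bookkeeping in the list above can be sketched roughly as follows. `LingerOp`, `register_tid`, `should_resend` and `send_linger` are names from the commit message; `ToyObjecter`, `Op`, `in_flight`, `last_tid` and `handle_reply` are hypothetical, and this is a simplified illustration rather than the actual Objecter code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model of the client side; not the real Objecter implementation.
struct Op {
    bool should_resend;  // false: the generic resend path skips this op
};

struct LingerOp {
    uint64_t register_tid = 0;  // 0 means no registration op has been sent yet
};

struct ToyObjecter {
    uint64_t last_tid = 0;
    std::map<uint64_t, Op> in_flight;  // tid -> outstanding op

    // Sends a fresh registration op for the linger and returns its tid.
    uint64_t send_linger(LingerOp& linger) {
        if (linger.register_tid != 0) {
            // Not the first send: cancel the previous registration op so
            // that a late reply to it is ignored.
            in_flight.erase(linger.register_tid);
        }
        uint64_t tid = ++last_tid;   // new tid, hence a unique reqid
        in_flight[tid] = Op{false};  // registration ops are never blindly resent
        linger.register_tid = tid;
        return tid;
    }

    // Returns true if the reply matches an outstanding op, false if ignored.
    bool handle_reply(uint64_t tid) {
        return in_flight.erase(tid) > 0;
    }
};
```

Calling `send_linger()` twice on the same `LingerOp` yields two different tids; a reply to the first is then ignored, and only the reply to the second completes the registration.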
The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.
Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
History
#1 Updated by Sage Weil about 11 years ago
- Status changed from New to Fix Under Review
- Assignee deleted (Sage Weil)
#2 Updated by Sage Weil about 11 years ago
- Backport set to argonaut
#3 Updated by Sage Weil about 11 years ago
- Target version set to v0.49
#4 Updated by Sage Weil about 11 years ago
- Status changed from Fix Under Review to 7
#5 Updated by Sage Weil about 11 years ago
- Status changed from 7 to Resolved