Feature #10585

Updated by Josh Durgin over 5 years ago

The interface exposed by librados already has everything that
needs to be available to the user, and it includes a description
of most of the rados-level semantics [1]. Most of this work will
be in osd_client, plus a small amount to make rbd use it.

In rbd, opening an image non-readonly causes a watch to be
established on the header object of the image. For historical
reasons, notifications were originally sent with no payload and
any notification on the image header resulted in re-reading all
the mutable image metadata. In userspace this means incrementing
the ImageCtx::refresh_seq counter, which is checked before each
operation to see if the image metadata needs to be reread. When a
watch is lost, the error callback is called and rbd compensates
for possible missed notifications by incrementing refresh_seq to
reread the header before the next operation.
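
The refresh_seq scheme described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the librbd code: only the refresh_seq name comes from the text, while ImageCtxSketch, last_refresh, header_reads and the method names are hypothetical stand-ins.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the image context. Only the refresh_seq
// counter is taken from the description above; the rest is assumed.
struct ImageCtxSketch {
    std::atomic<uint64_t> refresh_seq{0};  // bumped whenever the header may have changed
    uint64_t last_refresh = 0;             // refresh_seq value seen at the last reread
    int header_reads = 0;                  // counts simulated header rereads

    // Any notification on the header object: the payload is ignored.
    void on_header_notify() { ++refresh_seq; }

    // Watch lost: notifications may have been missed, so force a reread.
    void on_watch_error() { ++refresh_seq; }

    // Checked before each operation.
    void check_refresh() {
        uint64_t seq = refresh_seq.load();
        if (seq != last_refresh) {
            ++header_reads;      // stand-in for rereading the mutable metadata
            last_refresh = seq;
        }
    }

    void do_io() {
        check_refresh();
        // the actual I/O would be issued here
    }
};

// Small driver: returns how many header rereads a fixed event sequence causes.
int demo_refresh_count() {
    ImageCtxSketch ictx;
    ictx.do_io();              // nothing changed yet: no reread
    ictx.on_header_notify();
    ictx.on_header_notify();   // two notifications coalesce into one reread
    ictx.do_io();
    ictx.do_io();              // already up to date: no reread
    ictx.on_watch_error();     // a lost watch forces another reread
    ictx.do_io();
    return ictx.header_reads;
}
```

Because the counter is only compared, not consumed, any number of notifications between two operations costs a single reread, and a watch error is handled by exactly the same path as a notification.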

In hammer and beyond the notify payload is used by images with
the exclusive lock feature bit to proxy management operations to
the lock holder, but that's a separate issue. For now the payload
can continue being ignored by krbd, and krbd doesn't need to send
notifications yet.

These details are handled by ImageWatcher in userspace. In
particular, see reregister_watch() for watch error handling [2],
and note that rbd now explicitly acks notifications with
rados_notify_ack().

In terms of the low-level implementation of watch/notify, the
usual MOSDOp message for rados operations is used to
register/unregister watches and send notifications with
watch/notify-specific fields. The client periodically pings the
osds serving its watches to make sure those connections are still
alive [3]. The kernel should already be doing this. What it
doesn't do yet is expose when a watch has an error
and needs to be reregistered, and the watch flush mechanism may
need to change as well. Note that in the userspace analogue of
osd_client, the Objecter, watch/notify are called "linger" ops
for historical reasons. Objecter::handle_watch_notify() takes
care of MWatchNotify [4] messages, which are notifications or
watch errors received from the OSD.
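
As a rough illustration of the client-side handling described above, the sketch below dispatches an incoming watch message either as a notification (refresh, then ack) or as a watch error (re-register, then refresh). Everything in it is an assumption made for illustration: WatchEvent, WatchMsg, LingerWatchSketch and the recorded action strings are hypothetical, the real MWatchNotify opcodes and fields are not reproduced, and the ordering of the steps is one plausible choice rather than what Objecter or ImageWatcher necessarily do.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event kinds standing in for "notification" and "watch error".
enum class WatchEvent { Notify, Disconnect };

// Hypothetical, simplified message; not the real MWatchNotify layout.
struct WatchMsg {
    WatchEvent kind;
    uint64_t cookie;     // identifies the registered watch
    uint64_t notify_id;  // identifies the notification to ack
};

struct LingerWatchSketch {
    uint64_t cookie;
    bool registered;
    std::vector<std::string> actions;  // records what a client would do

    explicit LingerWatchSketch(uint64_t c) : cookie(c), registered(true) {}

    void handle_watch_notify(const WatchMsg& m) {
        if (m.cookie != cookie) {
            return;  // message is for some other watch
        }
        switch (m.kind) {
        case WatchEvent::Notify:
            // Payload is ignored: schedule a header reread, then ack.
            actions.push_back("refresh");
            actions.push_back("ack " + std::to_string(m.notify_id));
            break;
        case WatchEvent::Disconnect:
            // The watch was lost on the OSD side.
            registered = false;
            actions.push_back("watch-error");
            reregister();
            break;
        }
    }

    void reregister() {
        // Re-establish the watch first, then refresh to cover any
        // notifications missed while it was down.
        registered = true;
        actions.push_back("rewatch");
        actions.push_back("refresh");
    }
};

// Small driver: feeds three messages through the handler.
std::vector<std::string> demo_watch_events() {
    LingerWatchSketch w(42);
    w.handle_watch_notify({WatchEvent::Notify, 42, 7});
    w.handle_watch_notify({WatchEvent::Notify, 99, 8});      // wrong cookie, ignored
    w.handle_watch_notify({WatchEvent::Disconnect, 42, 0});
    return w.actions;
}
```

Re-registering before refreshing is the design choice worth noting here: a notification that arrives after the new watch is in place is delivered normally, and anything sent during the gap is covered by the unconditional refresh.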