Feature #57109
openwindows: rbd-wnbd SCSI persistent reservations
% Done:
0%
Source:
Tags:
Backport:
Description
rbd-wnbd Persistent Reservation support
=======================================

rbd-wnbd allows exposing RBD images as virtual SCSI disks to Windows hosts. Our goal is to support the SCSI Persistent Reservation feature [1], which is commonly leveraged by Active-Active clusters. Specifically, we aim to accommodate Microsoft Failover Clusters [2] and Cluster Shared Volumes [3].

Use cases
=========

1. Hyper-V clustering [4] transparently moves virtual machines to other hosts when detecting node failures. It can also perform load balancing, moving VMs to underutilized hosts [5][6]. Cluster Shared Volumes (CSV) are commonly used to store the clustered virtual machine disk images (VHD/x files). However, Cluster Shared Volumes require SCSI Persistent Reservation support.
2. CSVs can also back Scale-Out File Server shares (highly available SMB shares) [7].

The above mentioned features are heavily used by Hyper-V deployments.

Proposed change
===============

We propose adding rbd-wnbd support for SCSI Persistent Reservations. The WNBD driver has already been updated to forward PR commands to userspace daemons such as rbd-wnbd, which in turn will have to:

* handle SCSI PERSISTENT RESERVE IN commands, used to retrieve the current reservations
* handle SCSI PERSISTENT RESERVE OUT commands, used to modify the reservations
* return a reservation conflict for IO operations that are forbidden by the current reservations

WNBD and MSFC expect SCSI SPC-3 semantics.
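As a rough illustration of the command handling described above, a minimal in-memory model of SPC-3 style registrations, reservations and IO validation might look like the following. This is a sketch only: the class and method names are hypothetical, the reservation-type semantics are simplified, and the real rbd-wnbd implementation would persist this state in RADOS rather than in process memory.

```python
# Hypothetical in-memory model of SPC-3 style persistent reservations.
# Names and structure are illustrative; the real implementation stores
# this state in object xattrs and supports the full set of PR OUT actions.

class ReservationConflict(Exception):
    """Raised when a command or IO is forbidden by the current reservations."""

class PersistentReservations:
    def __init__(self):
        self.registrations = {}   # initiator id (hostname) -> registered key
        self.reservation = None   # (initiator, type) of the current holder
        self.generation = 0       # PR generation counter

    def register(self, initiator, key):
        # PERSISTENT RESERVE OUT / REGISTER: record the initiator's key.
        self.registrations[initiator] = key
        self.generation += 1      # REGISTER increments the PR generation

    def reserve(self, initiator, key, res_type):
        # PERSISTENT RESERVE OUT / RESERVE: only a registered initiator
        # presenting its own key may acquire the reservation.
        if self.registrations.get(initiator) != key:
            raise ReservationConflict()
        if self.reservation and self.reservation[0] != initiator:
            raise ReservationConflict()
        self.reservation = (initiator, res_type)
        # Note: RESERVE does *not* increment the PR generation.

    def check_write(self, initiator):
        # Simplified IO validation: reject writes from non-holders while
        # an exclusive-style reservation is held.
        if self.reservation and self.reservation[0] != initiator:
            raise ReservationConflict()
```

For example, once `host-a` registers and reserves, a write issued by `host-b` would be answered with a reservation conflict rather than reaching the image.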
Storing the PR data
-------------------

The following PR data needs to be stored:

* PR registrations
  * key
  * initiator id (we'll use the hostname)
* PR reservations
  * key
  * initiator id
  * reservation type
* PR generation

As per the SCSI specifications, the PR generation shall be incremented when processing the following PERSISTENT RESERVE OUT actions:

* REGISTER
* CLEAR
* PREEMPT
* PREEMPT_AND_ABORT
* REGISTER_AND_IGNORE_EXISTING_KEY
* REGISTER_AND_MOVE

The PR generation will not be incremented by the following actions:

* RESERVE
* RELEASE

Similarly to the target_core_rbd implementation, we intend to store the PR data using object extended attributes. The reasons for not using rbd metadata:

* no cmpxattr support
  * used to safely issue updates
  * used to validate IO operations
* no binary data support - we intend to use ceph encoding for convenience and performance reasons

The PR data will be stored as an xattr of the image header object. The cmpxattr operation will be used when performing updates to ensure that the data hasn't been modified in the meantime. In such cases, the PERSISTENT RESERVE OUT command will be retried.

The PR generation will also be stored as an xattr of each image object so that it may be atomically checked using cmpxattr when performing IO operations. Updating each image object's PR generation xattr will require the rbd exclusive lock. Note that the PR generation will also have to be set when new objects get created as a result of write operations, in which case we could check other objects (e.g. the header).

Validating the PR data
----------------------

Before each IO operation, the current PR data will need to be checked. A reservation conflict will be returned if the IO operation is not allowed by the current reservations.

Since the persistent reservations may change between the moment in which they're validated and the moment in which the IO operation is actually performed, we intend to tag each IO operation with the PR generation.
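The cmpxattr-based update scheme described above amounts to a compare-and-swap loop. The following sketch uses an in-memory stand-in for the RADOS cmpxattr operation to show the read-modify-write retry flow; the store class, attribute names and retry count are all illustrative, not part of the actual librados API.

```python
import threading

# In-memory stand-in for an object's extended attributes. A real
# implementation would issue a RADOS write op combining cmpxattr with
# setxattr against the image header object.
class XattrStore:
    def __init__(self):
        self._attrs = {}
        self._lock = threading.Lock()

    def get(self, name):
        return self._attrs.get(name)

    def cmp_and_set(self, name, expected, new):
        # Atomically replace the xattr only if it still holds `expected`,
        # mimicking cmpxattr + setxattr in a single write operation.
        with self._lock:
            if self._attrs.get(name) != expected:
                return False
            self._attrs[name] = new
            return True

def update_pr_data(store, mutate, max_retries=5):
    # Read-modify-write loop: retry the PERSISTENT RESERVE OUT update
    # whenever another client modified the PR data in the meantime.
    for _ in range(max_retries):
        current = store.get("pr_data")
        if store.cmp_and_set("pr_data", current, mutate(current)):
            return True
    return False
```

If two hosts race to register, the loser's cmpxattr fails, it re-reads the fresh PR data and reapplies its change, so no registration is silently lost.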
By using a cmpxattr operation, we can safely discard pending IO operations if the reservations change in the meantime. Failing to do so might allow writes to be performed after a host has been preempted, which can lead to data corruption.

When detecting a PR generation (object xattr) mismatch, we'll read the updated reservations and retry the IO operation if allowed by the current reservation, returning a reservation conflict error to WNBD otherwise.

In order to avoid an additional round trip for each IO operation, we might cache the PR data, only refreshing it when detecting xattr mismatches or when receiving a PR OUT command.

Enabling the feature
--------------------

This feature will initially be considered experimental and disabled by default. Users may enable it by passing "--enable-pr" to rbd-wnbd.

Limitations
-----------

* only Windows hosts that use rbd-wnbd will be aware of the persistent reservations

Public RBD API changes
----------------------

We need each IO operation to atomically validate the PR generation. This is performed using cmpxattr. There are two ways in which the expected PR generation could be passed:

* using the image context
  * would avoid the need for public API changes
  * the image context is shared across IO requests, which could lead to race conditions. For example, the cached PR generation might be refreshed while there are pending IO requests, which would wrongfully use the new tag. We could end up with write operations that proceed despite the host having been preempted in the meantime
* passing the PR generation as an additional argument to each IO operation
  * the parameter could be named "tag" in order to keep this SCSI agnostic
  * we'd end up having an additional set of public librbd functions (e.g. aio_write3)
  * each IO request would have its own copy of the expected tag, avoiding potential race conditions

Alternatives
------------

1. Some Ceph deployments currently use the iSCSI gateway with SUSE's custom "target_core_rbd" module, which supports Persistent Reservations. The PetaSAN solution is another SUSE derivative that leverages this module.

   Downsides:

   * the iSCSI gateway poses performance and scalability limitations that can be avoided by using the native RBD Windows client
   * SUSE Enterprise Storage has been discontinued, which probably means that the custom target_core_rbd module will receive limited support

2. Another option would be to avoid Ceph entirely and use SANs (probably no need to specify the downsides) or Microsoft Storage Spaces [8], which has the following downsides:

   * scalability concerns - maximum of 16 servers
   * high licensing costs
   * can be difficult to consume in mixed Linux/Windows environments

3. CephFS may be used instead of Scale-Out File Server shares.

   Downsides:

   * limited ceph-dokan locking support (host level only through Dokany)
   * no Windows ACL support

4. Exposing CephFS through Samba.

   Downsides:

   * it might not have full SMB3 support or the clustering features of SoFS
   * potentially unsafe if other clients access CephFS directly

5.
Using passthrough VM disk attachments instead of CSV backed VHDx image files.

   Downsides:

   * Hyper-V passthrough disk addressing issues [8]
   * no support for VSS assisted snapshots

Work items
==========

Phase 1
-------

* accept PR IN / OUT commands
* store the PR data as an image object xattr
* check the PR data before each IO operation
* already proposed here: https://github.com/ceph/ceph/pull/48631

Phase 2
-------

* tag IO operations with the PR generation and use cmpxattr to atomically validate the tag
* set / update the PR generation xattr against each image object
  * also need to handle new objects that get created after a write
* implement functional tests
  * some tests might require a multi-node setup (e.g. write while being preempted)
* consider disabling the rbd cache if PR is enabled

Phase 3
-------

* consider implementing some of the SPC-3 commands that weren't mandatory for MSFC support and haven't been implemented already:
  * REPORT CAPABILITIES
  * READ FULL STATUS
  * inform clients of PR updates through the SENSE status
  * PREEMPT AND ABORT reservation action
* reconsider LUN reset handling - at the moment WNBD is discarding the IO queue and marking pending IO as aborted, but it doesn't notify rbd-wnbd

Related work
============

* https://tracker.ceph.com/projects/ceph/wiki/Clustered_SCSI_target_using_RBD

Links
=====

[1] https://github.com/torvalds/linux/blob/76f598ba7d8e2bfb4855b5298caedd5af0c374a8/Documentation/block/pr.rst
[2] https://learn.microsoft.com/en-us/windows-server/failover-clustering/failover-clustering-overview
[3] https://learn.microsoft.com/en-us/windows-server/failover-clustering/failover-cluster-csvs
[4] https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/jj863389(v=ws.11)
[5] https://learn.microsoft.com/en-us/azure-stack/hci/manage/vm-load-balancing
[6] https://techcommunity.microsoft.com/t5/failover-clustering/failover-cluster-vm-load-balancing-in-windows-server-2016/ba-p/372084
[7] https://learn.microsoft.com/en-us/windows-server/failover-clustering/sofs-overview
[8] https://github.com/cloudbase/wnbd/issues/61
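The Phase 2 tag validation discussed under "Validating the PR data" and "Work items" can be sketched as follows. This is an in-memory stand-in only: the class and return values are hypothetical, and the real implementation would perform the check via a cmpxattr on each image object's PR generation xattr rather than a plain comparison.

```python
# Illustrative sketch (not the actual librbd API): each write carries the
# PR generation observed when it was submitted and is applied only if the
# object's generation "xattr" still matches, mimicking the cmpxattr check.
class TaggedObjectStore:
    def __init__(self, pr_generation=0):
        self.pr_generation = pr_generation  # per-object PR generation xattr
        self.data = {}

    def write(self, obj, payload, tag):
        if tag != self.pr_generation:
            # Stale tag: the caller re-reads the reservations and either
            # retries the IO or returns a reservation conflict to WNBD.
            return "GENERATION_MISMATCH"
        self.data[obj] = payload
        return "OK"
```

A write tagged with an outdated generation (e.g. submitted before another host issued PREEMPT) is rejected instead of silently landing on disk, which is exactly the corruption scenario the proposal aims to prevent.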
Updated by Lucian Petrut over 1 year ago
- Status changed from New to Fix Under Review
- Assignee set to Lucian Petrut
- Pull request ID set to 48631