Feature #57109

windows: rbd-wnbd SCSI persistent reservations

Added by Lucian Petrut over 1 year ago. Updated about 1 year ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Lucian Petrut
% Done:
0%
Pull request ID:
48631

Description

rbd-wnbd Persistent Reservation support
=======================================

rbd-wnbd allows exposing RBD images as virtual SCSI disks to Windows hosts.

Our goal is to support the SCSI Persistent Reservation feature [1], which is
commonly leveraged by Active-Active clusters. Specifically, we aim to
accommodate Microsoft Failover Clusters [2] and Cluster Shared Volumes [3].

Use cases
=========

1. Hyper-V clustering [4] transparently moves virtual machines to other
hosts when detecting node failures. It can also perform load balancing, moving
VMs to underutilized hosts [5][6].

Cluster Shared Volumes (CSV) are commonly used to store the clustered virtual
machine disk images (VHD/x files). However, Cluster Shared Volumes require
SCSI Persistent Reservation support.

2. CSVs can also back Scale-Out File Server shares (highly available SMB
shares) [7].

The above-mentioned features are heavily used in Hyper-V deployments.

Proposed change
===============

We propose adding rbd-wnbd support for SCSI Persistent Reservations.

The WNBD driver has already been updated to forward PR commands to
userspace daemons such as rbd-wnbd, which in turn will have to:

* handle SCSI PERSISTENT RESERVE IN commands, used to retrieve the
  current reservations
* handle SCSI PERSISTENT RESERVE OUT commands, used to modify the
  reservations
* return a reservation conflict for IO operations that are forbidden by the
  current reservations

WNBD and MSFC expect SCSI SPC-3 semantics.
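To make the conflict-checking requirement concrete, the following is a minimal in-memory sketch of the SPC-3 write-permission semantics the daemon would have to enforce per IO request. The function and type names are illustrative, not the actual rbd-wnbd code; the reservation types follow the SPC-3 type codes.

```python
# Illustrative sketch (not the actual rbd-wnbd implementation): SPC-3
# write-permission checks that would back the "reservation conflict" logic.
from enum import Enum

class PrType(Enum):
    WRITE_EXCLUSIVE = 1            # only the holder may write
    EXCLUSIVE_ACCESS = 3           # only the holder may read or write
    WRITE_EXCLUSIVE_REG_ONLY = 5   # registered initiators may write
    EXCLUSIVE_ACCESS_REG_ONLY = 6  # registered initiators may read or write
    WRITE_EXCLUSIVE_ALL_REG = 7    # any registrant may write
    EXCLUSIVE_ACCESS_ALL_REG = 8   # any registrant may read or write

def write_allowed(pr_type, holder, registrants, initiator):
    """Return True if a write from `initiator` is permitted."""
    if pr_type is None:            # no reservation held: writes proceed
        return True
    if pr_type in (PrType.WRITE_EXCLUSIVE, PrType.EXCLUSIVE_ACCESS):
        return initiator == holder
    # "registrants only" / "all registrants" types: any registered initiator
    return initiator in registrants
```

A request that fails this check would be completed with a SCSI reservation conflict status rather than being forwarded to the cluster.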

Storing the PR data
-------------------

The following PR data needs to be stored:

* PR registrations
    * key
    * initiator id (we'll use the hostname)
* PR reservations
    * key
    * initiator id
    * reservation type
* PR generation
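The records listed above could be modeled roughly as follows; the class and field names are hypothetical, not the actual rbd-wnbd types, and the on-wire form would use Ceph encoding rather than Python objects.

```python
# Illustrative in-memory model of the PR data listed above.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrRegistration:
    key: int              # 8-byte reservation key
    initiator_id: str     # the hostname, per this proposal

@dataclass
class PrReservation:
    key: int
    initiator_id: str
    type: int             # SCSI reservation type (e.g. Write Exclusive)

@dataclass
class PrState:
    registrations: list = field(default_factory=list)
    reservation: Optional[PrReservation] = None
    generation: int = 0   # incremented by certain PR OUT actions
```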

As per SCSI specifications, the PR generation shall be incremented when
processing the following PERSISTENT RESERVE OUT actions:

* REGISTER
* CLEAR
* PREEMPT
* PREEMPT_AND_ABORT
* REGISTER_AND_IGNORE_EXISTING_KEY
* REGISTER_AND_MOVE

The PR generation will not be incremented by the following actions:

* RESERVE
* RELEASE
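The generation-update rule above can be captured in a small helper; this is a sketch of the rule, not the actual implementation, with action names taken from the SCSI spec.

```python
# Sketch of the PR generation rule: only the listed PERSISTENT RESERVE OUT
# actions bump the generation; RESERVE and RELEASE leave it untouched.
GENERATION_BUMP_ACTIONS = {
    "REGISTER",
    "CLEAR",
    "PREEMPT",
    "PREEMPT_AND_ABORT",
    "REGISTER_AND_IGNORE_EXISTING_KEY",
    "REGISTER_AND_MOVE",
}

def next_generation(generation, action):
    """Return the PR generation after processing the given PR OUT action."""
    return generation + 1 if action in GENERATION_BUMP_ACTIONS else generation
```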

Similar to the target_core_rbd implementation, we intend to store the PR data
using object extended attributes. The reasons for not using RBD image metadata:

* no cmpxattr support
    * used to safely issue updates
    * used to validate IO operations
* no binary data support - we intend to use ceph encoding for convenience and
  performance reasons

The PR data will be stored as an xattr of the image header object. The cmpxattr
operation will be used when performing updates to ensure that it hasn't been
modified in the meantime. In such cases, the PERSISTENT RESERVE OUT command
will be retried.
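The update path described above amounts to a compare-and-swap retry loop. The following in-memory sketch simulates cmpxattr semantics with a plain dict standing in for the header object's xattrs; the function names are illustrative, not librados API.

```python
# In-memory sketch of the PR OUT update path: the PR blob is only replaced
# if the stored copy still matches what we read (cmpxattr-like semantics);
# on a concurrent modification the command is retried.
def cmpxattr_set(xattrs, name, expected, new_value):
    """Atomically replace xattrs[name] only if it still equals `expected`."""
    if xattrs.get(name) != expected:
        return False               # concurrent update detected
    xattrs[name] = new_value
    return True

def apply_pr_out(xattrs, name, mutate, max_retries=5):
    """Read-modify-write the PR blob, retrying on concurrent modification."""
    for _ in range(max_retries):
        current = xattrs.get(name)
        if cmpxattr_set(xattrs, name, current, mutate(current)):
            return True
    return False                   # caller surfaces a retryable error
```

In the real daemon, the compare and the write happen server-side in a single RADOS operation, so the check is race-free without client-side locking.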

The PR generation will also be stored as an xattr of each image object so that
it may be atomically checked using cmpxattr when performing IO operations.

Updating each image object's PR generation xattr will require the RBD exclusive
lock. Note that the PR generation xattr will also have to be set on new objects
that get created as a result of write operations, in which case the value may
be obtained from another object (e.g. the image header).
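The propagation step could look like the following sketch, where image objects are modeled as a dict of per-object xattr dicts and "header" stands in for the image header object; this is assumed to run while holding the exclusive lock, and the names are hypothetical.

```python
# Sketch of propagating the PR generation to every image object, plus the
# fallback for freshly created objects that have no xattr yet.
def stamp_pr_generation(objects, generation):
    """Set the PR generation xattr on each image object."""
    for xattrs in objects.values():
        xattrs["pr_generation"] = generation

def generation_for_new_object(objects):
    """A new object carries no xattr; borrow the value from the header."""
    return objects["header"].get("pr_generation", 0)
```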

Validating the PR data
----------------------

Before each IO operation, the current PR data will need to be checked. A
reservation conflict will be returned if the IO operation is not allowed by
the current reservations.

Since the persistent reservations may change between the moment in which they're
validated and the moment in which the IO operation is actually performed, we
intend to tag each IO operation with the PR generation. By using a cmpxattr
operation, we can safely discard pending IO operations if the reservations
change in the meantime. Failing to do so might allow writes to be performed
after a host has been preempted, which can lead to data corruption.

When detecting a PR generation (object xattr) mismatch, we'll read the updated
reservations and retry the IO operation if allowed by the current reservation,
returning a reservation conflict error to WNBD otherwise.
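The IO-path behavior described above can be sketched as follows: each write carries the generation observed at validation time, a cmpxattr-style check against the object's xattr detects concurrent PR changes, and on a mismatch the reservations are refreshed and the write is retried or rejected. All names here are illustrative, not the actual rbd-wnbd code.

```python
# Sketch of the tagged write path: the write only proceeds if the object's
# PR generation still matches the tag, or if a refresh shows the write is
# still permitted under the updated reservations.
RESERVATION_CONFLICT = "reservation_conflict"

def do_write(obj_xattrs, data_sink, expected_gen, payload, allowed_after_refresh):
    if obj_xattrs.get("pr_generation") != expected_gen:
        # Generation changed in the meantime: re-read the reservations and
        # retry only if the refreshed state still permits this initiator.
        if not allowed_after_refresh():
            return RESERVATION_CONFLICT
        expected_gen = obj_xattrs["pr_generation"]
    data_sink.append(payload)
    return "ok"
```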

In order to avoid the additional round trip for each IO operation, we might
cache the PR data, only refreshing it when detecting xattr mismatches or when
receiving a PR OUT command.

Enabling the feature
--------------------

This feature will initially be considered experimental and disabled by default.
Users may enable it by passing "--enable-pr" to rbd-wnbd.

Limitations
-----------

* only Windows hosts that use rbd-wnbd will be aware of the persistent
  reservations

Public RBD API changes
----------------------

We need each IO operation to atomically validate the PR generation. This is
performed using cmpxattr.

There are two ways in which the expected PR generation could be passed:

* using the image context
    * would avoid the need of public API changes
    * the image context is shared across IO requests, which could lead to race
      conditions. For example, the cached PR generation might be refreshed
      while having pending IO requests, which would wrongfully use the new tag.
      We could end up with write operations that proceed despite the host
      being preempted in the meantime
* passing the PR generation as an additional argument to each IO operation
    * the parameter could be named "tag" in order to keep this SCSI agnostic
    * we'd end up having an additional set of public librbd functions
      (e.g. aio_write3)
    * each IO request would have its own copy of the expected tag, avoiding
      potential race conditions
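For the second option, a signature along the following lines might be expected; this is a hypothetical sketch mirroring the proposed aio_write3-style API, not an actual librbd binding.

```python
# Hypothetical sketch of the per-request tag approach: each request owns its
# copy of the expected tag, so a cache refresh on the shared image context
# cannot retroactively retag in-flight writes.
def aio_write3(image, offset, data, completion, tag):
    """Submit an async write that is only applied if the target object's tag
    (the PR generation, in rbd-wnbd's case) still matches `tag`."""
    request = {"offset": offset, "data": data, "tag": tag, "cb": completion}
    image["queue"].append(request)   # the request carries its own tag copy
    return request
```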

Alternatives
------------

1. Some Ceph deployments currently use the iSCSI gateway with SUSE's custom
"target_core_rbd" module, which supports Persistent Reservations. The PetaSAN
solution is another SUSE derivative that leverages this module.

Downsides:

* the iSCSI gateway poses performance and scalability limitations that can be
  avoided by using the native RBD Windows client
* SUSE Enterprise Storage has been discontinued, which probably means that the
  custom target_core_rbd module will receive limited support

2. Another option would be to avoid Ceph entirely and use SANs (probably no need
to specify the downsides) or Microsoft Storage Spaces [8], which has the
following downsides:

* scalability concerns - maximum of 16 servers
* high licensing costs
* can be difficult to consume in mixed Linux/Windows environments

3. CephFS may be used instead of Scale-Out File Server shares.

Downsides:

* limited ceph-dokan locking support (host level only through Dokany)
* no Windows ACL support

4. Exposing CephFS through Samba

Downsides:

* it might not have full SMB3 support or the clustering features of SoFS
* potentially unsafe if other clients access CephFS directly

5. Using passthrough VM disk attachments instead of CSV backed VHDx image files

Downsides:

* Hyper-V passthrough disk addressing issues [8]
* No support for VSS assisted snapshots

Work items
==========

Phase 1
-------

* accept PR IN / OUT commands
* store the PR data as image object xattr
* check the PR data before each IO operation
* already proposed here: https://github.com/ceph/ceph/pull/48631

Phase 2
-------

* tag IO operations with the PR generation and use cmpxattr to atomically
  validate the tag
* set / update the PR generation xattr against each image object
    * also need to handle new objects that got created after a write
* implement functional tests
    * some tests might require a multi-node setup (e.g. write while being
      preempted)
* consider disabling the rbd cache if PR is enabled

Phase 3
-------

* consider implementing some of the SPC-3 commands that weren't mandatory
  for MSFC support and haven't been implemented already
    * REPORT CAPABILITIES
    * READ FULL STATUS
    * inform clients of PR updates through the SENSE status
        * PREEMPT AND ABORT reservation action
        * reconsider LUN reset handling - at the moment WNBD discards the IO
          queue and marks pending IO as aborted, but it doesn't notify
          rbd-wnbd

Related work
============

* https://tracker.ceph.com/projects/ceph/wiki/Clustered_SCSI_target_using_RBD

Links
=====

[1] https://github.com/torvalds/linux/blob/76f598ba7d8e2bfb4855b5298caedd5af0c374a8/Documentation/block/pr.rst
[2] https://learn.microsoft.com/en-us/windows-server/failover-clustering/failover-clustering-overview
[3] https://learn.microsoft.com/en-us/windows-server/failover-clustering/failover-cluster-csvs
[4] https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/jj863389(v=ws.11)
[5] https://learn.microsoft.com/en-us/azure-stack/hci/manage/vm-load-balancing
[6] https://techcommunity.microsoft.com/t5/failover-clustering/failover-cluster-vm-load-balancing-in-windows-server-2016/ba-p/372084
[7] https://learn.microsoft.com/en-us/windows-server/failover-clustering/sofs-overview
[8] https://github.com/cloudbase/wnbd/issues/61

Actions #1

Updated by Lucian Petrut over 1 year ago

  • Status changed from New to Fix Under Review
  • Assignee set to Lucian Petrut
  • Pull request ID set to 48631
Actions #2

Updated by Lucian Petrut about 1 year ago

  • Description updated (diff)
Actions #3

Updated by Lucian Petrut about 1 year ago

  • Description updated (diff)
Actions #4

Updated by Lucian Petrut about 1 year ago

  • Description updated (diff)