
h1. Clustered SCSI target using RBD 

 h3. Summary 

 The goal of this project is to modify the Linux target layer, LIO, to be able to support active/active access to a device across multiple nodes running LIO. The changes to LIO are being done in a generic way to allow other cluster aware devices to be used, but our focus is on using RBD. 

 h3. Owners 

 * Mike Christie (Red Hat) 

 h3. Interested Parties 

 * Name (Affiliation) 

 h3. Current Status 

 There are many methods to configure LIO's iSCSI target for High Availability (HA). None support active/active, and most open source implementations do not support distributed SCSI Persistent Group Reservations. 

 h3. Detailed Description 

 In order for some operating systems to be able to access RBD, they must go through a SCSI target gateway. Support for RBD with target layers like LIO and TGT exists today, but the HA implementations are lacking features, difficult to use, or only support one transport like iSCSI. To resolve these issues, we are modifying LIO so that it can be run on multiple nodes and provide SCSI active-optimized access through all ports on all nodes at the same time. 

 There are several areas where active/active support in LIO requires distributed meta data and/or locking: SCSI task management/Unit Attention/PREEMPT AND ABORT handling, COMPARE_AND_WRITE support, Persistent Group Reservations, and INQUIRY/ALUA/discovery/setup related commands and state. 

 - SCSI task management (TMF) / Unit Attention (UA) / PREEMPT AND ABORT handling: 
 When an initiator cannot determine the state of a device or of commands running on the device, it will send TMFs like LOGICAL UNIT RESET. Depending on the SCSI settings used (TAS, QERR, TST), requests like this may require actions to be taken on all LIO nodes. For example, running commands might need to be aborted, notifications like Unit Attentions must be sent, etc. 
 
 Other non-TMF requests, like PERSISTENT RESERVE OUT - PREEMPT AND ABORT, may also require commands to be aborted on remote nodes. 

 
 
 To synchronize TMF execution across nodes, the ceph watch notify feature will be used. The initial patches for this were posted on ceph-devel here: 
 http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/24553 

 The current version, with fixes and additions by Douglas Fuller, can be found here: 
 https://github.com/fullerdj/ceph-client/commits/wip-djf-watch-notify2-new 
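
 For readers unfamiliar with the watch notify primitive, below is a minimal userspace sketch of the idea using the librados C API (the kernel patches above use the in-kernel equivalent). The object name, payload, and pool are made up for illustration, and error handling is omitted. 

<pre>
/*
 * Sketch of the RADOS watch/notify primitive via librados.  Every gateway
 * node watches a well-known object; the node handling a TMF sends a notify
 * and blocks until all watchers acknowledge.  Names here are illustrative.
 */
#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

/* Invoked on every watcher when a peer sends a notify. */
static void watch_cb(void *arg, uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, void *data, size_t data_len)
{
        rados_ioctx_t ioctx = *(rados_ioctx_t *)arg;

        printf("got notify: %.*s\n", (int)data_len, (char *)data);
        /* Quiesce/abort local commands here, then ack so the notifier
         * knows this node has completed the requested action. */
        rados_notify_ack(ioctx, "tmf-sync-obj", notify_id, cookie, NULL, 0);
}

static void watch_err_cb(void *arg, uint64_t cookie, int err)
{
        fprintf(stderr, "watch error %d, need to re-watch\n", err);
}

int main(void)
{
        rados_t cluster;
        rados_ioctx_t ioctx;
        uint64_t handle;
        char *reply = NULL;
        size_t reply_len = 0;

        rados_create(&cluster, NULL);
        rados_conf_read_file(cluster, NULL);
        rados_connect(cluster);
        rados_ioctx_create(cluster, "rbd", &ioctx);

        /* Make sure the shared object exists, then register a watch on it. */
        rados_write_full(ioctx, "tmf-sync-obj", "", 0);
        rados_watch2(ioctx, "tmf-sync-obj", &handle, watch_cb, watch_err_cb,
                     &ioctx);

        /* The node handling e.g. a LUN RESET notifies all peers and waits
         * (up to 5s) for their acks before completing the TMF. */
        rados_notify2(ioctx, "tmf-sync-obj", "LUN_RESET", strlen("LUN_RESET"),
                      5000, &reply, &reply_len);
        rados_buffer_free(reply);

        rados_unwatch2(ioctx, handle);
        rados_ioctx_destroy(ioctx);
        rados_shutdown(cluster);
        return 0;
}
</pre>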

 Status: 
 We are currently adding an interface to the Linux kernel block layer where upper layers can request that block drivers abort commands, reset devices, and send events. LIO's iblock backend module would use this interface to pass task management requests down to Low Level Drivers (LLDs) like krbd, which would perform the driver specific actions, with RBD handling the synchronization and clean up across nodes. A hypothetical sketch of such an interface follows. 
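
 Purely to illustrate the direction described above, a block layer task management hook could look roughly like the following. This is a hypothetical sketch only: nothing like it has been merged, and all of the names below are invented here. 

<pre>
/*
 * HYPOTHETICAL sketch: none of these names exist upstream.  The idea is
 * that a driver like krbd would supply callbacks that the target (LIO's
 * iblock backend) invokes for TMFs, and implement them with cluster-wide
 * coordination (for example via watch/notify).
 */
struct blk_tmf_ops {
        /* Abort a single outstanding request on this device. */
        int (*abort_request)(struct gendisk *disk, u64 tag);
        /* Reset the logical unit; the driver syncs with peer nodes first. */
        int (*lun_reset)(struct gendisk *disk);
        /* Deliver an event (e.g. a Unit Attention) raised by a peer node. */
        void (*event)(struct gendisk *disk, int event);
};
</pre>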

 - COMPARE_AND_WRITE (CAW) support: 

 CAW is a SCSI command used by ESX to perform fine-grained locking. The execution of CAW requires that the handler atomically read N blocks of data, compare them to a buffer passed in with the command, and then, if they match, write N blocks of data. To guarantee this operation is done atomically, LIO uses a standard Linux kernel mutex. For multi-node active/active support, we have proposed to pass this request to the backing storage. This will allow the backing storage to utilize its own locking and serialization support, and LIO will not need to use a clustered lock like DLM. 
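
 As a rough userspace illustration of these CAW semantics (this is not LIO code; a 512 byte block size is assumed, and a process-local mutex is used, which is exactly the kind of single-node serialization that does not extend to multiple nodes): 

<pre>
/*
 * Userspace sketch of COMPARE AND WRITE semantics: atomically read
 * "blocks" logical blocks, compare them against the verify buffer, and
 * only write the new data if they match.  Single node only: the mutex
 * makes the read-compare-write atomic within one process, which is what
 * a clustered target cannot rely on.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 512

static pthread_mutex_t caw_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 on success, -1 on miscompare or I/O error. */
int compare_and_write(int fd, uint64_t lba, unsigned int blocks,
                      const void *verify, const void *data)
{
        size_t len = (size_t)blocks * BLOCK_SIZE;
        off_t off = (off_t)lba * BLOCK_SIZE;
        void *cur = malloc(len);
        int ret = -1;

        if (!cur)
                return -1;

        pthread_mutex_lock(&caw_lock);
        if (pread(fd, cur, len, off) == (ssize_t)len &&
            memcmp(cur, verify, len) == 0 &&
            pwrite(fd, data, len, off) == (ssize_t)len)
                ret = 0;
        pthread_mutex_unlock(&caw_lock);

        free(cur);
        return ret;
}
</pre>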

 Patches for passing COMPARE_AND_WRITE directly to the backing store have been sent upstream for review: 
 http://www.spinics.net/lists/target-devel/msg07823.html 

 The current patches, along with ceph/rbd support, are here: 

 https://github.com/mikechristie/linux-kernel/commits/ceph 

 Status: 
 The request/bio operation patches are waiting to be merged; once they are, the ceph/rbd patches will be posted for review. 

 - Persistent Group Reservations (PGR): 

 PGRs allow an initiator to control access to a device. This access information needs to be distributed across all nodes and can be dynamically updated while other commands are being processed. 

 David Disseldorp has implemented ceph/rbd PGR support here: 

 https://git.samba.org/?p=ddiss/linux.git;a=shortlog;h=refs/heads/target_rbd_pr_sq_20160126 

 This code is now being ported to the upstream Linux kernel reservation API added in this patch: 

 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/block/ioctl.c?id=bbd3e064362e5057cc4799ba2e4d68c7593e490b 

 When this is completed, LIO will call into the iblock backend which will then call rbd's pr_ops. 
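
 For reference, the reservation API mentioned above is also exposed to userspace through block device ioctls defined in include/uapi/linux/pr.h. Below is a minimal sketch of registering a key and taking a reservation through that interface; the device path and key are illustrative, and the ioctls only succeed on devices whose driver implements pr_ops. LIO's iblock backend would instead call the same pr_ops callbacks from inside the kernel. 

<pre>
/*
 * Minimal sketch of the Linux block layer persistent reservation API
 * (include/uapi/linux/pr.h).  The ioctls only work on devices whose
 * driver implements pr_ops; the device path and keys are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/pr.h>

int main(void)
{
        struct pr_registration reg = { .old_key = 0, .new_key = 0x1234 };
        struct pr_reservation  rsv = { .key = 0x1234,
                                       .type = PR_WRITE_EXCLUSIVE };
        int fd = open("/dev/rbd0", O_RDWR);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Register our key, then take a write exclusive reservation. */
        if (ioctl(fd, IOC_PR_REGISTER, &reg) ||
            ioctl(fd, IOC_PR_RESERVE, &rsv))
                perror("pr ioctl");

        close(fd);
        return 0;
}
</pre>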


 - Device state: 
 I have only just begun to look into how to distribute device state like ALUA port states, INQUIRY info like UUIDs, and device settings. 

 - Configuration: 
 SUSE's lrbd package will manage configuration: 

 https://github.com/SUSE/lrbd 

 h3. Work items 

 h4. Coding tasks 

 # Task 1 
 # Task 2 
 # Task 3 

 h4. Build / release tasks 

 # Task 1 
 # Task 2 
 # Task 3 

 h4. Documentation tasks 

 # Task 1 
 # Task 2 
 # Task 3 

 h4. Deprecation tasks 

 # Task 1 
 # Task 2 
 # Task 3