Clustered SCSI target using RBD » History » Version 6

Mike Christie, 01/28/2016 10:36 PM

h1. Clustered SCSI target using RBD

h3. Summary

The goal of this project is to modify LIO, the Linux SCSI target layer, to support active/active access to a device across multiple nodes running LIO. The changes to LIO are being made generically so that other cluster-aware devices can be used, but our focus is on RBD.

h3. Owners

* Mike Christie (Red Hat)

h3. Interested Parties

* Name (Affiliation)

h3. Current Status

There are many methods to configure LIO's iSCSI target for High Availability (HA). None support active/active, and most open source implementations do not support distributed SCSI Persistent Group Reservations.

h3. Detailed Description

Some operating systems can access RBD only through a SCSI target gateway. Support for RBD with target layers like LIO and TGT exists today, but the HA implementations lack features, are difficult to use, or support only one transport, such as iSCSI. To resolve these issues, we are modifying LIO so that it can run on multiple nodes and provide SCSI active-optimized access through all ports on all nodes at the same time.

There are several areas where active/active support in LIO requires distributed metadata and/or locking: SCSI task management/Unit Attention/PREEMPT AND ABORT handling, COMPARE_AND_WRITE support, Persistent Group Reservations, and INQUIRY/ALUA/discovery/setup related commands and state.

- SCSI task management (TMF) / Unit Attention (UA) / PREEMPT AND ABORT handling:
When an initiator cannot determine the state of a device or of the commands running on it, it sends TMFs such as LOGICAL UNIT RESET. Depending on the SCSI settings used (TAS, QERR, TST), requests like this may require actions to be taken on all LIO nodes. For example, running commands might need to be aborted, and notifications such as Unit Attentions must be sent.
 
Other non-TMF requests, such as PERSISTENT RESERVE OUT - PREEMPT AND ABORT, may also require commands to be aborted on remote nodes.

To synchronize TMF execution across nodes, the ceph watch/notify feature will be used. The initial patches for this were posted on ceph-devel here:
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/24553

The current version with fixes and additions by Douglas Fuller can be found here:
https://github.com/fullerdj/ceph-client/commits/wip-djf-watch-notify2-new

Status:
We are currently modifying the block layer to support task management requests, so that LIO's iblock backend module can call into Low Level Drivers (LLDs) like krbd to perform driver-specific actions.
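As a rough illustration of why a reset has to reach every node, the following Python sketch models a notification channel in memory. It is not the ceph watch/notify API and not LIO code: @Node@, @NotifyChannel@, the @LUN_RESET@ event and the @RESET_OCCURRED@ Unit Attention name are all hypothetical, and which initiators actually receive a Unit Attention depends on the SCSI settings mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One LIO gateway node (hypothetical model, not a kernel structure)."""
    name: str
    running: set = field(default_factory=set)       # tags of in-flight commands
    pending_ua: list = field(default_factory=list)  # Unit Attentions to report

    def on_notify(self, event: str) -> int:
        """React to a cluster-wide event; return the number of aborted commands."""
        if event != "LUN_RESET":
            return 0
        aborted = len(self.running)
        self.running.clear()                      # abort everything in flight
        self.pending_ua.append("RESET_OCCURRED")  # placeholder UA name
        return aborted


class NotifyChannel:
    """In-memory stand-in for a watch/notify channel shared by all nodes."""

    def __init__(self):
        self.watchers = []

    def watch(self, node: Node) -> None:
        self.watchers.append(node)

    def notify(self, event: str) -> dict:
        # Deliver the event to every watcher and collect the acknowledgements,
        # so the originating node can complete the TMF only after all nodes acted.
        return {node.name: node.on_notify(event) for node in self.watchers}


channel = NotifyChannel()
gw_a = Node("gw-a", running={1, 2})
gw_b = Node("gw-b", running={7})
channel.watch(gw_a)
channel.watch(gw_b)
acks = channel.notify("LUN_RESET")
print(acks)  # {'gw-a': 2, 'gw-b': 1}
```

The point of returning the acknowledgements is that the node that received the TMF should not report it as complete until every other node has aborted its own commands.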
 
- COMPARE_AND_WRITE (CAW) support:

CAW is a SCSI command used by ESX to perform fine-grained locking. Executing CAW requires that the handler atomically read N blocks of data, compare them to a buffer passed in with the command, and, if they match, write N blocks of data. To guarantee that this operation is atomic, LIO uses a standard Linux kernel mutex. For multiple node active/active support, we have proposed passing this request to the backing storage. This allows the backing storage to use its own locking and serialization support, so LIO does not need a clustered lock like DLM.
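The read, compare and conditional write sequence described above can be sketched in Python as follows. This is a toy model under stated assumptions, not LIO or rbd code: @BackingStore@, its methods and the 512-byte block size are hypothetical, and a @threading.Lock@ stands in for whatever locking the real backing storage uses.

```python
import threading

BLOCK_SIZE = 512  # assumed logical block size in bytes


class BackingStore:
    """Toy backing store that serializes COMPARE_AND_WRITE internally."""

    def __init__(self, num_blocks: int):
        self._data = bytearray(num_blocks * BLOCK_SIZE)
        self._lock = threading.Lock()  # stands in for the store's own locking

    def read(self, lba: int, nblocks: int) -> bytes:
        start = lba * BLOCK_SIZE
        return bytes(self._data[start:start + nblocks * BLOCK_SIZE])

    def compare_and_write(self, lba: int, expected: bytes, new: bytes) -> bool:
        """Write `new` at `lba` only if the current data equals `expected`."""
        if len(expected) != len(new) or len(expected) % BLOCK_SIZE:
            raise ValueError("buffers must be equal-sized whole blocks")
        start = lba * BLOCK_SIZE
        end = start + len(expected)
        if end > len(self._data):
            raise ValueError("request extends past the end of the device")
        # Read, compare and write happen under one lock, so no other
        # caller can modify the blocks between the compare and the write.
        with self._lock:
            if self._data[start:end] != expected:
                return False  # a real target would report a miscompare
            self._data[start:end] = new
            return True
```

Because the compare and the write happen inside the store, every gateway node can forward the command without coordinating with the other nodes first.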
 
Patches for passing COMPARE_AND_WRITE directly to the backing store have been sent upstream for review:
http://www.spinics.net/lists/target-devel/msg07823.html

The current patches along with ceph/rbd support are here:

https://github.com/mikechristie/linux-kernel/commits/ceph

Status:
The request/bio operation patches are waiting to be merged. The ceph/rbd patches will be posted for review when that is done.

- Persistent Group Reservations (PGR):

PGRs allow an initiator to control access to a device. This access information needs to be distributed across all nodes and can be dynamically updated while other commands are being processed.
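To make the shared state concrete, here is a deliberately simplified Python sketch of the information every node would need to see consistently. It is not David Disseldorp's implementation or the kernel reservation API: @ReservationState@ and its methods are hypothetical, and only a single write-exclusive style reservation is modelled, without preemption or the other reservation types.

```python
class ReservationState:
    """Simplified PGR state that all gateway nodes must agree on."""

    def __init__(self):
        self.registrations = {}  # initiator name -> reservation key
        self.holder = None       # initiator holding the reservation, if any

    def register(self, initiator: str, key: int) -> None:
        self.registrations[initiator] = key

    def reserve(self, initiator: str) -> bool:
        """Take the reservation; only registered initiators may do so."""
        if initiator not in self.registrations:
            return False
        if self.holder not in (None, initiator):
            return False  # someone else holds it: reservation conflict
        self.holder = initiator
        return True

    def may_write(self, initiator: str) -> bool:
        # Write-exclusive style check: with no reservation anyone may write,
        # otherwise only the holder may.
        return self.holder is None or self.holder == initiator


# Both "nodes" consult the same state object here; in a cluster this state
# would have to live in storage that every node can read and update.
shared = ReservationState()
shared.register("host1", 0xA)
shared.register("host2", 0xB)
shared.reserve("host1")         # processed on one node
print(shared.may_write("host2"))  # checked on another node
```

If each node kept its own private copy of this state, a write from @host2@ arriving at a second node could be accepted even though @host1@ holds the reservation, which is the failure the distributed implementation has to prevent.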
 
David Disseldorp has implemented ceph/rbd PGR support here:

https://git.samba.org/?p=ddiss/linux.git;a=shortlog;h=refs/heads/target_rbd_pr_sq_20160126

Status:
This code is now being ported to the upstream Linux kernel reservation API added in this commit:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/block/ioctl.c?id=bbd3e064362e5057cc4799ba2e4d68c7593e490b

When this is completed, LIO will call into the iblock backend, which will then call rbd's pr_ops.

- Device state and configuration:

SUSE's lrbd package will manage configuration:

https://github.com/SUSE/lrbd

Status:

lrbd currently supports only iSCSI. The LIO and ceph/rbd modifications are not tied to specific SCSI transports, so lrbd will be modified to support Fibre Channel, SRP, etc.

- Extra:

The most common use will likely be with vSphere/ESX. In this initial version, most VAAI functions will be supported. VVOL/VASA support is being investigated for the next release.

VAAI:

Delete/UNMAP - already completed and upstream.

ATS/COMPARE_AND_WRITE - completed and awaiting posting/review. Current patches:
https://github.com/mikechristie/linux-kernel/tree/ceph

Zero/WRITE_SAME - completed and awaiting posting/review. Current patches:
https://github.com/mikechristie/linux-kernel/tree/ceph

Clone/XCOPY/EXTENDED_COPY - This operation can be done in chunks of 4-16 MB, but ceph/rbd's cloning function works at the device level. Currently, LIO will report that it supports this operation, but the target node has to read and write the data itself, so there is no true offloading.
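The emulated copy described above, where the target node moves the data itself, amounts to a chunked read and write loop. The Python sketch below illustrates that loop on file-like objects; @emulated_copy@ and its parameters are hypothetical names, and the 4 MB default is simply the low end of the range mentioned above.

```python
import io

CHUNK = 4 * 1024 * 1024  # 4 MB default chunk size (assumed)


def emulated_copy(src, dst, src_off, dst_off, length, chunk=CHUNK):
    """Copy `length` bytes from src to dst by reading and writing chunk by chunk."""
    copied = 0
    while copied < length:
        n = min(chunk, length - copied)
        src.seek(src_off + copied)
        data = src.read(n)
        if len(data) != n:
            raise OSError("short read from source")
        dst.seek(dst_off + copied)
        dst.write(data)
        copied += n
    return copied


# Small in-memory example with a tiny chunk size so the loop runs several times.
source = io.BytesIO(bytes(range(256)) * 4)   # 1024 bytes of sample data
target = io.BytesIO(bytes(1024))             # 1024 zero bytes
emulated_copy(source, target, src_off=100, dst_off=200, length=300, chunk=128)
```

Every byte crosses the target node twice in this scheme, once on the read and once on the write, which is why it does not count as true offloading.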
 
XCOPY/EXTENDED_COPY is being implemented in the upstream kernel. For the next version, we will investigate ceph/rbd support. We are also looking into the VASA cloneVirtualVolume operation.

VVOL/VASA:

This will not be completed in this version.

h3. Work items

h4. Coding tasks

# Task 1
# Task 2
# Task 3

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3