h1. Clustered SCSI target using RBD

h3. Summary

The goal of this project is to modify the Linux target layer, LIO, so that it can support active/active access to a device across multiple nodes running LIO. The changes to LIO are being made in a generic way so that other cluster-aware devices can be used, but our focus is on RBD.

h3. Owners

* Mike Christie (Red Hat)

h3. Interested Parties

* Name (Affiliation)

h3. Current Status

There are many ways to configure LIO's iSCSI target for High Availability (HA). None support active/active, and most open source implementations do not support distributed SCSI Persistent Group Reservations.

h3. Detailed Description

In order for some operating systems to access RBD, they must go through a SCSI target gateway. Support for RBD with target layers like LIO and TGT exists today, but the HA implementations lack features, are difficult to use, or support only one transport, such as iSCSI. To resolve these issues, we are modifying LIO so that it can be run on multiple nodes and provide SCSI active-optimized access through all ports on all nodes at the same time.

There are several areas where active/active support in LIO requires distributed metadata and/or locking: SCSI task management/Unit Attention/PREEMPT AND ABORT handling, COMPARE_AND_WRITE support, Persistent Group Reservations, and INQUIRY/ALUA/discovery/setup related commands and state.

- SCSI task management (TMF) / Unit Attention (UA) / PREEMPT AND ABORT handling:

When an initiator cannot determine the state of a device or of the commands running on it, it will send TMFs like LOGICAL UNIT RESET. Depending on the SCSI settings in use (TAS, QERR, TST), such requests may require actions to be taken on all LIO nodes. For example, running commands might need to be aborted, notifications like Unit Attentions must be sent, etc.
 
Other non-TMF requests, like PERSISTENT RESERVE IN - PREEMPT AND ABORT, may also require commands to be aborted on remote nodes.
 
To support these requirements, we are investigating passing these TMFs/requests to userspace and using cluster tools like Corosync's cpg library to execute helper scripts on all nodes. These scripts would interact with LIO through its configfs interface to sync up and perform the reset; a rough sketch of this approach follows.
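
Below is a minimal sketch of the cpg-based idea, assuming a hypothetical helper at /usr/libexec/lio-lun-reset that quiesces and resets a LUN through LIO's configfs tree; the group name and message format are likewise invented. Each node runs this, and cpg's agreed ordering delivers the reset request to every member's callback.

<pre><code class="c">
#include <corosync/corotypes.h>
#include <corosync/cpg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Deliver callback: cpg runs this on every node in the group,
 * including the sender, in the same total order. */
static void reset_deliver(cpg_handle_t handle, const struct cpg_name *group,
			  uint32_t nodeid, uint32_t pid,
			  void *msg, size_t msg_len)
{
	char cmd[256];

	/* msg is assumed to be a NUL-terminated LUN identifier. */
	snprintf(cmd, sizeof(cmd), "/usr/libexec/lio-lun-reset %.*s",
		 (int)msg_len, (char *)msg);
	system(cmd);	/* hypothetical helper pokes LIO's configfs tree */
}

static cpg_callbacks_t callbacks = {
	.cpg_deliver_fn = reset_deliver,
};

int main(int argc, char **argv)
{
	struct cpg_name group = { .length = 7, .value = "lio_tmf" };
	cpg_handle_t handle;
	struct iovec iov;

	if (argc != 2)
		return 1;
	if (cpg_initialize(&handle, &callbacks) != CS_OK ||
	    cpg_join(handle, &group) != CS_OK)
		return 1;

	/* Broadcast the LUN to reset to all group members. */
	iov.iov_base = argv[1];
	iov.iov_len = strlen(argv[1]) + 1;
	cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

	/* Block until our own copy of the message is delivered. */
	cpg_dispatch(handle, CS_DISPATCH_ONE);
	cpg_leave(handle, &group);
	cpg_finalize(handle);
	return 0;
}
</code></pre>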
 
Another alternative we are looking into is pushing the handling down to the real backing storage device. We would add an interface to the Linux kernel block layer through which upper layers can request that block drivers abort commands, reset devices, and send events. LIO would then call into RBD, and RBD would handle the synchronization and cleanup. A sketch of what such an interface could look like is below.
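
No such callouts exist upstream today, so the following is purely a hypothetical illustration (every name is invented) of the shape the interface might take; a driver like rbd would implement the callouts and fan each request out across the cluster:

<pre><code class="c">
#include <stdint.h>

/* Hypothetical, for illustration only: none of these names exist in
 * the kernel. LIO would invoke the callouts; the block driver (e.g.
 * rbd) would synchronize the operation across all nodes. */

enum blk_cluster_event {
	BLK_CLUSTER_EV_UNIT_ATTENTION,	/* e.g. after a device reset */
	BLK_CLUSTER_EV_RESET_DONE,
};

struct blk_cluster_ops {
	/* Abort every in-flight command owned by this initiator. */
	int (*abort_cmds)(void *disk, uint64_t initiator_id);
	/* Device-wide reset; the driver quiesces all nodes first. */
	int (*reset_dev)(void *disk);
	/* Raise an event (e.g. a Unit Attention) on every node. */
	int (*send_event)(void *disk, enum blk_cluster_event ev);
};
</code></pre>
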
- COMPARE_AND_WRITE (CAW) support:

CAW is a SCSI command used by ESX to perform fine-grained locking. Executing CAW requires the handler to atomically read N blocks of data, compare them to a buffer passed in with the command, and, if they match, write N blocks of data. To guarantee this operation is done atomically, LIO uses a standard Linux kernel mutex. For multi-node active/active support, we have proposed passing this request to the backing storage. This will allow the backing storage to use its own locking and serialization support, so LIO will not need a clustered lock like DLM. The required semantics are modeled below.
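
As a reference for the semantics only (a userspace model, not LIO code), everything inside the function below has to execute atomically with respect to all other I/O to the same LBA range, on every node; with the proposed patches that responsibility moves from LIO's mutex to the backing store:

<pre><code class="c">
#include <stdbool.h>
#include <string.h>

/* Model of COMPARE_AND_WRITE: the payload carries 2*N blocks, the
 * first N to verify against the medium, the second N to write. */
static bool caw(unsigned char *dev, unsigned long lba,
		unsigned int block_size, unsigned int nblocks,
		const unsigned char *verify_buf,
		const unsigned char *write_buf)
{
	unsigned char *blocks = dev + (size_t)lba * block_size;
	size_t len = (size_t)nblocks * block_size;

	/* Read and compare; a mismatch fails the command (MISCOMPARE). */
	if (memcmp(blocks, verify_buf, len) != 0)
		return false;

	/* The compare matched, so commit the new data. */
	memcpy(blocks, write_buf, len);
	return true;
}
</code></pre>
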
Patches for passing COMPARE_AND_WRITE directly to the backing store have been sent upstream for review:
http://www.spinics.net/lists/target-devel/msg07823.html

RBD support has not yet been implemented.

- Persistent Group Reservations (PGR):

PGRs allow an initiator to control access to a device. This access information needs to be distributed across all nodes and can be dynamically updated while other commands are being processed. To distribute this information, pushing the PGR execution to userspace and using Corosync's cpg library is being investigated. Other alternatives being considered are using DLM or a clustered FS to share the data among nodes. A sketch of distributing PGR updates over cpg follows.
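
To give a feel for what would be distributed, the sketch below multicasts a minimal registration record over a cpg group; the record layout is invented, and real PGR state also carries the reservation holder, type, scope, and APTPL flag:

<pre><code class="c">
#include <corosync/corotypes.h>
#include <corosync/cpg.h>
#include <stdint.h>
#include <sys/uio.h>

/* Invented, minimal PGR update record. */
struct pgr_update {
	char	 initiator[228];	/* transport ID of the initiator */
	uint64_t key;			/* PR OUT registration key */
	uint8_t	 op;			/* 0 = register, 1 = unregister */
};

/* Multicast one registration change. CPG_TYPE_AGREED ordering means
 * every node applies PGR updates in the same sequence, so the nodes'
 * copies of the reservation state cannot diverge. */
static int pgr_broadcast(cpg_handle_t handle, const struct pgr_update *up)
{
	struct iovec iov = {
		.iov_base = (void *)up,
		.iov_len  = sizeof(*up),
	};

	return cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1) == CS_OK ? 0 : -1;
}
</code></pre>
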
- Device state:

I have only just begun to look into how to distribute device state like ALUA port states, INQUIRY info like UUIDs, and device settings; a rough list of the state involved is sketched below.
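
To make the scope concrete, a first cut at the per-device state every node would have to agree on might look like this (the field set is invented for illustration):

<pre><code class="c">
#include <stdint.h>

/* Invented illustration of state that must match on all nodes for
 * the cluster to present itself as one SCSI device. */
struct lio_dev_state {
	char	 wwn[32];	/* INQUIRY VPD page 0x83 unit serial/UUID */
	uint16_t alua_group_id;	/* ALUA target port group of this node */
	uint8_t	 alua_state;	/* active/optimized, standby, ... */
	uint32_t block_size;	/* READ CAPACITY data */
	uint64_t num_blocks;
	uint8_t	 tas, qerr, tst; /* settings that change TMF behavior */
};
</code></pre>
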
- Configuration:

TODO. GUI/tools for setup/configuration.

h3. Work items

h4. Coding tasks

# Task 1
# Task 2
# Task 3

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3