h3. Clustering a few NAS into a Ceph cluster
A NAS has at least 4GB of RAM, a network interface, and a single file system supported by Ceph. Each NAS has the Ceph mon and OSD binaries installed. The NAS is assumed to provide the user with a way to connect to its IP, on which a web server runs. A single page on this web server is dedicated to Ceph and contains:

* The admin key of the cluster
* An input field where an admin key can be entered
* The value of the mon host configuration option (i.e. the list of IPs of the monitors)
* An input field where a monitor IP can be entered
* A capacity meter that shows the total number of bytes in the cluster and the percentage of space used. If the cluster is inactive for some reason, the capacity meter is dimmed.
* A *health* indicator (see the sketch after this list) that is
** green if the pool size > 1 and the cluster health is ok. It means any NAS in the cluster can be lost without losing data.
** orange otherwise. It could mean there are degraded objects, but any condition other than green is assumed to be orange.
** red if there is at least one unfound object: it can only be removed by manual intervention.
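As a rough illustration, the indicator could be computed on the NAS with something like the sketch below. It assumes the ceph CLI and the admin keyring are available locally and relies on the plain-text output of ceph health and ceph osd pool get, which may vary between releases.

<pre><code class="bash">
#!/bin/bash
# Hedged sketch: compute the web page health indicator (green / orange / red).

indicator() {
    # red: at least one unfound object, manual intervention is required
    if ceph health detail | grep -q unfound ; then
        echo red
        return
    fi
    # orange if any pool keeps a single replica: losing a NAS would lose data
    local pool size
    for pool in $(ceph osd pool ls) ; do
        size=$(ceph osd pool get $pool size | awk '{print $2}')
        if test "$size" -le 1 ; then
            echo orange
            return
        fi
    done
    # green only when the cluster itself reports HEALTH_OK
    if ceph health | grep -q HEALTH_OK ; then
        echo green
    else
        echo orange   # degraded objects or any other non-green condition
    fi
}

indicator
</code></pre>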
h3. First NAS
When a NAS comes up for the first time, a mon and an OSD are created locally. When the user connects to the Ceph web page, the admin key and IP:PORT of this cluster are displayed. The capacity meter shows how much space is available in this cluster. Since it contains only one OSD, it can only be used to create pools with one replica, which is convenient for testing even though it is not useful for actual usage. The bootstrap steps are listed below and consolidated in a sketch after the list.

* ( echo '[global]' ; echo "fsid = $(uuidgen)" ) > /etc/ceph/ceph.conf
* echo "mon host = $(getent hosts $(hostname) | cut -f1 -d' ')" >> /etc/ceph/ceph.conf
* mon_name=a
* mon_data=/var/lib/ceph/mon/ceph-$mon_name
* keyring=$mon_data/keyring
* mkdir -p $mon_data
* ceph-authtool --create-keyring $keyring --gen-key --name=mon. --cap mon 'allow *'
* ceph-mon --id $mon_name --keyring $keyring --mkfs --mon-data $mon_data
* touch $mon_data/done $mon_data/sysvinit
* /etc/init.d/ceph start mon
* copy the [mon.] key from $keyring into a copy of the /etc/ceph/ceph.client.admin.keyring file, in place of the [client.admin] key, run ceph auth import on the copy, then move the copy to /etc/ceph/ceph.client.admin.keyring: the [mon.] and [client.admin] keys are then the same
* otherwise follow http://ceph.com/docs/master/install/...al-deployment/
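Put together, the first boot could look roughly like the sketch below. The mon name, the sysvinit layout and the way the mon. secret is reused for client.admin are assumptions consistent with the steps above, not a tested procedure.

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: bootstrap a one-mon cluster on the first NAS.

mon_name=a
mon_data=/var/lib/ceph/mon/ceph-$mon_name
keyring=$mon_data/keyring

# minimal ceph.conf: a fresh fsid and this NAS as the only monitor
( echo '[global]'
  echo "fsid = $(uuidgen)"
  echo "mon host = $(getent hosts $(hostname) | cut -f1 -d' ')"
) > /etc/ceph/ceph.conf

# create and start the monitor
mkdir -p $mon_data
ceph-authtool --create-keyring $keyring --gen-key --name=mon. --cap mon 'allow *'
ceph-mon --id $mon_name --keyring $keyring --mkfs --mon-data $mon_data
touch $mon_data/done $mon_data/sysvinit
/etc/init.d/ceph start mon

# give client.admin the same secret as mon. so that the web page only has to
# display a single key (see the Notes section)
secret=$(ceph-authtool --print-key --name=mon. $keyring)
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring \
    --name=client.admin --add-key=$secret \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
ceph --name mon. --keyring $keyring auth import -i /etc/ceph/ceph.client.admin.keyring

# the local OSD is then created following the manual deployment documentation
</code></pre>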
h3. Second NAS
The second NAS creates a cluster in the same way as the first. It is on the same subnet as the first NAS but they are two independent Ceph clusters. The user goes to the web interface of the second NAS and enters the admin key and the IP of the first cluster. The second NAS creates a new OSD and a new mon, connected to the cluster of the first NAS. The capacity meter increases and confirms the connection is effective.
The cluster that was created when the second NAS booted is deactivated and the cluster of the first NAS becomes the default. The web page of the second NAS now shows the value of the mon host configuration option, which includes the IP of the first NAS and the IP of the second NAS. It also shows the admin key of the first NAS. A given NAS can only be part of one cluster at a time. The join steps are outlined below and sketched as a script after the list.

* key=$(wget the admin / mon key from the first NAS)
* ip=IP of the first NAS
* trash or archive the existing mon / OSD
* create /etc/ceph/ceph.client.admin.keyring with the key
* get the fsid with ceph fsid (or ceph -s) and create /etc/ceph/ceph.conf with mon host = $ip
* use http://docs.ceph.com/docs/master/rad...dd-or-rm-mons/ to add the new mon
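A minimal sketch of the join, assuming the first NAS serves the admin key and the fsid at the hypothetical URLs /ceph/key and /ceph/fsid (the actual web service endpoints are not defined here):

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: attach this NAS to the cluster of the first NAS.
# /ceph/key and /ceph/fsid are hypothetical web service endpoints.

ip=$1                                    # IP of the first NAS, pasted by the user
key=$(wget -qO- http://$ip/ceph/key)     # admin / mon key of the first cluster
fsid=$(wget -qO- http://$ip/ceph/fsid)   # fsid of the first cluster

# retire the single-NAS cluster created at boot
/etc/init.d/ceph stop
mv /var/lib/ceph /var/lib/ceph.archived
mkdir -p /var/lib/ceph/mon /var/lib/ceph/osd /var/lib/ceph/tmp

# point this NAS to the existing cluster
( echo '[global]'
  echo "fsid = $fsid"
  echo "mon host = $ip"
) > /etc/ceph/ceph.conf
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring \
    --name=client.admin --add-key=$key \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'

ceph -s   # should now report the cluster of the first NAS

# a new mon and a new OSD are then added following the add-or-rm-mons
# documentation linked above, which makes the capacity meter grow
</code></pre>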
h3. Third NAS
The third NAS can join the cluster composed of the first two NAS using the same method as the second NAS. It does not matter which NAS is used to copy / paste the admin key or the IP. The admin key is the same and any mon can be used to join the cluster.
h3. One NAS is down
Let's say the three NAS cluster is used to store data in a replicated pool with two copies of each object. After mon_osd_down_out_interval seconds the missing OSD is marked out and the other two start re-replicating objects. The capacity meter in the web interface of either of the two remaining NAS shows an increase in the percentage of used space and a decrease of the cluster size.
When and if the missing NAS reconnects, its mon will rejoin the quorum, its OSD will become available again and data will be redistributed.
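For reference, the timing and the recovery can be observed from one of the surviving NAS with commands like these (mon.a is an assumption for the local mon name):

<pre><code class="bash">
ceph daemon mon.a config get mon_osd_down_out_interval  # 300 seconds by default
ceph osd tree   # the OSD of the missing NAS shows as down, then out
ceph -w         # follow the re-replication and the shrinking cluster size
</code></pre>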
h3. Two NAS are down
Since only one mon out of three is left, a quorum cannot be formed and the cluster stops, waiting for another mon to show up. When one of the two missing mons reconnects, IO can resume, as in the scenario where one NAS is down.
h3. The IP of a NAS changes
When the network interface is assigned an IP, the NAS compares it to the IP of the active monitor. If it is not the same, presumably because the DHCP server assigned a new one, a new mon is created and the former one is removed. The OSD remains the same.
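A sketch of the mon replacement, assuming the cluster still has other reachable monitors listed in mon host and reusing the names from the first NAS section; the actual procedure is the one described in the add-or-rm-mons documentation:

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: re-create the local mon after the NAS got a new IP.

mon_name=a
mon_data=/var/lib/ceph/mon/ceph-$mon_name
new_ip=$(getent hosts $(hostname) | cut -f1 -d' ')

# remove the mon registered with the old IP
/etc/init.d/ceph stop mon
ceph mon remove $mon_name
rm -rf $mon_data

# re-create it with the same name on the new IP
mkdir -p $mon_data
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon --id $mon_name --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring --mon-data $mon_data
touch $mon_data/done $mon_data/sysvinit
ceph mon add $mon_name $new_ip
/etc/init.d/ceph start mon

# the OSD keeps its data and simply binds to the new address when restarted
</code></pre>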
h3. The IPs of all NAS change
The user goes to the web interface of one NAS and enters the IP of another NAS. They will pair as they did when they were first configured. The only difference is that they already have an OSD and a configuration for the cluster id: those won't be overridden. The only information required to resume operations is the IP of the other monitors.
h3. Overlay L2 network
To accommodate NAS that are behind a NAT, all MONs and OSDs are connected through a tinc L2 overlay network. The IP:PORT displayed on the web interface is the public IP of the MON, even if it really is behind a NAT.
* If not behind a NAT, it is used as is.
* If behind a NAT and IP:PORT is forwarded to the MON by the firewall doing the NAT, it can be used as is.
* If behind a NAT and IP:PORT is not forwarded, it cannot be used by anyone to join the cluster.

When a NAS tries to join:
* If the IP is public, use it for the MON
* If the IP is private, use it if the MON is on the same subnet, otherwise fail (a sketch of this check follows the list)
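A minimal sketch of that check; the RFC 1918 ranges are used to decide whether an IP is private, and the subnet comparison below is a simplistic /24 match (an assumption, the real check should honour the netmask):

<pre><code class="bash">
#!/bin/bash
# Hedged sketch: decide whether the entered MON IP can be used to join.

is_private() {
    case "$1" in
        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

same_subnet() {
    local mine=$(getent hosts $(hostname) | cut -f1 -d' ')
    test "${1%.*}" = "${mine%.*}"   # naive /24 comparison
}

mon_ip=$1
if ! is_private $mon_ip ; then
    echo "public IP: use it to reach the MON"
elif same_subnet $mon_ip ; then
    echo "private IP on the same subnet: use it to reach the MON"
else
    echo "private IP on another subnet: fail"
    exit 1
fi
</code></pre>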
The IP:PORT serves a web service that returns the tinc configuration to use (to be untarred in /etc/tinc) to connect. When the tinc configuration is requested, three parameters are provided:
* the cluster key
* the NAS uuid
* the NAS IP
The web service does:
* osd_id=$(ceph osd create $uuid)
* nasip=$subnet.$osd_id
* create a tinc entry for the NAS
The web service returns:
* the tar of the /etc/tinc directory
Tinc is started once /etc/tinc has been untarred (a sketch of the exchange follows).
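A rough sketch of the server side of that exchange; the overlay prefix, the tinc network name ceph and the host file layout are assumptions, and the tinc key management is left out:

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: bootstrap web service. Arguments: cluster key, NAS uuid, NAS IP.
# The cluster key is assumed to have been verified by the calling web server.

cluster_key=$1 ; uuid=$2 ; nas_ip=$3
subnet=10.123.0        # /24 prefix of the tinc overlay, an arbitrary assumption

# reserve an OSD id for the NAS; it doubles as the last byte of its overlay IP
osd_id=$(ceph osd create $uuid)
nasip=$subnet.$osd_id

# create a tinc host entry for the NAS in the shared /etc/tinc (CephFS mounted)
cat > /etc/tinc/ceph/hosts/nas$osd_id <<EOF
Address = $nas_ip
Subnet = $nasip/32
EOF

# return the whole tinc configuration: the NAS untars it in /etc/tinc and starts tinc
tar -C /etc -c tinc
</code></pre>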
All OSDs are configured with the cluster/public network set to the subnet of the tinc interface (this can be optimized by adding routes so that tinc is not used when peers are on the same LAN, reducing the overhead).
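For instance, with the hypothetical 10.123.0.0/24 overlay used in the sketch above, every ceph.conf would carry something like:

<pre>
[global]
    # send all replication and client traffic over the tinc interface
    public network = 10.123.0.0/24
    cluster network = 10.123.0.0/24
</pre>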
The /etc/tinc directory is a CephFS mounted file system.
h3. Web service
Used by:
* the user interface page
* bootstrap to get the tinc configuration
* updates of /etc/ceph/ceph.conf
h3. Updating /etc/ceph/ceph.conf:mon host
The host answering the web service call to create / destroy a mon calls the web service of each host to update the mon host line with the output of ceph mon_status. This update may take time: a dead MON implies a delay when a client tries it and fails with a timeout, but that will not impact the MON, OSD or MDS operations.
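A sketch of the per-host update, assuming mon host lives on its own line in /etc/ceph/ceph.conf; ceph mon dump is used here instead of ceph mon_status because its plain-text output is easier to parse in a shell sketch, and the parsing is an assumption about that output format:

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: rewrite the mon host line from the current monitor map.
# Called on every host by the web service after a mon is created or destroyed.

# collect the current monitor addresses, e.g. "10.123.0.1:6789 10.123.0.2:6789"
mon_hosts=$(ceph mon dump 2>/dev/null | sed -n 's/^[0-9]*: \([0-9][0-9.:]*\).*/\1/p' | tr '\n' ' ')

sed -i "s|^mon host = .*|mon host = $mon_hosts|" /etc/ceph/ceph.conf
</code></pre>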
h3. Notes
* the [mon.] and [client.admin] keys are made the same with *ceph auth export + ceph auth import*, otherwise it would be necessary to copy two keys instead of one from one NAS to another