h3. Clustering a few NAS into a Ceph cluster
A NAS has at least 4GB of RAM, a network interface, and a single file system supported by Ceph. Each NAS has the Ceph mon and OSD binaries installed. The NAS is assumed to provide the user with a way to connect to its IP, on which a web server runs. A single page on this web server is dedicated to Ceph and contains:

* The admin key of the cluster
* An input field where an admin key can be entered
* The value of the mon host configuration option (i.e. the list of IPs of the monitors)
* An input field where a monitor IP can be entered
* A capacity meter that shows the total number of bytes in the cluster and the percentage of space used. If the cluster is inactive for some reason, the capacity meter is dimmed.
* A *health* indicator (see the sketch after this list) that is
** green if the pool size > 1 and the cluster health is ok. It means any NAS in the cluster can be lost without losing data.
** orange otherwise. It could mean there are degraded objects, but any condition other than green is assumed to be orange.
** red if there is at least one unfound object: it can only be removed by manual intervention.
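As a rough illustration, the indicator could be computed on the NAS with something like the sketch below. It assumes the ceph CLI and the admin keyring are available locally and relies on the plain-text output of ceph health and ceph osd pool get, which may vary between releases.

<pre><code class="bash">
#!/bin/bash
# Hedged sketch: compute the web page health indicator (green / orange / red).

indicator() {
    # red: at least one unfound object, manual intervention is required
    if ceph health detail | grep -q unfound ; then
        echo red
        return
    fi
    # orange if any pool keeps a single replica: losing a NAS would lose data
    local pool size
    for pool in $(ceph osd pool ls) ; do
        size=$(ceph osd pool get $pool size | awk '{print $2}')
        if test "$size" -le 1 ; then
            echo orange
            return
        fi
    done
    # green only when the cluster itself reports HEALTH_OK
    if ceph health | grep -q HEALTH_OK ; then
        echo green
    else
        echo orange   # degraded objects or any other non-green condition
    fi
}

indicator
</code></pre>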
h3. First NAS
When a NAS comes up for the first time, a mon and an OSD are created locally. When the user connects to the Ceph web page, the admin key and IP:PORT of this cluster are displayed. The capacity meter shows how much space is available in this cluster. Since it contains only one OSD, it can only be used to create pools with one replica, which is convenient for testing even though it is not useful for actual usage. The bootstrap steps are listed below and consolidated in a sketch after the list.

* ( echo '[global]' ; echo "fsid = $(uuidgen)" ) > /etc/ceph/ceph.conf
* echo "mon host = $(getent hosts $(hostname) | cut -f1 -d' ')" >> /etc/ceph/ceph.conf
* mon_name=a
* mon_data=/var/lib/ceph/mon/ceph-$mon_name
* keyring=$mon_data/keyring
* mkdir -p $mon_data
* ceph-authtool --create-keyring $keyring --gen-key --name=mon. --cap mon 'allow *'
* ceph-mon --id $mon_name --keyring $keyring --mkfs --mon-data $mon_data
* touch $mon_data/done $mon_data/sysvinit
* /etc/init.d/ceph start mon
* copy the [mon.] key from $keyring into a copy of the /etc/ceph/ceph.client.admin.keyring file, in place of the [client.admin] key, run ceph auth import on the copy, then move the copy to /etc/ceph/ceph.client.admin.keyring: the [mon.] and [client.admin] keys are then the same
* otherwise follow http://ceph.com/docs/master/install/...al-deployment/
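Put together, the first boot could look roughly like the sketch below. The mon name, the sysvinit layout and the way the mon. secret is reused for client.admin are assumptions consistent with the steps above, not a tested procedure.

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: bootstrap a one-mon cluster on the first NAS.

mon_name=a
mon_data=/var/lib/ceph/mon/ceph-$mon_name
keyring=$mon_data/keyring

# minimal ceph.conf: a fresh fsid and this NAS as the only monitor
( echo '[global]'
  echo "fsid = $(uuidgen)"
  echo "mon host = $(getent hosts $(hostname) | cut -f1 -d' ')"
) > /etc/ceph/ceph.conf

# create and start the monitor
mkdir -p $mon_data
ceph-authtool --create-keyring $keyring --gen-key --name=mon. --cap mon 'allow *'
ceph-mon --id $mon_name --keyring $keyring --mkfs --mon-data $mon_data
touch $mon_data/done $mon_data/sysvinit
/etc/init.d/ceph start mon

# give client.admin the same secret as mon. so that the web page only has to
# display a single key (see the Notes section)
secret=$(ceph-authtool --print-key --name=mon. $keyring)
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring \
    --name=client.admin --add-key=$secret \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
ceph --name mon. --keyring $keyring auth import -i /etc/ceph/ceph.client.admin.keyring

# the local OSD is then created following the manual deployment documentation
</code></pre>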
h3. Second NAS
The second NAS creates a cluster in the same way as the first. It is on the same subnet as the first NAS but they are two independent Ceph clusters. The user goes to the web interface of the second NAS and enters the admin key and the IP of the first cluster. The second NAS creates a new OSD and a new mon, connected to the cluster of the first NAS. The capacity meter increases and confirms the connection is effective.
The cluster that was created when the second NAS booted is deactivated and the cluster of the first NAS becomes the default. The web page of the second NAS now shows the value of the mon host configuration option, which includes the IP of the first NAS and the IP of the second NAS. It also shows the admin key of the first NAS. A given NAS can only be part of one cluster at a time. The join steps are outlined below and sketched as a script after the list.

* key=$(wget the admin / mon key from the first NAS)
* ip=IP of the first NAS
* trash or archive the existing mon / OSD
* create /etc/ceph/ceph.client.admin.keyring with the key
* get the fsid with ceph fsid (or ceph -s) and create /etc/ceph/ceph.conf with mon host = $ip
* use http://docs.ceph.com/docs/master/rad...dd-or-rm-mons/ to add the new mon
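A minimal sketch of the join, assuming the first NAS serves the admin key and the fsid at the hypothetical URLs /ceph/key and /ceph/fsid (the actual web service endpoints are not defined here):

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: attach this NAS to the cluster of the first NAS.
# /ceph/key and /ceph/fsid are hypothetical web service endpoints.

ip=$1                                    # IP of the first NAS, pasted by the user
key=$(wget -qO- http://$ip/ceph/key)     # admin / mon key of the first cluster
fsid=$(wget -qO- http://$ip/ceph/fsid)   # fsid of the first cluster

# retire the single-NAS cluster created at boot
/etc/init.d/ceph stop
mv /var/lib/ceph /var/lib/ceph.archived
mkdir -p /var/lib/ceph/mon /var/lib/ceph/osd /var/lib/ceph/tmp

# point this NAS to the existing cluster
( echo '[global]'
  echo "fsid = $fsid"
  echo "mon host = $ip"
) > /etc/ceph/ceph.conf
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring \
    --name=client.admin --add-key=$key \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'

ceph -s   # should now report the cluster of the first NAS

# a new mon and a new OSD are then added following the add-or-rm-mons
# documentation linked above, which makes the capacity meter grow
</code></pre>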
h3. Third NAS
The third NAS can join the cluster composed of the first two NAS using the same method as the second NAS. It does not matter which NAS is used to copy / paste the admin key or the IP. The admin key is the same and any mon can be used to join the cluster.
h3. One NAS is down
Let's say the three NAS cluster is used to store data in a replicated pool with two copies of each object. After mon_osd_down_out_interval seconds the missing OSD is marked out and the other two start re-replicating objects. The capacity meter in the web interface of either of the two remaining NAS shows an increase in the percentage of used space and a decrease of the cluster size.
When and if the missing NAS reconnects, its mon will rejoin the quorum, its OSD will become available again and data will be redistributed.
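For reference, the timing and the recovery can be observed from one of the surviving NAS with commands like these (mon.a is an assumption for the local mon name):

<pre><code class="bash">
ceph daemon mon.a config get mon_osd_down_out_interval  # 300 seconds by default
ceph osd tree   # the OSD of the missing NAS shows as down, then out
ceph -w         # follow the re-replication and the shrinking cluster size
</code></pre>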
h3. Two NAS are down
Since only one mon out of three is left, a quorum cannot be formed and the cluster stops, waiting for another mon to show up. When one of the two missing mons reconnects, IO can resume, as in the scenario where one NAS is down.
h3. The IP of a NAS changes
When the network interface is assigned an IP, the NAS compares it to the IP of the active monitor. If it is not the same, presumably because the DHCP server assigned a new one, a new mon is created and the former one is removed. The OSD remains the same.
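A sketch of the mon replacement, assuming the cluster still has other reachable monitors listed in mon host and reusing the names from the first NAS section; the actual procedure is the one described in the add-or-rm-mons documentation:

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: re-create the local mon after the NAS got a new IP.

mon_name=a
mon_data=/var/lib/ceph/mon/ceph-$mon_name
new_ip=$(getent hosts $(hostname) | cut -f1 -d' ')

# remove the mon registered with the old IP
/etc/init.d/ceph stop mon
ceph mon remove $mon_name
rm -rf $mon_data

# re-create it with the same name on the new IP
mkdir -p $mon_data
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon --id $mon_name --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring --mon-data $mon_data
touch $mon_data/done $mon_data/sysvinit
ceph mon add $mon_name $new_ip
/etc/init.d/ceph start mon

# the OSD keeps its data and simply binds to the new address when restarted
</code></pre>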
h3. The IPs of all NAS change
The user goes to the web interface of one NAS and enters the IP of another NAS. They will pair as they did when they were first configured. The only difference is that they already have an OSD and a configuration for the cluster id: those won't be overridden. The only information required to resume operations is the IP of the other monitors.
h3. Overlay L2 network
To accommodate NAS that are behind a NAT, all MONs and OSDs are connected through a tinc L2 overlay network. The IP:PORT displayed on the web interface is the public IP of the MON, even if it really is behind a NAT.
* If not behind a NAT, it is used as is.
* If behind a NAT and IP:PORT is forwarded to the MON by the firewall doing the NAT, it can be used as is.
* If behind a NAT and IP:PORT is not forwarded, it cannot be used by anyone to join the cluster.

When a NAS tries to join:
* If the IP is public, use it for the MON
* If the IP is private, use it if the MON is on the same subnet, otherwise fail (a sketch of this check follows the list)
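A minimal sketch of that check; the RFC 1918 ranges are used to decide whether an IP is private, and the subnet comparison below is a simplistic /24 match (an assumption, the real check should honour the netmask):

<pre><code class="bash">
#!/bin/bash
# Hedged sketch: decide whether the entered MON IP can be used to join.

is_private() {
    case "$1" in
        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

same_subnet() {
    local mine=$(getent hosts $(hostname) | cut -f1 -d' ')
    test "${1%.*}" = "${mine%.*}"   # naive /24 comparison
}

mon_ip=$1
if ! is_private $mon_ip ; then
    echo "public IP: use it to reach the MON"
elif same_subnet $mon_ip ; then
    echo "private IP on the same subnet: use it to reach the MON"
else
    echo "private IP on another subnet: fail"
    exit 1
fi
</code></pre>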
The IP:PORT serves a web service that returns the tinc configuration to use (to be untarred in /etc/tinc) to connect. When the tinc configuration is requested, three parameters are provided:
* the cluster key
* the NAS uuid
* the NAS IP
The web service does:
* osd_id=$(ceph osd create $uuid)
* nasip=$subnet.$osd_id
* create a tinc entry for the NAS
The web service returns:
* the tar of the /etc/tinc directory
Tinc is started once /etc/tinc has been untarred (a sketch of the exchange follows).
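A rough sketch of the server side of that exchange; the overlay prefix, the tinc network name ceph and the host file layout are assumptions, and the tinc key management is left out:

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: bootstrap web service. Arguments: cluster key, NAS uuid, NAS IP.
# The cluster key is assumed to have been verified by the calling web server.

cluster_key=$1 ; uuid=$2 ; nas_ip=$3
subnet=10.123.0        # /24 prefix of the tinc overlay, an arbitrary assumption

# reserve an OSD id for the NAS; it doubles as the last byte of its overlay IP
osd_id=$(ceph osd create $uuid)
nasip=$subnet.$osd_id

# create a tinc host entry for the NAS in the shared /etc/tinc (CephFS mounted)
cat > /etc/tinc/ceph/hosts/nas$osd_id <<EOF
Address = $nas_ip
Subnet = $nasip/32
EOF

# return the whole tinc configuration: the NAS untars it in /etc/tinc and starts tinc
tar -C /etc -c tinc
</code></pre>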
All OSDs are configured with the cluster/public network set to the subnet of the tinc interface (this can be optimized by adding routes so that tinc is not used when peers are on the same LAN, reducing the overhead).
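For instance, with the hypothetical 10.123.0.0/24 overlay used in the sketch above, every ceph.conf would carry something like:

<pre>
[global]
    # send all replication and client traffic over the tinc interface
    public network = 10.123.0.0/24
    cluster network = 10.123.0.0/24
</pre>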
The /etc/tinc directory is a CephFS mounted file system.
h3. Web service
Used by:
* the user interface page
* bootstrap to get the tinc configuration
* updates of /etc/ceph/ceph.conf
h3. Updating /etc/ceph/ceph.conf:mon host
The host answering the web service call to create / destroy a mon calls the web service of each host to update the mon host line with the output of ceph mon_status. This update may take time: a dead MON implies a delay when a client tries it and fails with a timeout, but that will not impact the MON, OSD or MDS operations.
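A sketch of the per-host update, assuming mon host lives on its own line in /etc/ceph/ceph.conf; ceph mon dump is used here instead of ceph mon_status because its plain-text output is easier to parse in a shell sketch, and the parsing is an assumption about that output format:

<pre><code class="bash">
#!/bin/bash -e
# Hedged sketch: rewrite the mon host line from the current monitor map.
# Called on every host by the web service after a mon is created or destroyed.

# collect the current monitor addresses, e.g. "10.123.0.1:6789 10.123.0.2:6789"
mon_hosts=$(ceph mon dump 2>/dev/null | sed -n 's/^[0-9]*: \([0-9][0-9.:]*\).*/\1/p' | tr '\n' ' ')

sed -i "s|^mon host = .*|mon host = $mon_hosts|" /etc/ceph/ceph.conf
</code></pre>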
h3. Notes
* the [mon.] and [client.admin] keys are made the same with *ceph auth export + ceph auth import*, otherwise it would be necessary to copy two keys instead of one from one NAS to another