Clustering a few NAS into a Ceph cluster

A NAS has at least 4GB of RAM, a network interface, and a file system supported by Ceph. Each NAS has the Ceph mon and OSD binaries installed. The NAS is assumed to provide the user with a way to connect to its IP, where a web server runs. A single page on this web server is dedicated to Ceph and contains:

  • The admin key of the cluster
  • An input field where an admin key can be entered
  • The value of the mon host configuration value (i.e. the list of the monitors' IPs)
  • An input field where a monitor IP can be entered
  • A capacity meter that shows the total number of bytes in the cluster and the percentage of space used. If the cluster is inactive for some reason, the capacity meter is dimmed.
  • A health indicator that is
    • green if the pool size > 1 and the cluster health is OK. It means any NAS in the cluster can be lost without losing data.
    • red if there is at least one unfound object: this can only be resolved by manual intervention.
    • orange otherwise. It could mean there are degraded objects, but any condition other than green or red is assumed to be orange.
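The indicator logic above can be sketched as a shell function. The health string, pool size and unfound-object count are assumed to be gathered elsewhere (e.g. from ceph health and ceph osd pool get size), so they are plain parameters here:

```shell
#!/bin/sh
# Sketch of the health indicator mapping; inputs are assumed to be
# collected elsewhere from the cluster.
health_color() {
    health=$1     # e.g. HEALTH_OK, HEALTH_WARN
    pool_size=$2  # number of replicas in the pool
    unfound=$3    # number of unfound objects
    if [ "$unfound" -gt 0 ]; then
        echo red    # manual intervention required
    elif [ "$health" = HEALTH_OK ] && [ "$pool_size" -gt 1 ]; then
        echo green  # any single NAS can be lost without losing data
    else
        echo orange # degraded or otherwise not fully redundant
    fi
}

health_color HEALTH_OK 2 0    # green
health_color HEALTH_WARN 2 0  # orange
health_color HEALTH_OK 2 5    # red
```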

First NAS

When a NAS comes up for the first time, a mon and an OSD are created locally. When the user connects to the Ceph web page, the admin key and IP:PORT of this cluster are displayed. The capacity meter shows how much space is available on this cluster. Since it contains only one OSD, it can only be used to create pools with a single replica, which is convenient for testing even though it is not useful for actual usage.

  • (echo '[global]' ; echo "fsid = $(uuidgen)") > /etc/ceph/ceph.conf
  • echo "mon host = $(getent hosts $(hostname) | cut -f1 -d' ')" >> /etc/ceph/ceph.conf
  • mon_name=a
  • mon_data=/var/lib/ceph/mon/ceph-$mon_name
  • keyring=$mon_data/keyring
  • ceph-authtool --create-keyring $keyring --gen-key --name=mon. --cap mon 'allow *'
  • ceph-mon --id $mon_name --keyring $keyring --mkfs --mon-data $mon_data
  • touch $mon_data/done $mon_data/sysvinit
  • /etc/init.d/ceph start mon
  • copy the [mon.] key from $keyring into a copy of the /etc/ceph/ceph.client.admin.keyring file, in place of the [client.admin] key; ceph auth import the copy, then move the copy to /etc/ceph/ceph.client.admin.keyring. The [mon.] and [client.admin] keys are the same.
  • otherwise follow

Second NAS

The second NAS creates a cluster in the same way as the first. It is on the same subnet as the first NAS, but they are two independent Ceph clusters. The user goes to the web interface of the second NAS and enters the admin key and the IP of the first cluster. The second NAS creates a new OSD and a new mon, connected to the cluster of the first NAS. The capacity meter increases and confirms the connection is effective.
The cluster that was created when the second NAS booted is deactivated and the cluster of the first NAS becomes the default. The web page of the second NAS now shows the value of the mon host configuration value, which includes the IP of the first NAS and the IP of the second NAS. It also shows the admin key of the first NAS. A given NAS can only be part of one cluster at a time.

  • key=$(wget the admin / mon key from the first NAS)
  • ip=<the IP of the first NAS>
  • trash or archive the existing mon / OSD
  • create /etc/ceph/ceph.client.admin.keyring with the key
  • get the fsid with ceph -s (or similar) and create /etc/ceph/ceph.conf with mon host = $ip
  • use

Third NAS

The third NAS can join the cluster composed of the first two NAS using the same method as the second NAS. It does not matter which NAS is used to copy / paste the admin key or the IP. The admin key is the same and any mon can be used to join the cluster.

One NAS is down

Let's say the three NAS cluster is used to store data in a replicated pool with two copies of each object. When one NAS goes down, after osd_down_out_interval seconds its OSD is marked out and the other two start re-replicating objects. The capacity meter in the web interface of either of the two remaining NAS shows an increase in the percentage of used space and a decrease of the cluster size.
When and if the missing NAS reconnects, the mon will rejoin the quorum, the OSD will become available again and data will be redistributed.
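The effect on the capacity meter can be illustrated with simple arithmetic; the 4000GB-per-OSD figure and the 500GB of stored data are assumptions for the example, not values from the design:

```shell
#!/bin/sh
# Illustrative arithmetic only: three equal OSDs of 4000GB, a size=2
# pool holding 500GB of data (1000GB raw). Losing one NAS shrinks the
# raw capacity while the used bytes stay the same, so the percentage
# shown by the capacity meter rises.
osd_gb=4000
used_gb=1000   # 2 copies of 500GB

total_gb=$((3 * osd_gb))
echo "3 NAS up: ${total_gb}GB total, $((100 * used_gb / total_gb))% used"

total_gb=$((2 * osd_gb))
echo "2 NAS up: ${total_gb}GB total, $((100 * used_gb / total_gb))% used"
```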

Two NAS are down

Since only one mon out of three remains, a quorum cannot be formed and the cluster stops, waiting for another mon to show up. When one of the two missing mons reconnects, IO can resume, as in the scenario where one NAS is down.
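The reason the cluster stops follows from the monitor quorum rule: a strict majority of the known monitors must be up. A minimal sketch:

```shell
#!/bin/sh
# A quorum requires a strict majority of the known monitors:
# 2*up > total.
has_quorum() {
    up=$1 total=$2
    if [ $((2 * up)) -gt "$total" ]; then echo yes; else echo no; fi
}

has_quorum 1 3  # no: one mon out of three cannot form a quorum
has_quorum 2 3  # yes: IO resumes once a second mon reconnects
```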

The IP of a NAS changes

When the network interface is assigned an IP, the NAS compares it to the IP of the active monitor. If it is not the same, presumably because the DHCP server decided so, a new mon is created and the former one is removed. The OSD remains the same.

The IP of all NAS change

The user goes to the web interface of one NAS and enters the IP of another NAS. They will pair as they did when they were first configured. The only difference is that they already have an OSD and a configuration for the cluster id: these won't be overwritten. The only information required to resume operations is the IP of the other monitors.

Overlay L2 network

To accommodate NAS that are behind a NAT, all MONs and OSDs are connected on a tinc L2 overlay network. The IP:PORT displayed on the web interface is the public IP of the MON, even if it really is behind a NAT.
  • If not behind a NAT, it is used as is.
  • If behind a NAT and IP:PORT is forwarded to the MON by the firewall doing the NAT, it can be used as is.
  • If behind a NAT and IP:PORT is not forwarded, it cannot be used by anyone to join the cluster.
When a NAS tries to join:
  • If the IP is public, use it for MON
  • If the IP is private, use it if the MON is on the same subnet, otherwise fail.
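The decision above can be sketched as follows; the private ranges are the RFC1918 blocks, and whether the NAS is on the same subnet as the MON is assumed to be determined elsewhere and passed as a flag:

```shell
#!/bin/sh
# Decide whether a candidate MON IP can be used to join the cluster.
# $1: the MON IP; $2: "yes" if the NAS is on the same subnet as the
# MON (that check is assumed to be done elsewhere).
usable_mon_ip() {
    ip=$1 same_subnet=$2
    case "$ip" in
        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*)
            # private (RFC1918): only reachable from the same subnet
            if [ "$same_subnet" = yes ]; then echo use; else echo fail; fi
            ;;
        *)
            echo use  # public: use it as is
            ;;
    esac
}

usable_mon_ip 8.8.4.4 no        # use
usable_mon_ip 192.168.1.5 yes   # use
usable_mon_ip 10.0.0.5 no       # fail
```
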
At this IP:PORT runs a web service that returns the tinc configuration to use (to be untarred in /etc/tinc) to connect. When asked for the tinc configuration, three parameters are provided:
  • the cluster key
  • the NAS uuid
  • the NAS IP
The web service does:
  • osd_id=$(ceph osd create $uuid)
  • nasip=$subnet.$osd_id
  • create an entry for the NAS
The web service returns
  • the tar of the /etc/tinc directory

Tinc is started once /etc/tinc is untarred

All OSDs are configured with cluster/public network using the subnet of the tinc interface (can be optimized by adding routes to not use tinc when on the same LAN and reduce the overhead).
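Assuming tinc assigns addresses in an overlay subnet such as 10.99.0.0/24 (an example value, not mandated by the design), the corresponding ceph.conf fragment would look like:

```ini
[global]
# example subnet of the tinc overlay interface; routes can be added
# later to bypass tinc for NAS on the same LAN
public network = 10.99.0.0/24
cluster network = 10.99.0.0/24
```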

The /etc/tinc directory is a CephFS mounted file system.

Web service

Used by
  • the user interface page
  • bootstrap to get the tinc configuration
  • updates of /etc/ceph/ceph.conf

Updating /etc/ceph/ceph.conf:mon hosts

The host answering the web service call to create / destroy a mon calls the web service of each host to update the mon host value with the output of ceph mon_status. This update may take time: a dead MON implies a delay when a client tries it and fails with a timeout, but that won't impact the MON, OSD or MDS operations.
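A minimal sketch of the per-host update; in the real service the monitor list would come from ceph mon_status, and the file would be /etc/ceph/ceph.conf rather than a temporary file, so both are hardcoded assumptions here:

```shell
#!/bin/sh
# Sketch: rewrite the "mon host" line of a ceph.conf with a fresh
# monitor list. $new_hosts would be extracted from `ceph mon_status`.
conf=$(mktemp)
printf '[global]\nmon host = 10.99.0.1\n' > "$conf"

new_hosts="10.99.0.1 10.99.0.2 10.99.0.3"
sed -i "s/^mon host = .*/mon host = $new_hosts/" "$conf"

grep '^mon host' "$conf"   # mon host = 10.99.0.1 10.99.0.2 10.99.0.3
rm -f "$conf"
```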


  • the [mon.] and [client.admin] keys are made the same with ceph auth export + ceph auth import, otherwise it would be necessary to copy two keys instead of one from one NAS to another
  • bits of implementation