A little over a year ago, I made a post about a new design for active/active NFS clustering. Since then, an implementation of it has been merged into nfs-ganesha.
The aim in this article is to deploy as simple a configuration as possible. We’ll be deploying two nfs-ganesha servers, layered on top of an existing CephFS cluster. The servers in this case are simple daemons running on regular hosts (VMs in my case, but this could be done on bare metal as well).
We will not be using any sort of traditional clustering software (e.g. pacemaker). This deployment is two independent NFS servers that are serving out the same backend filesystem. They will have separate addresses, and clients will need to be distributed between them manually.
“Failover” in this configuration will mean manually unmounting and remounting the other server on the client. If the server dies, we’ll just restart the whole thing.
For this, you’ll need an existing CephFS cluster. Setting one up is outside the scope of this article, but I deployed mine under kubernetes with rook. For this configuration, you really want to be running a Nautilus (v14) cluster, as there are many bugfixes and some new interfaces tailored for this use-case that make things much more reliable. At the time of this writing, nautilus is not yet released but should be in spring(-ish) 2019.
You’ll also need two hosts on which to run nfs-ganesha. Any sort of host will do. It’s even possible to run nfs-ganesha in an unprivileged docker or kubernetes container as it’s just a userland application that has no real need for local storage.
You can add more hosts if you like, but there is no need for quorum to handle split-brain scenarios here. Everything is coordinated via RADOS objects, so as long as the necessary objects are accessible by all of the cluster hosts they should be fine.
The nfs-ganesha packages need to be v2.7 or above. Earlier releases did not have the rados_cluster recovery backend driver. I used two virtual machines running Fedora Rawhide (the Fedora 30 prerelease). That way the packages I needed would be available via the standard package repos.
Install the OS on both hosts. I installed Fedora 29 and then updated to Fedora Rawhide from there. Install the necessary packages to run ganesha. For convenience, I also installed the ceph-common package, which allowed me to test the host’s ability to contact the ceph cluster:
# dnf install nfs-ganesha nfs-ganesha-ceph nfs-ganesha-rados-grace ceph-common
Configure the hosts to be able to access the ceph cluster. This part will depend on your particular ceph deployment. You’ll need the mon addresses at a minimum, and if you’re running with encryption enabled, the keyring file.
# cat /etc/ceph/ceph.conf
mon_host = 10.101.90.22:6790,10.110.5.39:6790,10.110.139.207:6790
keyring = /etc/ceph/keyring
Here’s my keyring file. This should be readable only by root (or whatever user the ganesha daemon will be running as):
# cat /etc/ceph/keyring
[client.admin]
	key = AQCafw5cEiphIBAAkmlazVh1O72yH+mXVrO6Tw==
Note that for this tutorial I’m not being at all careful about the cephx principals in use, and am just allowing the ganesha servers to use client.admin. For a real production deployment, you’d want to have the ganesha daemons use dedicated principals that just allow them the minimal access needed to the cluster.
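As a rough sketch, a dedicated principal for one of the ganesha daemons could be created along these lines. The principal name and the pool names in the capability strings here are illustrative and will need to match your deployment (I’m assuming a data pool named cephfs.a.data to go with the cephfs.a.meta metadata pool, and restricting metadata-pool access to the “ganesha” namespace used for recovery state):

```
# ceph auth get-or-create client.ganesha-a \
	mds 'allow rw' \
	mon 'allow r' \
	osd 'allow rw pool=cephfs.a.data, allow rw pool=cephfs.a.meta namespace=ganesha'
```

You’d then point the daemon at that principal’s keyring instead of the admin one.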
Once you have this set up, you should be able to run commands like this to verify that the cluster is operational:
$ rados lspools
Here we see the data and metadata pools for our CephFS.
Add a ganesha.conf file on both hosts. The nfs-ganesha-ceph package ships with a sample ceph.conf file that we can tweak for our purposes. Here’s mine with the comments stripped out:
grep -v '^\s*#' /etc/ganesha/ganesha.conf | grep -v '^\s*$'
NFS_CORE_PARAM
{
	Enable_NLM = false;
	Enable_RQUOTA = false;
	Protocols = 4;
}

NFSv4
{
	RecoveryBackend = rados_cluster;
	Minor_Versions = 1,2;
}

CACHEINODE
{
	Dir_Chunk = 0;
	NParts = 1;
	Cache_Size = 1;
}

EXPORT
{
	Protocols = 4;
	Transports = TCP;
	Path = /;
	Pseudo = /ceph/;
	Access_Type = RW;
	Attr_Expiration_Time = 0;

	FSAL {
		Name = CEPH;
	}
}

RADOS_KV
{
	pool = "cephfs.a.meta";
	namespace = "ganesha";
	nodeid = "a";
}
Note that we disable anything prior to NFSv4.1 here. This is because enabling legacy versions of NFS precludes us from lifting the grace period early, and lifting it early is a huge win on a server restart. There is also some tuning in there to disable most of ganesha’s caching. We’re already caching aggressively in the ceph libraries, so there’s no need to duplicate it here.
The rados_cluster recovery backend stores its info in RADOS objects. It’s not generally a lot of data, but we want it to be highly available. We piggyback onto the CephFS metadata pool for this purpose, and declare a new “ganesha” namespace in it to keep it distinct from cephfs objects.
Most of the configuration is identical between the hosts, but the nodeid setting in the RADOS_KV block requires special mention. It must be a unique string for each node in the cluster. Here, we’ve set it to “a”. The second host in this cluster has it set to “b”.
It’s possible to not set this, in which case nfs-ganesha will default to using the hostname. That’s a perfectly valid choice for nodeid, as long as it is unique in the environment and is persistent across reboots.
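For example, the RADOS_KV block on the second host differs only in its nodeid:

```
RADOS_KV
{
	pool = "cephfs.a.meta";
	namespace = "ganesha";
	nodeid = "b";
}
```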
Populate the Cluster Grace Database:
Now, we need to ensure that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. Ganesha uses this object to communicate its current state and needs with respect to the grace period.
The nfs-ganesha-rados-grace package contains a command-line tool for querying and manipulating this database. We’ll use that to create the DB and add both nodeids. On one of the hosts, we run:
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add a
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add b
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha
This creates the grace database and adds both “a” and “b” to it. The last command has it dump the current state. Newly added hosts are always considered to be enforcing the grace period so they both have the E flag set.
The “cur” and “rec” values show the current and recovery epochs, which is how we keep track of which hosts are allowed to perform recovery and when.
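The dump from that last command looks something like this (illustrative; the exact epoch values will vary):

```
cur=1 rec=0
======================================================
a	E
b	E
```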
Open firewall ports:
On Fedora, I needed to allow clients to get to the server. On both hosts:
# firewall-cmd --permanent --add-service=nfs
# firewall-cmd --reload
This opens port 2049 so that the server is accessible from other hosts.
Start the Servers:
Start up the server on each host:
# systemctl start nfs-ganesha
# systemctl enable nfs-ganesha
Created symlink /etc/systemd/system/multi-user.target.wants/nfs-ganesha.service → /usr/lib/systemd/system/nfs-ganesha.service.
If all went well, the ganesha.nfsd daemon should be running on both hosts. Now check the grace database:
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha
Note that both nodes at this point have cleared their E flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.
From here, you should be able to mount either server from clients and do normal NFS operations against them.
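For example, assuming the two servers are reachable as ganesha-a and ganesha-b (hypothetical hostnames), a client would mount the pseudo-root we exported above, and a manual “failover” is just an unmount and a remount of the other server:

```
# mount -t nfs -o vers=4.1 ganesha-a:/ceph /mnt/ceph

# umount /mnt/ceph
# mount -t nfs -o vers=4.1 ganesha-b:/ceph /mnt/ceph
```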
This configuration contains an implicit assumption that if one of these servers goes down, it will be restarted within 5 minutes. After 5 minutes, the Ceph MDS may cancel the session that the ganesha client held, along with all of the state associated with it. If the session’s caps (stateful objects) get cancelled before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.
This is one of the reasons Red Hat is focused on containerized deployments for this. Restarting a container is quite a bit cheaper and has a much smaller footprint than an entire VM.
This is just a very quick overview of how to deploy a cluster in as simple a way as possible. I don’t consider this to be a configuration that many people would want to use, but it illustrates what is possible with other, layered technologies to improve resiliency.