Deploying an Active/Active NFS Cluster over CephFS

A little over a year ago, I made a post about a new design for active/active NFS clustering. Since then, an implementation of it has been merged into nfs-ganesha.

Overview:

The aim of this article is to deploy as simple a configuration as possible. We’ll be deploying two nfs-ganesha servers, layered on top of an existing CephFS cluster. The servers in this case are simple daemons running on regular hosts (VMs in my case, but this could be done on bare metal as well).

We will not be using any sort of traditional clustering software (e.g. pacemaker). This deployment is two independent NFS servers that are serving out the same backend filesystem. They will have separate addresses, and clients will need to be distributed between them manually.

“Failover” in this configuration will mean manually unmounting and remounting the other server on the client. If the server dies, we’ll just restart the whole thing.
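
In practice, that manual failover is just an unmount of the dead server followed by a mount of the surviving one on the client. A rough sketch, assuming the client had “nfs-a” mounted at /mnt/ceph and is switching to “nfs-b” (the hostnames and the /ceph pseudo path are placeholders matching the export we’ll configure below):

# umount -fl /mnt/ceph
# mount -t nfs -o vers=4.1 nfs-b:/ceph /mnt/ceph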

Prerequisites:

For this, you’ll need an existing CephFS cluster. Setting one up is outside the scope of this article, but I deployed one under kubernetes with rook. For this configuration, you really want to be running a Nautilus (v14) cluster, as there are many bugfixes and some new interfaces tailored for this use-case that make things much more reliable. At the time of this writing, Nautilus is not yet released but should be in spring(-ish) 2019.

You’ll also need two hosts on which to run nfs-ganesha. Any sort of host will do. It’s even possible to run nfs-ganesha in an unprivileged docker or kubernetes container as it’s just a userland application that has no real need for local storage.

You can add more hosts if you like, but there is no need for quorum to handle split-brain scenarios here. Everything is coordinated via RADOS objects, so as long as the necessary objects are accessible by all of the cluster hosts they should be fine.

The nfs-ganesha packages need to be v2.7 or above. Earlier releases did not have the rados_cluster recovery backend driver. I used two virtual machines running Fedora Rawhide (the Fedora 30 prerelease). That way the packages I needed would be available via the standard package repos.
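
If you want to confirm what your distro’s repos will give you before installing, a quick check like this works:

$ dnf info nfs-ganesha

…and make sure the Version field shows 2.7 or later.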

Installation:

Install the OS on both hosts. I installed Fedora 29, and then updated to Fedora Rawhide from there. Install the necessary packages to run ganesha. For convenience, I also installed the ceph-common package, which allowed me to test the ability for the host to contact the ceph cluster:

# dnf install nfs-ganesha nfs-ganesha-ceph nfs-ganesha-rados-grace ceph-common

Configure Ceph:

Configure the hosts to be able to access the ceph cluster. This part will depend on your particular ceph deployment. You’ll need the mon addresses at a minimum, and if you’re running with cephx authentication enabled, the keyring file.

# cat /etc/ceph/ceph.conf 
[global]
mon_host = 10.101.90.22:6790,10.110.5.39:6790,10.110.139.207:6790

[client.admin]
keyring = /etc/ceph/keyring

Here’s my keyring file. This should be readable only by root (or whatever user the ganesha daemon will be running as):

# cat /etc/ceph/keyring 
[client.admin]
key = AQCafw5cEiphIBAAkmlazVh1O72yH+mXVrO6Tw==

Note that for this tutorial I’m not being at all careful about the cephx principals in use, and am just allowing the ganesha servers to use client.admin. For a real production deployment, you’d want to have the ganesha daemons use dedicated principals that just allow them the minimal access needed to the cluster.
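
If you do want a dedicated principal, something along these lines should be close. This is only a sketch: the “client.ganesha-a” name is made up, and the exact caps you need will depend on your pool names and ganesha version.

# ceph auth get-or-create client.ganesha-a \
    mds 'allow rw' \
    mon 'allow r' \
    osd 'allow rw pool=cephfs.a.data, allow rw pool=cephfs.a.meta namespace=ganesha'

The idea is to grant normal CephFS client access plus direct RADOS access to the “ganesha” namespace of the metadata pool, which the recovery backend uses (more on that below).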

Once you have this set up, you should be able to run commands like this to verify that the cluster is operational:

$ rados lspools
cephfs.a.meta
cephfs.a.data

Here we see the data and metadata pools for our CephFS.
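
You can also confirm which filesystem those pools belong to (my filesystem is named “a”, which is where the pool names come from):

$ ceph fs ls

…which lists each filesystem along with its metadata and data pools.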

Configure Ganesha:

Add a ganesha.conf file on both hosts. The nfs-ganesha-ceph package ships with a sample config file for the Ceph FSAL that we can tweak for our purposes. Here’s mine with the comments stripped out:

grep -v '^\s*#' /etc/ganesha/ganesha.conf | grep -v '^\s*$'
NFS_CORE_PARAM
{
Enable_NLM = false;
Enable_RQUOTA = false;
Protocols = 4;
}
NFSv4
{
RecoveryBackend = rados_cluster;
Minor_Versions = 1,2;
}
CACHEINODE {
Dir_Chunk = 0;
NParts = 1;
Cache_Size = 1;
}
EXPORT
{
Export_ID=100;
Protocols = 4;
Transports = TCP;
Path = /;
Pseudo = /ceph/;
Access_Type = RW;
Attr_Expiration_Time = 0;
FSAL {
Name = CEPH;
}
}
RADOS_KV
{
pool = "cephfs.a.meta";
namespace = "ganesha";
nodeid = "a";
}

Note that we disable anything prior to NFSv4.1 here. This is because enabling legacy versions of NFS would preclude us from being able to lift the grace period early, and that is a huge win on a server restart. There is also some tuning in there to disable most of ganesha’s caching. We’re already caching aggressively in the ceph libraries, so there’s no need to duplicate it here.

The rados_cluster recovery backend stores its info in RADOS objects. It’s not generally a lot of data, but we want it to be highly available. We piggyback onto the CephFS metadata pool for this purpose, and declare a new “ganesha” namespace in it to keep it distinct from cephfs objects.
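
These objects are easy to inspect with the rados tool, which can be handy for debugging. For example, once the grace database has been created (next section):

$ rados -p cephfs.a.meta -N ganesha ls

You should see the shared grace object there, and eventually the per-node recovery data as well.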

Cluster Nodeids:

Most of the configuration is identical between the hosts, but the nodeid setting in the RADOS_KV block requires special mention. It must be a unique string for each node in the cluster. Here, we’ve set it to “a”. The second host in this cluster has this set to “b”.

It’s possible to not set this, in which case nfs-ganesha will default to using the hostname. That’s a perfectly valid choice for nodeid, as long as it is unique in the environment and is persistent across reboots.
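
For comparison, here is the RADOS_KV block from the second host’s ganesha.conf; the only difference is the nodeid:

RADOS_KV
{
pool = "cephfs.a.meta";
namespace = "ganesha";
nodeid = "b";
}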

Populate the Cluster Grace Database:

Now, we need to ensure that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. Ganesha uses this object to communicate its current state and needs with respect to the grace period.

The nfs-ganesha-rados-grace package contains a command-line tool for querying and manipulating this database. We’ll use that to create the DB and add both nodeids. On one of the hosts, we run:

$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add a
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add b
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha
cur=1 rec=0
======================================================
a E
b E

This creates the grace database and adds both “a” and “b” to it. The last command has it dump the current state. Newly added hosts are always considered to be enforcing the grace period so they both have the E flag set.

The “cur” and “rec” values show the current and recovery epochs, which are how we keep track of which hosts are allowed to perform recovery and when.

Open firewall ports:

On Fedora, I needed to allow clients to get to the server. On both hosts:

# firewall-cmd --permanent --add-service=nfs
# firewall-cmd --reload

This opens port 2049 so that the server is accessible from other hosts.
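
To double-check that the rule is active:

# firewall-cmd --list-services

…and verify that nfs shows up in the output.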

Start the Servers:

Start up the server on each host:

# systemctl start nfs-ganesha
# systemctl enable nfs-ganesha
Created symlink /etc/systemd/system/multi-user.target.wants/nfs-ganesha.service → /usr/lib/systemd/system/nfs-ganesha.service.
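
If the service fails to come up, the journal will usually tell you why (a typo in ganesha.conf or an unreachable mon address, for instance):

# journalctl -u nfs-ganesha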

If all went well, ganesha.nfsd should be running on both hosts. Now check the grace database:

$ ganesha-rados-grace -p cephfs.a.meta -n ganesha 
cur=3 rec=0
======================================================
a
b

Note that both nodes at this point have cleared their E flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.

From here, you should be able to mount either server from clients and do normal NFS operations against them.
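
For example, on a client (assuming “nfs-a” and “nfs-b” resolve to the two ganesha hosts, and using the /ceph pseudo path from the export above):

# mount -t nfs -o vers=4.1 nfs-a:/ceph /mnt

A second client can mount nfs-b:/ceph the same way, and both will see the same CephFS namespace.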

Caveats:

This configuration contains an implicit assumption that if one of these servers goes down, it will be restarted within 5 minutes. After 5 minutes, the Ceph MDS may evict the session that the ganesha client held, along with all of the state associated with it. If the session’s caps (stateful objects) are released before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.
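
If you want to see what that window is on your cluster, the relevant MDS option is (I believe) mds_session_autoclose, which defaults to 300 seconds:

$ ceph config get mds mds_session_autoclose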

This is one of the reasons Red Hat is focused on containerized deployments for this. Restarting a container is quite a bit cheaper and has a much smaller footprint than an entire VM.

Conclusion:

This is just a very quick overview of how to deploy a cluster in as simple a way as possible. I don’t consider this to be a configuration that many people would want to use, but it illustrates what is possible with other, layered technologies to improve resiliency.

Comments:

  1. Jeff Layton Post author

      Set up two clients each mounting different servers in the cluster, and see if they can both access data simultaneously.

      Of course, most of that worked prior to the rados_cluster backend being merged. The difference here is that with that recovery backend we can properly handle the case where a server crashes and restarts. To verify that that is doing the right thing, you can run something like this:

      $ watch ganesha-rados-grace -p cephfs.a.meta -n ganesha

      …and then restart one or more of the ganesha heads while driving activity on it. You should see the new head come up, request a grace period and then all of the other heads respond by beginning enforcement. Once reclaim is done, everything should eventually settle back into normal operation.

  2. Hoan Nguyen Van

    Do you have a solution for nfs-ganesha active/active with a single endpoint?

    I use the TCP proxy mode of nginx or haproxy for client access to nfs-ganesha, and my two proxy servers need keepalived.

    Thanks.

    1. Jeff Layton Post author

      The problem with trying to do something like a load balancer is that reclaim relies on the client finding its way back to the same NFS head. If you can guarantee that client affinity will always be preserved, then you’re probably fine to use one. If there is a chance that the client could end up getting redirected to a different node after a server crash, then that would prevent it from reclaiming properly.

  3. đức phạm xuân

    Hello guys, I used this article for OpenStack Manila, so I mounted /etc/ganesha/export.d/ to the same backend on both hosts. Manila writes its share config to nfs-ganesha server 1, and when I access-allow a share, the access config also shows up on nfs-ganesha server 2. However, I can’t mount the share from nfs-ganesha server 2. I don’t know why server 2 isn’t working when both servers have the same config.

  4. Bruce Fields

    ““Failover” in this configuration will mean manually unmounting and remounting the other server on the client. If the server dies, we’ll just restart the whole thing.”

    I’m having trouble reconciling this with the fact that you’re doing server reboot recovery.

    Maybe by “the server” you mean the backend ceph cluster? Or maybe “the whole thing” means “the one server that died”?

    1. Jeff Layton Post author

      Sorry, that is unclear. Let me try and clarify:

      In this case, “restart the whole thing” means a single ganesha server. In a containerized deployment (like with http://rook.io), we end up just reconstituting an entire container from scratch. Any state that needs to persist between incarnations of the same ganesha server is in RADOS, so we don’t need to worry about moving filesystems around and such.

      When parts of the backend ceph cluster die, then standard ceph recovery behavior applies. Usually, OSDs and such work in groups to keep redundant copies of data, so if you just lose a single physical node you usually can still keep chugging along.

      If you have a larger-scale failure then standard ceph client recovery behavior applies. In general, in that situation the ganesha servers just get stuck waiting for enough of the backend to return, similar to how NFS clients behave when an NFS server goes away.

      The point of that statement is that we’re not targeting traditional failover clustering (like pacemaker) with this solution. One ganesha server doesn’t really take over resources for another in this setup. A server in this configuration is an isolated unit, and we rely on kubernetes or similar container orchestration to bring a server head back quickly if it goes down.

      Hopefully this clarifies things!

  5. Dray Piper

    Tried this setup on 3 Fedora 30 Cloud Base small nodes in an oVirt cluster with some additional security. Had some problems with SELinux: I had to debug and add “allow ganesha_t usr_t:file map;” & “allow ganesha_t cyphesis_port_t:tcp_socket name_connect;” rules in an additional module. Then I tried to use a restricted principal for connecting to the Ceph cluster, but it wasn’t clear to me that I still needed the client.admin key on the Ganesha nodes for the ganesha-rados-grace component. Now that I finally have everything working (tiny VMs restart in a matter of seconds on SSD Ceph), I still think the Storhaug approach, with its CTDB backend providing failover and load balancing, is more production-ready. We need a virtual IP for simple connectivity (e.g. I cannot configure multiple NFS nodes in the ansible inventory file for an OpenShift cluster, and manually typing in different IPs for PVs in pod deployments would also be a burden; we need one entrypoint for dynamic provisioning of NFS “volumes”), and we need the better bandwidth that load balancing allows. I wish Storhaug supported CephFS as a second option… or that RADOS_KV offered full clustering capabilities. JM2C

    1. Jeff Layton Post author

      Fair points all around. The code that exists in ganesha today is only a partial solution. To really do this correctly, I think we’d want to have some mechanism for distributing clients to different head-end nodes, probably using NFSv4 migration and replication.

      The problem with using a load balancer to distribute clients is that it makes reclaim _very_ iffy. Suppose you have the server crash and then the load balancer decides to send the client to a different ganesha daemon than it was originally using. At that point the client can’t count on being able to reclaim its state since the new server may not have a recovery record for it.

      There are a number of potential ways to deal with that, but none of them are implemented in ganesha at this point (and not even when deployed by storhaug).
