Since returning to Red Hat, I’ve done a fair bit of work in enabling nfs-ganesha to serve out cephfs. The operating idea here has been to push as much infrastructure into CephFS as we can to enable us to properly implement NFS semantics.
nfs-ganesha is then just another CephFS client. It hands out NFS state to the NFS clients and holds the corresponding CephFS state on their behalf.
With a single NFS server, this works reasonably well. We’ll even have delegation support (coming soon!). The question now is — can we stand up an arbitrary set of servers that only interact with one another via ceph?
The advent of technologies like kubernetes makes it attractive to be able to stand up cookie-cutter NFS servers dynamically as the load requires. How loosely can we couple them without breaking NFS recovery semantics?
Note that in the discussion below, I’m deliberately ignoring applications other than NFS that interact with the cluster during the grace period. That’s potentially a real problem, but I’m setting it aside for the sake of simplicity.
The basic problem:
The main issue here is that NFS server state is not persistently stored in a fine-grained way. Most of the state (opens, locks, delegations, layouts) is held only in memory. That’s not usually a problem on a standalone NFS server.
When an NFS server crashes, it comes back up with no record of the state its clients hold. We inform the clients of the crash, and they then start reclaiming their state within a certain window (usually two lease periods, around 90-120s depending on the implementation). During the grace period, no new state can be established; clients can only do reclaims. Once the grace period has ended, the reverse is true: reclaims are no longer allowed, and new state can be established.
There are some corner cases here, involving network partitions and multiple reboots, that require us to maintain, on stable storage, a list of clients that hold at least one open. That’s also fairly simple to handle with a single server.
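To make those rules concrete, here is a minimal sketch in Python. This is not ganesha’s actual implementation; the `RecoveryDB` class, the JSON file layout, and the 90-second figure are all invented for illustration.

```python
# Illustrative sketch of an NFS server's stable-storage recovery database:
# track which clients held state before a crash, and only allow those
# clients to reclaim during the grace period.
import json
import os
import time

GRACE_SECONDS = 90  # roughly two lease periods (assumed value)

class RecoveryDB:
    def __init__(self, path):
        self.path = path
        # Clients recorded by the previous server instance; only these
        # may reclaim state during the grace period.
        self.previous = set(json.load(open(path))) if os.path.exists(path) else set()
        self.current = set()  # clients that acquire state in this epoch
        self.grace_until = time.monotonic() + GRACE_SECONDS

    def in_grace(self):
        return time.monotonic() < self.grace_until

    def allow_reclaim(self, client):
        # Reclaims are legal only during grace, and only for clients
        # that held state before the crash.
        return self.in_grace() and client in self.previous

    def allow_new_state(self, client):
        # No new state during grace. Afterward, record the client on
        # stable storage as soon as it establishes state.
        if self.in_grace():
            return False
        self.current.add(client)
        json.dump(sorted(self.current), open(self.path, "w"))
        return True
```

A real implementation would also have to handle the multiple-reboot corner cases (keeping the previous epoch’s database around until the grace period ends), which this sketch glosses over.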
Scaling out, the simple way:
There are ways to scale out without needing to do a lot of complex coordination. If you know that an inode can only ever be served out by a single NFS server, then you can really treat it as wholly independent of other servers. There can be no conflicting state acquired via other NFS servers that way.
If you imagine a set of servers that only serve out a single export each via an individual IP address, with no hardlinks that cross export boundaries, then that should work just fine today without any external coordination.
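As a toy illustration of that partitioning (the server names and hash choice here are made up), you could pin each export to exactly one server deterministically:

```python
# Toy sketch of the loosely-coupled layout: each export is pinned to
# exactly one NFS server, so no two servers can ever hand out
# conflicting state for the same inode.
import hashlib

SERVERS = ["nfs-0", "nfs-1", "nfs-2"]  # one IP address each (assumed names)

def server_for_export(export_path):
    # Deterministic mapping: the same export always lands on the same server.
    h = int(hashlib.sha1(export_path.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]
```

The essential property is only that the mapping is stable and non-overlapping; any scheme with those properties would do.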
With something like kubernetes, where you could potentially restart a cookie-cutter NFS server very quickly, this might be a workable solution. It effectively lets you scale out as the number of exports grows.
Serving the same content in parallel:
Those configurations are potentially limiting, though. We really want to be able to scale out according to the number of clients we have, not the number of exports.
That means that we need to allow multiple servers to export the same data in parallel.
Suppose we have a setup that looks like this:
+----------------+      +----------------+
|                |      |                |
| NFS Client 1   +------+ NFS Server 1   +----------+
|                |      |                |          |
+----------------+      +----------------+    +-----+----+
                                              |          |
                                              |  CephFS  |
                                              |          |
+----------------+      +----------------+    +-----+----+
|                |      |                |          |
| NFS Client 2   +------+ NFS Server 2   +----------+
|                |      |                |
+----------------+      +----------------+
In this configuration, we don’t float IP addresses between NFS Server 1 and 2 like we would with a traditional clustered configuration. Instead, we assign an address to each NFS server, and reconstitute the server from scratch if it crashes.
All of the stable storage info for the NFS server is stored in RADOS, though each NFS server has its own database for recovery.
Now, suppose NFS Server 1 crashes. At that point, NFS Client 1 knows what NFS state it holds. The CephFS server knows what capabilities and state NFS Server 1 held before it crashed.
Kubernetes (or whatever) detects that it’s down and creates a new instance of NFS Server 1. That instance comes up knowing only the list of clients that are allowed to reclaim their state. What we need to happen at this point is for CephFS to release the state that was held by the previous NFS Server 1 incarnation, and allow NFS Client 1 to reclaim it.
In the window between the point when the CephFS caps held by the earlier NFS Server 1 incarnation are trashed and the point when NFS Client 1 reclaims its state, NFS Client 2 (talking to NFS Server 2) could race in and set some state that conflicts with what NFS Client 1 holds.
We must prevent this from occurring. The simplest way to do that is to force all of the other NFS servers into the grace period before CephFS revokes the caps that the old instance held.
In practical terms, this means that we need to ensure that all of the remaining NFS servers are enforcing the grace period prior to restarting a crashed NFS server in the cluster. They are all only allowed to exit the grace period once all of the servers have stopped allowing reclaim.
That also neatly handles the case where we have a clean start from zero state in the cluster. All servers will go into their grace period until clients are finished reclaiming (or all of the individual grace periods time out).
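The coordination rule above can be sketched as a tiny shared state machine. In practice this shared record would live in RADOS, consumed by each ganesha instance; the `ClusterGrace` class and its method names are invented for illustration.

```python
# Hypothetical sketch of cluster-wide grace-period coordination: caps
# from a dead instance may be released only once every server enforces
# the grace period, and no server may lift it until all servers have
# stopped allowing reclaim.
class ClusterGrace:
    def __init__(self, servers):
        self.servers = set(servers)  # all NFS servers in the cluster
        self.need_grace = set()      # servers whose clients may still reclaim
        self.enforcing = set()       # servers currently enforcing grace

    def request_grace(self, server):
        # A restarted (or freshly started) server asks for a grace period.
        self.need_grace.add(server)

    def enforce(self, server):
        # A surviving server acknowledges that it is enforcing grace.
        self.enforcing.add(server)

    def safe_to_revoke_old_caps(self):
        # CephFS may release the crashed instance's caps only once every
        # server in the cluster is enforcing the grace period.
        return self.servers <= self.enforcing

    def reclaim_done(self, server):
        # A server's clients finished reclaiming (or its grace timed out).
        self.need_grace.discard(server)

    def may_lift_grace(self):
        # No new state anywhere until all servers have stopped allowing
        # reclaim.
        return not self.need_grace
```

The clean-start case falls out naturally: every server requests grace at startup, and the cluster-wide grace period ends only when the last one reports its reclaims complete.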
Handling an excessively tardy server restart:
It’s also possible for an NFS server to go down while clients still hold state on it, but then take so long to come back that the MDS decides to give up on it and revokes its caps anyway.
Clients of such a server must not be allowed to reclaim any state, as we cannot guarantee consistency in that situation. The simplest way to ensure that is to trash the server’s persistent client database whenever the caps it held were revoked outside of a grace period. That prevents any reclaim. The surviving clients will likely need to be restarted, but that’s really the best we can do in this situation.
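The tardy-restart rule reduces to a single check at startup. This is a hedged sketch, not real ganesha code; the function name and the flag are assumptions about how such a check might be plumbed through.

```python
# Sketch of the "tardy restart" rule: if the MDS revoked the previous
# instance's caps outside of a grace period, the recovery database is
# stale, so discard it and thereby refuse all reclaims.
import os

def maybe_trash_recovery_db(path, caps_revoked_outside_grace):
    if caps_revoked_outside_grace and os.path.exists(path):
        os.unlink(path)  # no previous-epoch clients -> no reclaims allowed
        return True
    return False
```

With the database gone, `RecoveryDB`-style logic would see an empty previous-client list, and every reclaim attempt would be rejected, exactly as required.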
Decommissioning a server instance with active clients:
If we know that we want to decommission one of the NFS servers (scaling down the cluster at times of low load, for instance), we should be able to use NFS migration to move its clients to other servers (support for that would need to be added to ganesha).
Alternately, we could merge its IP addresses and client database into another existing server. There are several options here; the main point is that we need to take proactive steps to “drain” a server before decommissioning it.
So, that more or less lays out the rules for how we need various NFS servers to behave. The question now is what will ensure that we follow them.
Traditional clustering does this with stuff like corosync and pacemaker, but having to use that when we already have clustering infrastructure in ceph itself seems wrong.
The current thinking (credit to Patrick Donnelly) is to add a ceph monitor service and map for the NFS gateways. That map would be consumed by a new daemon that would manage the ganesha daemon, possibly using dbus to handle the grace period of a running instance.
Alternately, we could implement this as a new shared library in ceph that ganesha would link in. This part of the design is still being worked out.
Overall, I’m pretty excited by the potential of this design. I think it should offer us the ability to scale nearly linearly, at least under some common usage patterns.
Beyond this design, we may want to look at how to more effectively move clients around the cluster based on their access patterns. For instance, we may want to explore assigning a NFS server to each CephFS MDS in a multimds cluster, and leverage NFS migration to have clients access objects on the NFS server closest to the authoritative MDS for an inode.
That sort of design might be very interesting coupled with pNFS. A flexfiles layout is something of a natural fit over CephFS.