In an earlier blog post, I described an Active/Active NFSv4 cluster from a high-level standpoint. The real difficulty comes into play when we consider failover scenarios. In this post, I’ll try to describe the requirements for state handling and recovery.
When an NFSv4 server reboots, its clients must reclaim their state once it comes back up. Each client must first re-establish its clientid, and then reclaim any locks and open files that it previously held. The server allows a fixed period after rebooting for this to occur. This period is called the grace period, and its length is generally a function of the client lease time. While we want to eliminate the grace period on failover, we will still need to deal with it for cluster-wide restarts.
Typically, we consider the grace period to be a property of the server. With a clustered filesystem, we must consider the grace period to be a property of the cluster as a whole. Why? The basic rule that we must adhere to is that once any server in the cluster grants a non-recovery lock or open, the cluster can allow no more recovery. To do anything else might mean that we grant a lock that conflicts with one that has not yet been reclaimed. One way to implement this would be to mandate that once one server exits the grace period, the other servers must not allow any more recovery either.
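To make that rule concrete, here is a minimal Python sketch of a grace period shared by every node. It is only an illustration under stated assumptions: the `ClusterGrace` and `Node` classes are hypothetical, and the in-memory flag stands in for whatever cluster-coherent state a real implementation would use. The status strings are the NFSv4 error names.

```python
import threading


class ClusterGrace:
    """Cluster-wide grace state (hypothetical stand-in for shared cluster state)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_grace = True

    def in_grace(self):
        with self._lock:
            return self._in_grace

    def end(self):
        # Called by whichever node exits the grace period first.
        with self._lock:
            self._in_grace = False


class Node:
    """One NFS server in the cluster (hypothetical)."""

    def __init__(self, name, grace):
        self.name = name
        self.grace = grace

    def exit_grace(self):
        # Ending grace on one node ends it for the whole cluster.
        self.grace.end()

    def handle(self, reclaim):
        in_grace = self.grace.in_grace()
        if reclaim:
            # Reclaims are only honoured while the cluster is in grace.
            return "NFS4_OK" if in_grace else "NFS4ERR_NO_GRACE"
        # Non-recovery locks and opens must wait until grace is over.
        return "NFS4ERR_GRACE" if in_grace else "NFS4_OK"
```

Once node `a` calls `exit_grace()`, a reclaim sent to node `b` is refused as well. A real implementation would also have to make the check and the grant atomic with respect to the end of grace, which this sketch does not attempt.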
Also, it’s necessary for servers to track some information about the clients on stable storage to guard against edge cases that can occur when reboots and network partitions coincide. Once the first node exits the grace period, it must ensure that any clientid records that were not reclaimed during the current grace period are invalidated.
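A rough sketch of that bookkeeping, assuming a single JSON file as the "stable storage" and a hypothetical `ClientRecords` class (neither is taken from any existing implementation), might look like this:

```python
import json
import os


class ClientRecords:
    """Tracks which clientids may reclaim state after a restart (hypothetical)."""

    def __init__(self, path):
        self.path = path
        # Clients that held state before the restart, read from stable storage.
        self.recoverable = self._load()
        # Clients that have reclaimed or established state since the restart.
        self.active = set()

    def _load(self):
        try:
            with open(self.path) as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def _store(self, clientids):
        # Write to a temporary file, sync it, then rename it over the old one
        # so a crash never leaves a half-written record file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(clientids), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def may_reclaim(self, clientid):
        return clientid in self.recoverable

    def note_active(self, clientid):
        self.active.add(clientid)
        # Until grace ends, unreclaimed records must survive another crash.
        self._store(self.recoverable | self.active)

    def end_grace(self):
        # Invalidate every record that was not reclaimed during this grace period.
        self.recoverable = set()
        self._store(self.active)
```

In a cluster, the record store would itself have to be visible to every node, which this single-file sketch glosses over.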
So, those are our basic design constraints for the state recovery subsystem. We will need some sort of cluster-aware userspace component to do this, as it can’t reasonably be done in the kernel alone. It will need to talk to the kernel, however, so we need to determine what sort of kernel/userspace interface makes the most sense for this.
For a first pass, it may be simpler to just have the cluster return NFS4ERR_NO_GRACE on any reclaim that doesn’t match an existing lock that DLM is aware of. The upshot here is that there would be no reclaim of locks or opens after a cluster-wide restart, since DLM would have no record of them. Eventually, though, we’ll need to deal with the cluster-wide restart case, so we need to ensure that any initial design doesn’t hamstring that ability.
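The first-pass policy could be sketched roughly as follows. This is a toy model, not real DLM code: the `LockKey` tuple and the plain Python set standing in for "locks DLM is aware of" are assumptions made purely for illustration.

```python
from typing import NamedTuple


class LockKey(NamedTuple):
    """Identifies a lock for matching purposes (hypothetical layout)."""
    clientid: str
    filehandle: str
    offset: int
    length: int


def handle_reclaim(dlm_locks, request):
    """Allow a reclaim only if it matches a lock the cluster still holds.

    dlm_locks is a set of LockKey values standing in for DLM's view.
    """
    if request in dlm_locks:
        return "NFS4_OK"
    return "NFS4ERR_NO_GRACE"


# Failover: the surviving nodes still know about the lock, so the reclaim works.
held = {LockKey("client-1", "fh-42", 0, 100)}
status_failover = handle_reclaim(held, LockKey("client-1", "fh-42", 0, 100))

# Cluster-wide restart: DLM state is gone, so every reclaim is refused.
status_restart = handle_reclaim(set(), LockKey("client-1", "fh-42", 0, 100))
```

Here `status_failover` is `"NFS4_OK"` and `status_restart` is `"NFS4ERR_NO_GRACE"`, which is exactly the limitation described above.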