Monthly Archives: December 2011

Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking (part 2)

In an earlier post, I outlined a proposed overhaul of the scheme that knfsd uses to track client names on stable storage. There’s another aspect we need to consider as well. First, some background…

There are actually two sorts of client “IDs” that we need to consider:

  • The long, client-generated string that (supposedly) uniquely identifies the client. This object is referred to in RFC 3530 as an nfs_client_id4, and it’s the thing that the client tracking daemon will be tracking. Henceforth, I’ll call this the “client name”. This value is generally just used to acquire an ID of the other type.
  • The shorthand (64-bit) value that the server hands out to the client. This is referred to as a clientid4 in the RFC, and it’s the value that’s generally used for “day to day” stateful operations. We’ll call this the “client ID” (both objects are sketched just below).
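
For clarity, here’s roughly what those two objects look like. This is a loose C rendering of the RFC 3530 definitions, not the actual kernel structures:

        #include <stdint.h>

        #define NFS4_VERIFIER_SIZE  8
        #define NFS4_OPAQUE_LIMIT   1024

        /* the "client name": a long, client-generated, (hopefully) unique string */
        struct nfs_client_id4 {
                uint8_t  verifier[NFS4_VERIFIER_SIZE];
                uint32_t id_len;
                uint8_t  id[NFS4_OPAQUE_LIMIT];  /* the opaque client name itself */
        };

        /* the "client ID": the 64-bit shorthand the server hands back */
        typedef uint64_t clientid4;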

Currently, the server generates client IDs by combining a 32-bit “boot time” value with a 32-bit “counter”. This is generally fine in a single-node configuration: it’s highly unlikely you’d ever get a client ID collision. That matters because when a call comes in with a particular client ID, we need to be able to use it to (quickly) find the actual client tracking object in the server.
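
As a rough sketch (not the actual knfsd code, and the exact bit layout may differ), the generation scheme amounts to something like this:

        #include <stdint.h>

        /* rough sketch of the current scheme: high 32 bits come from the boot
         * time, low 32 bits from a simple counter -- the real knfsd layout may
         * differ, but the collision problem described below is the same */
        static uint32_t boot_time;          /* seconds, recorded once at startup */
        static uint32_t clientid_counter;

        static uint64_t generate_clientid(void)
        {
                return ((uint64_t)boot_time << 32) | ++clientid_counter;
        }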

In a cluster, however, this scheme is probably not going to be sufficient. If two cluster hosts boot during exactly the same second, they could hand out the same client ID. That’s a big problem. Consider the following example:

Hosts foo and bar are clustered NFS servers. They boot at the same time and hand out the same client ID to two different clients. Now, an address floats from foo to bar. The client that was originally talking to foo then performs an operation against bar that requires the client ID. bar sees that client ID and confuses it with the one that it originally handed out.

There are a few different ways to avoid this:

  1. We can try to ensure that client ID collisions like this never happen. Sounds simple, but it’s difficult in practice.
  2. Instead of identifying clients by client ID alone, we can use a client ID + server address tuple (see the sketch after this list).
  3. We can containerize nfsd and ensure that each address is started within its own container. Presumably a containerized nfsd will have its own list of clients, so a collision is no big deal.
  4. We can consider abandoning the floating address model altogether and rely on clients using round-robin DNS and fs_locations info to find the servers. If the client knows that it must reclaim state before it ever talks to the failover host, then collisions aren’t a problem.
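
To make option 2 a little more concrete, here’s a minimal sketch of what widening the lookup key might look like. The types and names here are hypothetical, not actual knfsd code:

        #include <stdint.h>
        #include <string.h>
        #include <sys/socket.h>

        /* hypothetical lookup key: the client ID alone is no longer unique in a
         * cluster, but (client ID, server address it was issued against) is */
        struct clientid_key {
                uint64_t                clientid;
                struct sockaddr_storage server_addr;  /* address the client talked to */
        };

        static int clientid_key_match(const struct clientid_key *a,
                                      const struct clientid_key *b)
        {
                /* crude: a real implementation would compare the address
                 * fields properly instead of memcmp'ing the whole storage */
                return a->clientid == b->clientid &&
                       memcmp(&a->server_addr, &b->server_addr,
                              sizeof(a->server_addr)) == 0;
        }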

Containerizing sounds like the best approach, but perhaps abandoning floating addresses for this would be easier?

Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking

There are a number of problems that can occur when server reboots coincide with a loss of communication between the client and server. A couple of possible cases are outlined in Section 8.6.3 of RFC 3530. In order to prevent these problems, the server must track some information about its clients on stable storage. The Linux server currently implements the minimal scheme needed for this (see the sketch after the list):

  1. Whenever the client does its first non-reclaim OPEN, OPEN_CONFIRM, or (for v4.1 clients) RECLAIM_COMPLETE, the server must store information about that client’s opaque ID string.
  2. When the client needs to reclaim during the grace period, the server must verify that there is a record of the client on stable storage.
  3. When the client ID expires, the server MAY remove that record.
  4. When the grace period ends, any client records that were not reclaimed during the current grace period must be removed.
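
The per-client record that this scheme implies is pretty small. Something like the following, purely as an illustration of what the tracking code needs to remember:

        #include <stddef.h>
        #include <time.h>

        #define CLIENT_NAME_MAX 1024    /* the opaque nfs_client_id4 string */

        /* illustrative per-client record; the real storage format is up for grabs */
        struct client_record {
                unsigned char name[CLIENT_NAME_MAX]; /* the opaque client name */
                size_t        name_len;
                time_t        created;               /* when the record was stored */
                int           reclaimed;             /* reclaimed in current grace period? */
        };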

Linux knfsd currently does this by tracking the information on disk, usually under /var/lib/nfs/v4recovery. Note that there is no daemon for this: knfsd creates and removes directories there itself, using vfs_* calls from within the kernel. That seems a bit odd and somewhat alarming until you realize that knfsd manipulates filesystems all the time in the course of its normal duties. Still, changing this has been on the to-do list for a long time.

Now, though, we have a clear need to change this. If an address floats to a different server (or there’s a migration event or something), we still need to track the clients behind it in case they need to reclaim their state. The client ID database must be shared among all the hosts in the cluster.

One thing to bear in mind is that clients do not call SETCLIENTID against a server, but rather against an address. There is no reliable mechanism for a client to tell that a server is multi-homed, so it must treat each address as if it were an independent server. By the same token, when the address floats to a new server, we need to move any stateful information that’s associated with that address to the new server as well.

So, how do we fix this…

The first step toward fixing this is to remove the code that does this tracking today and replace it with a more standard upcall + daemon. Since we have a large installed base of servers using the old mechanism, we will also need a transition mechanism. Currently, I’m considering just moving the older code inside something like this:

        #ifdef CONFIG_NFSD_V4_LEGACY_STATE_TRACKING

…and adding a new Kconfig option to switch between the two.

Once that’s settled, we’ll need to add the upcall and daemon. I have some prototype code in progress for this, and it’s fairly straightforward. The upcall uses a new pipe in rpc_pipefs, and the daemon tracks the info using a sqlite DB in a standard location.
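
To give a flavor of the idea, the storage side of the daemon could be about as simple as the sketch below. The table layout and the /var/lib/nfs/clienttrack.db path are assumptions for illustration, not necessarily what the prototype does:

        /* sketch of the daemon's storage side: one table of client names.
         * Hypothetical schema and path; build with: cc foo.c -lsqlite3 */
        #include <sqlite3.h>

        static const char *create_sql =
                "CREATE TABLE IF NOT EXISTS clients ("
                "  name BLOB PRIMARY KEY,"      /* the opaque client name string */
                "  time INTEGER"                /* when the record was last established */
                ");";

        /* store a client name when the kernel upcalls to say "remember this client" */
        static int store_client(sqlite3 *db, const void *name, int namelen)
        {
                sqlite3_stmt *stmt;
                int ret;

                ret = sqlite3_prepare_v2(db,
                        "INSERT OR REPLACE INTO clients "
                        "VALUES (?, strftime('%s','now'));", -1, &stmt, NULL);
                if (ret != SQLITE_OK)
                        return ret;
                sqlite3_bind_blob(stmt, 1, name, namelen, SQLITE_STATIC);
                ret = sqlite3_step(stmt);
                sqlite3_finalize(stmt);
                return ret == SQLITE_DONE ? 0 : ret;
        }

        int main(void)
        {
                sqlite3 *db;

                if (sqlite3_open("/var/lib/nfs/clienttrack.db", &db) != SQLITE_OK)
                        return 1;
                sqlite3_exec(db, create_sql, NULL, NULL, NULL);
                store_client(db, "example.client.name", 19);
                sqlite3_close(db);
                return 0;
        }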

Doing this in the single-host case is relatively easy, and most of the functions are identical in the clustered case as well. The kicker is the last element of the tracking scheme, where we clean up at the end of the grace period. Because this involves the grace period (which must be coordinated among all the cluster hosts), we have to do something more elaborate for a clustered setup. We’ll need to ensure that only the first (or last?) host to come out of grace does the “purge” at the end of the grace period.
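
For instance, if the SQLite sketch above grew a “reclaimed” column that the daemon sets as each client reclaims, the purge itself would just be a couple of statements; the interesting part is making sure only one node runs it:

        /* sketch only: assumes a "reclaimed" column that the daemon sets as each
         * client reclaims during the current grace period. Which node gets to
         * run this is exactly the coordination problem described below. */
        #include <sqlite3.h>

        static int purge_unreclaimed(sqlite3 *db)
        {
                int ret;

                ret = sqlite3_exec(db,
                        "BEGIN IMMEDIATE;"                       /* serialize against other writers */
                        "DELETE FROM clients WHERE reclaimed = 0;"
                        "UPDATE clients SET reclaimed = 0;"      /* reset for the next grace period */
                        "COMMIT;",
                        NULL, NULL, NULL);
                return ret == SQLITE_OK ? 0 : ret;
        }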

This may mean that some sort of quorum/election cluster-fu will be involved here. Ideally, we’ll just be able to extend the single-host daemon to be cluster-aware when needed. Worst case, we’ll need to design an entirely separate daemon for it, but I hope that won’t be necessary.

Active/Active NFS on Clustered Filesystems: Lock Recovery and Grace Period

In an earlier blog post, I described an Active/Active NFSv4 cluster from a high-level standpoint. The real difficulty comes into play when we consider failover scenarios. In this post, I’ll try to describe the requirements for state handling and recovery.

When an NFSv4 server reboots, its clients must reclaim their state once it comes back up. Each client must first re-establish its client ID, and then reclaim any locks and open files that it previously held. The server allows a particular window after rebooting for this to occur; this window is called the grace period, and its length is generally a function of the client lease time. While we want to eliminate the grace period on failover, we will still need to deal with it for cluster-wide restarts.
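
To make that concrete, the usual single-server policy amounts to something like this (a sketch assuming the common 90-second default lease, not the exact knfsd logic):

        #include <stdint.h>

        #define NFSD_LEASE_TIME 90      /* seconds; assumed default lease */

        /* sketch: hold the grace period open for roughly one lease after boot,
         * so every client with valid state gets a chance to notice and reclaim */
        static inline uint64_t grace_period_end(uint64_t boot_time)
        {
                return boot_time + NFSD_LEASE_TIME;
        }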

Typically, we consider the grace period to be a property of the server. With a clustered filesystem, though, we must consider the grace period to be a property of the cluster as a whole. Why? The basic rule that we must adhere to is that once any server in the cluster grants a non-recovery lock or open, the cluster can allow no more recovery. To do anything else might mean that we grant a lock that conflicts with one that has not yet been reclaimed. One way to implement this would be to mandate that once one server exits the grace period, the other servers must not allow any more recovery either.
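
In pseudo-C, the invariant looks something like this, where cluster_in_grace() is a hypothetical stand-in for whatever shared state (a DLM lockspace, a shared database, whatever) ends up recording whether any node has left grace:

        #include <stdbool.h>

        #define NFS4ERR_GRACE    10013  /* error codes from RFC 3530 */
        #define NFS4ERR_NO_GRACE 10033

        /* placeholder: would really consult shared cluster-wide state */
        static bool cluster_in_grace(void)
        {
                return true;
        }

        static int check_open_or_lock(bool is_reclaim)
        {
                if (cluster_in_grace())
                        /* no node has granted a non-recovery open or lock yet:
                         * reclaims are fine, ordinary requests must wait */
                        return is_reclaim ? 0 : NFS4ERR_GRACE;

                /* some node has left grace and may have granted conflicting
                 * opens or locks, so recovery must be over everywhere */
                return is_reclaim ? NFS4ERR_NO_GRACE : 0;
        }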

Also, it’s necessary for servers to track some information about the clients on stable storage to guard against edge cases that can occur when reboots and network partitions coincide. Once the first node exits the grace period, it must ensure that any client ID records that were not reclaimed during the current grace period are invalidated.

So, those are our basic design constraints for the state recovery subsystem. We will need some sort of cluster-aware userspace component to do this as it can’t reasonably be done in the kernel alone. It will need to talk to the kernel however, so we need to determine what sort of kernel/userspace interface makes the most sense for this.

For a first pass, it may be simpler to just have the cluster return NFS4ERR_NO_GRACE on any reclaim that doesn’t match an existing lock that DLM is aware of. The upshot is that there would be no reclaim of files on a cluster-wide restart. Eventually, though, we’ll need to deal with the cluster-wide restart case, so we need to ensure that any initial design doesn’t hamstring that ability.
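
In other words, the first pass might boil down to the check below. lock_held_by_dlm() is purely hypothetical; figuring out how to actually ask DLM that question is part of the work:

        #include <stdbool.h>

        #define NFS4ERR_NO_GRACE 10033  /* from RFC 3530 */

        struct lock_request;    /* opaque here */

        /* hypothetical helper: does DLM already hold a matching lock? */
        static bool lock_held_by_dlm(const struct lock_request *req)
        {
                (void)req;
                return false;   /* placeholder */
        }

        static int handle_reclaim(const struct lock_request *req)
        {
                if (!lock_held_by_dlm(req))
                        return NFS4ERR_NO_GRACE;
                return 0;       /* ...otherwise re-establish the NFSv4 state... */
        }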

Active/Active NFS on Clustered Filesystems: Basic Design

In the last couple of years, clustering support for Samba has really taken off. It’s now possible to take a clustered filesystem (such as GFS2, OCFS2, or GPFS) and serve that filesystem via Samba from multiple nodes. The next step is to do the same with NFS. Since NFSv2/3 are rapidly becoming legacy protocols, we want to base this on NFSv4.

Here’s what I envision as a basic overall design…

Start with a cluster of machines with a shared-storage filesystem like GFS2. Each node in the cluster runs knfsd and its own set of daemons:

  • mountd
  • idmapd
  • rpcsvcgssd
  • …and any other ones that are needed, such as a client ID tracking daemon

Multiple IP addresses will float back and forth freely in the cluster; that will be handled by resource management software (rgmanager or something). When a cluster resource group floats to another server, it should just need to move the address. Restarting knfsd should not be allowed (or recommended).

When an IP address moves, any clients that have mounted through that address will need to reclaim their state (locks and opens; eventually we may also want to allow reclaim of delegations).

One major goal is to ensure that when an IP address floats to another server, we don’t need to put the entire cluster into grace, which would force clients on all of the other addresses to reclaim their state as well.