Active/Active NFSv4 on Clustered Filesystems: avoiding the VFS layer

So far, most of my posts about this proposed design have been about teaching the VFS and underlying filesystems how to handle state properly for NFSv4 semantics. We could, however, consider another design: a clustered daemon that runs and tracks the state independently of the VFS. In other words, we could go with a more Samba/CTDB-style design here.

knfsd would upcall to a daemon, and that daemon would be the final arbiter of the state. Such a daemon could be integrated with Samba/CTDB such that knfsd and Samba are aware of each other’s state. In principle, that would allow you to export the same filesystem via both knfsd and Samba/CTDB and get reasonable semantics for both. Oplocks and delegations would get broken appropriately, share mode reservations should work, and (possibly) we could make locking work in a more consistent fashion than we do today.

Of course, there’s a catch — nothing else would be aware of that state. It would all be tied up inside the CTDB database (or whatever clustered DB infrastructure we’d end up using). The result would be an NFS+SMB “appliance”. I think that’s a less desirable design than one that’s more general purpose, but it might be easier to achieve, and we might be able to hammer it out more quickly since we’d be able to avoid a lot of the VFS-layer work.

In the near term, we don’t really need to make this decision. Either way, we’ll still need to be able to swap in the correct set of operations to handle it, so the focus for now can be on simply abstracting out the parts of the server code that we’ll need to swap out to handle this later. It should even be possible to do this sort of design as an interim step, and then add what the VFS would need for a more general solution later.


Active/Active NFSv4 on Clustered Filesystems: abstracting out fs operations from nfsd

Regardless of which method we ultimately choose, a more immediate engineering task is to abstract out the operations that we want to delegate to the lower filesystems. Since we eventually want to allow filesystems to override the default grace-period handling, we’ll need to provide some way to do that. Most likely, we’ll just add some new export_ops to allow for this (or maybe a new struct that hangs off the export ops).

Here’s a first stab at what I envision for this, mostly based on looking at where nfsd currently checks the grace period:

  • nfsd_open operation: this will be the most complex of the set. It’ll need to do several things semi-atomically (at least with respect to other nfsd threads):
    • check the grace period and determine whether the file is allowed to be opened
    • (optionally) check share reservations that are being requested and that are already set, and determine whether the file is allowed to be opened
    • open the file (assuming the above checks passed)
    • set the share reservation, if any (we may eventually combine this with the actual open if we ever get VFS support for share mode reservations)
    • get a lease/delegation (if needed)
    • …and hand the result back to nfsd
  • an nfsd_lock operation (for F_SETLK/F_GETLK):
    • check the grace period and see if we’re allowed to set a new lock or reclaim an old one
    • acquire or release the lock
  • an overridable grace_disallows_io to handle mandatory locks. This will prevent stateid-based ops from performing I/O during the grace period if necessary.
  • nfsd_remove and nfsd_rename ops: to allow blocking of metadata operations during the grace period
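
To make the shape of this concrete, here’s a rough sketch of what such an ops table might look like, along with a default single-node grace check. Every name here (struct nfsd_grace_ops, the hooks, the grace flag) is hypothetical, an illustration rather than an actual kernel interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vtable of grace/recovery hooks that a clustered
 * filesystem could hang off its export operations. All names here
 * are illustrative, not actual kernel interfaces. */
struct nfsd_grace_ops {
	/* grace check + share reservation check + open + lease, in one call */
	int  (*open)(const char *path, int flags, bool reclaim);
	int  (*lock)(const char *path, int cmd, bool reclaim);
	bool (*grace_disallows_io)(void);
	int  (*remove)(const char *path);
	int  (*rename)(const char *from, const char *to);
};

/* Default single-node behavior: a simple global grace flag. */
static bool in_grace = true;

static bool default_grace_disallows_io(void)
{
	return in_grace;
}

/* Non-reclaim opens are refused while in grace; reclaims are
 * refused once grace has ended. */
static int default_open(const char *path, int flags, bool reclaim)
{
	(void)path; (void)flags;
	if (reclaim != in_grace)
		return -1;	/* would map to NFS4ERR_GRACE or NFS4ERR_NO_GRACE */
	return 0;
}
```

A clustered filesystem would supply its own versions of these hooks that consult cluster-wide state instead of a per-node flag.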

That’s a start, anyway. The actual API will need to be determined. In the near term, I plan to do some reshuffling of the code to make it easier to break this stuff out later. Once those pieces have a more sharply defined box around them, it should just be a matter of turning them into a set of operations that filesystems can override in some fashion.

Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery (cold-start case)

In my last post, I talked about POSIX lock recovery in this configuration once the cluster is up and running. The assumption there is that the lock manager (DLM in the case of GFS2) has a record of the lock that we want the client to be able to reclaim.

We do, however, also need to consider the case where the cluster is being cold-started. At that point, no node in the cluster will have any active locks. How do we know whether a client is allowed to reclaim locks at that point? To illustrate the problem, consider the following (pathological) case:

A client has mounted a filesystem from two different cluster nodes, for instance:

node1:/export mounted on /mnt1
node2:/export mounted on /mnt2

The client takes locks under both mountpoints, and then loses contact with node2. Eventually the client’s lease expires on node2 and that node releases all of the locks that the client held on it. The entire cluster is then rebooted, and when it comes back up, the client mounts both filesystems from node1 and attempts to reclaim all the locks under /mnt1 and /mnt2.

How does node1 know which locks are allowed to be reclaimed? Clearly, everything under /mnt1 should be allowed and nothing under /mnt2, but the cluster doesn’t currently store reboot recovery info at this level of detail. Right now, it only stores information about clients themselves, but nothing about the resources they hold.

After spending some time working through different scenarios, I’ve come up with the following “rule”:

The flexibility of client configurations is directly proportional to the granularity of lock information tracked on stable storage.

In other words: the more information we keep about active locks on stable storage, the better able we are to cope with different client configurations. In real terms, that means we have some choices, each with its own implications:

  1. We could avoid tracking any information at all on stable storage. We’d just return NFS4ERR_NO_GRACE on any lock reclaim request that didn’t match an existing lock. On cold-boot of the cluster, there would be no lock reclaim. This is very easy to implement, but clearly non-optimal. I don’t think we’ll find this acceptable.
  2. We could just track information about the clients themselves on stable storage. In particular, we’d just track the client name “strings” that are sent in the SETCLIENTID or EXCHANGE_ID call. This is basically what knfsd does today, and what my proposed nfsdcld daemon does as well. The problem is that it can’t handle the situation I outlined above. In order for this to work, we’d have to insist that clients can only mount from one node in the cluster at a time.
  3. We could track client name strings along with the destination IP address for each call. If a SETCLIENTID call comes in then we’d track a tuple of information that includes the client name string and the server address where the call came in. With this, we could reasonably support a “floating address” cluster where clients can be moved between servers by moving the server’s address to another cluster node. This might involve a hack of some sort to fix the server’s ID string to be destination address specific for v4.1. The downside here is that this would prevent clients from being able to reclaim locks against a different address from which they were acquired. That means that NFSv4 replication and migration using fs_locations info would not work and that reduces flexibility. It might also be difficult or impossible to integrate this configuration with pNFS serving if that ever becomes desirable.
  4. We could track client name strings along with the fsid of the filesystem where the resources are held. This would involve changing when we upcall/downcall to store this information. It would have to happen whenever we acquire our first resource on a fs and when the last resource is released from a fs. With this, we could reasonably allow for floating addresses and v4 fs_locations based migration/replication provided that the clients don’t mount the same filesystem off of more than one node at a time. In other words, the scenario described earlier would be unsupportable. Worse, nothing would prevent someone from doing that (other than documentation), so there is the potential for people to shoot themselves in the foot. They may not even realize it since it would basically work until they hit a network partition + cold boot scenario.
  5. We could track information about every lock on stable storage. This has the most performance impact, but it allows for complete flexibility in client configurations. Clients could migrate freely between hosts using any mechanism. There would also be no NFS4ERR_GRACE returns to clients. Clients would be able to make new lock requests as soon as the cluster starts. For this, we’d probably do something a bit different than the previous mechanisms. We’d probably want to integrate this tracking with the userspace lock manager (DLM or equivalent). What I’d envision is that DLM would keep info on stable storage about what locks it holds. When the cluster is started, we’d use that info to “preload” the reclaimable locks into DLM. Then as the clients come up, they’d take ownership of them. Any that aren’t reclaimed during an initial grace period would be released. Then we’d just enable NFS serving and start letting clients reclaim those locks. This simplifies the knfsd piece, but adds quite a bit of complexity to DLM.

Those are the possibilities I see for doing this. While it sounds a bit tougher to do, I’m starting to warm up to #5. It seems like the best option for future flexibility. It will make locking performance even slower than it is today, however, so we’ll need to weigh carefully whether that’s a reasonable trade-off.
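
To make the granularity trade-off concrete, here is a minimal sketch of the record shapes that options 2 through 4 imply, plus the reclaim check that option 4’s records would allow on cold boot. All of the struct and field names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative stable-storage record shapes for options 2-4 above.
 * Names are hypothetical. */
struct rec_client      { char name[128]; };                  /* option 2 */
struct rec_client_addr { char name[128]; char srv_ip[64]; }; /* option 3 */
struct rec_client_fsid { char name[128]; char fsid[32]; };   /* option 4 */

/* Under option 4, a cold-boot reclaim is permitted only if a
 * (client, fsid) record was on stable storage before the restart. */
static bool reclaim_allowed(const struct rec_client_fsid *db, int n,
			    const char *client, const char *fsid)
{
	for (int i = 0; i < n; i++)
		if (!strcmp(db[i].name, client) && !strcmp(db[i].fsid, fsid))
			return true;
	return false;
}
```

In the pathological scenario above, a per-client record (option 2) can’t distinguish /mnt1 locks from /mnt2 locks, while a (client, fsid) record at least narrows it to the filesystem level.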

Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery

In the previous blog posts, we’ve talked about clientid tracking. I think we’ve more or less got a scheme that will work there and that is extensible for the clustered case. I’ve proposed some patches for the Linux kernel and nfs-utils to move that along. It’ll probably need a little more work, but the concept is basically sound.

The next thing that we have to tackle is how to manage locking in the cluster, in particular lock recovery. A key goal here is to avoid the need to put nodes into grace on failover events. In order to do that, we have to ensure that DLM can recognize locks acquired by knfsd, and have it not immediately release them if the node that acquired them failed. Here’s what I envision:

When knfsd acquires a lock, it will set an FL_RECLAIMABLE flag in the lock request. DLM will then set a corresponding flag on the plock — DLM_PLOCK_FL_RECLAIMABLE. On an actual lock reclaim from knfsd, it will set an FL_RECLAIM flag on the lock request. DLM can use that to determine whether this is a reclaim request and handle it differently from a new lock request. If it’s a reclaim request, then it can find the matching lock in need of reclaim and transfer ownership of that lock to the new node. This may involve splitting the existing lock.

If a cluster node goes down, DLM will not immediately release all of the node’s plocks. It will instead wait for a period of time on the order of the grace period for them to be reclaimed by the clients. At the end of that period, any unreclaimed locks will be released (the mechanism for this is still TBD). This is a fairly simple scheme, but should allow the cluster to continue granting normal locks and opens without worrying about whether we’re handing out a lock that might conflict with a reclaim request.
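
Here’s a minimal sketch of the reclaim-versus-new-lock handling described above. The flag names and structures are illustrative stand-ins for the idea, not the real DLM plock interface:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flags mirroring the FL_RECLAIMABLE/FL_RECLAIM idea;
 * the names and values here are hypothetical. */
#define PL_FL_RECLAIMABLE 0x1	/* lock survives node death for a while */
#define PL_FL_RECLAIM     0x2	/* request is a reclaim, not a new lock */

struct plock {
	int	flags;
	int	owner_node;	/* node currently holding the lock */
	bool	orphaned;	/* holder died; waiting to be reclaimed */
};

/* Returns true if the request is granted. On reclaim, find the
 * orphaned lock and transfer ownership to the new node instead of
 * treating the request as a conflicting new lock. */
static bool handle_lock_request(struct plock *lk, int req_flags, int node)
{
	if (req_flags & PL_FL_RECLAIM) {
		if (!lk->orphaned || !(lk->flags & PL_FL_RECLAIMABLE))
			return false;	/* nothing to reclaim */
		lk->owner_node = node;	/* transfer ownership */
		lk->orphaned = false;
		return true;
	}
	/* A new (non-reclaim) request conflicts with an orphaned,
	 * reclaimable lock until the reclaim window lapses. */
	return !lk->orphaned;
}
```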

There are still potential problems here, however, relating to network partitions. Consider a case where client1 has lost connectivity to the server. The server waits for a bit for client1 to reclaim the lock, but eventually gives up, purges the state from client1, and hands the lock to client2. Eventually connectivity is reestablished and client1 reclaims the lock, effectively “stealing” it from client2.

In order to prevent this sort of problem, I believe we’ll need to be a bit more strict on reclaim. We need to ensure that the lock being reclaimed is in fact held by the correct client. What I think we’ll need to do in practice is have the new client ID tracking daemon (nfsdcld) pass back an “index” for each client name being tracked. We can put that index in the op->info.owner field and use it to verify that the correct client is actually doing the reclaim. Assuming that the indices are not recycled too quickly, that should be sufficient to identify that the correct client is reclaiming the lock.
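
A sketch of that check, with hypothetical names: the owner index stored when the lock was originally acquired must match the index of the client attempting the reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock record carrying the client index that nfsdcld
 * handed back when the lock was first acquired. */
struct plock_rec {
	long owner;	/* client index stored at original acquisition */
};

/* A reclaim is only valid when the reclaiming client's index matches
 * the one recorded on the lock. */
static bool reclaim_is_valid(const struct plock_rec *lk, long client_index)
{
	return lk->owner == client_index;
}
```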

There are probably other solutions as well, but this one seems relatively simple to implement.

Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking (part 2)

In an earlier post, I outlined a proposed overhaul of the scheme that knfsd uses to track client names on stable storage. There’s another aspect we need to consider as well. First, some background…

There are actually two kinds of client “ID” that we need to consider:

  • The long, client-generated string that (supposedly) uniquely identifies the client. This object is referred to in RFC3530 as an nfs_client_id4. This is the thing that the client tracking daemon will be tracking. Henceforth, I’ll call this the “client name”. This value is generally just used to acquire an ID of the other type.
  • The shorthand (64-bit) value that the server hands out to the client. This is referred to as a clientid4 in the RFC. This is the value that’s generally used for “day to day” stateful operations. We’ll call this the “client ID”.

Currently, the server generates client IDs by combining a 32-bit “boot time” value with a 32-bit counter. This is generally fine in a single-node configuration; it’s highly unlikely you’d ever get a client ID collision. This is important because when a call comes in with a particular client ID, we need to be able to use it to (quickly) find the actual client tracking object in the server.
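
The current scheme can be sketched roughly like this (an illustration, not the exact kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Rough sketch of the single-node scheme: a 64-bit client ID built
 * from a 32-bit boot time in the high word and a 32-bit counter in
 * the low word. */
static uint64_t make_clientid(uint32_t boot_time, uint32_t *counter)
{
	return ((uint64_t)boot_time << 32) | (*counter)++;
}
```

The counter guarantees uniqueness within one server boot, but two nodes that boot in the same second start from identical values, which is exactly the collision described below.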

In a cluster, however, this scheme is probably not going to be sufficient. If two cluster hosts boot at exactly the same second, then they could hand out the same client ID. This is a big problem. Consider the following example:

Cluster hosts foo and bar are clustered NFS servers. They boot at the same time and hand out the same client ID to two different clients. Now, an address floats from foo to bar. The client that was originally talking to foo then does an operation against bar that requires the client ID. bar sees that client ID and confuses it with the one that it originally handed out.

There are a few different ways to avoid this:

  1. We can try to ensure that clientid collisions like this never happen. Sounds simple, but is difficult in practice.
  2. Instead of identifying clients just by client ID, we can use a client ID + server address tuple.
  3. We can containerize nfsd and ensure that each address is started within its own container. Presumably a containerized nfsd will have its own list of clients, and a collision is no big deal.
  4. We can consider abandoning the floating address model altogether and rely on clients using RR DNS and fs_locations info to find the servers. If the client knows that it must reclaim state before it ever talks to the failover host, then collisions aren’t a problem.

Containerizing sounds like the best approach, but perhaps abandoning floating addresses for this would be easier?

Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking

There are a number of problems that can occur when server reboots coincide with a loss of communication between the client and server. A couple of possible cases are outlined in Section 8.6.3 of RFC 3530. In order to prevent these problems, the server must track some information about its clients on stable storage. The Linux server currently implements the minimal scheme needed to prevent this:

  1. Whenever the client does its first non-reclaim OPEN, OPEN_CONFIRM or (for v4.1 clients) RECLAIM_COMPLETE, the server must store information about that client’s opaque ID string.
  2. When the client needs to reclaim during the grace period, the server must verify that there is a record of the client on stable storage.
  3. When the clientid expires, we MAY remove that record.
  4. When the grace period ends, any client records that were not reclaimed during the current grace period must be removed.
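
Here’s a minimal sketch of that record lifecycle, with hypothetical names: a record is created on the client’s first open, checked on reclaim, and purged at the end of grace if it went unreclaimed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical in-memory view of the stable-storage records: one per
 * client name, marked when the client reclaims during this grace
 * period. */
struct client_rec {
	char name[128];
	bool reclaimed;	/* reclaimed during the current grace period? */
	bool present;
};

/* Step 1: store a record on the client's first non-reclaim open. */
static void record_client(struct client_rec *r, const char *name)
{
	strncpy(r->name, name, sizeof(r->name) - 1);
	r->name[sizeof(r->name) - 1] = '\0';
	r->present = true;
	r->reclaimed = true;	/* counts for the current grace period */
}

/* Step 2: a reclaim is only honored if a record exists. */
static bool check_reclaim(struct client_rec *r, const char *name)
{
	if (r->present && !strcmp(r->name, name)) {
		r->reclaimed = true;
		return true;
	}
	return false;	/* no stable-storage record: refuse reclaim */
}

/* Step 4: at the end of grace, purge records that went unreclaimed. */
static void grace_done(struct client_rec *recs, int n)
{
	for (int i = 0; i < n; i++)
		if (recs[i].present && !recs[i].reclaimed)
			recs[i].present = false;	/* purge stale record */
}
```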

Linux knfsd does this currently by tracking this information on disk, usually in /var/lib/nfs/v4recovery. Note that there is no daemon for doing this. knfsd creates and removes directories in this directory in order to track this using vfs_* calls from within the kernel. This seems a bit odd and somewhat alarming until you realize that knfsd manipulates filesystems itself all the time in the course of its normal duties. Still, changing this has been something on the to-do list for a long time.

Now, though, we have a clear need to change this. If an address floats to a different server (or there’s a migration event or something), then we still need to track the client in case it needs to reclaim its state. The clientid database must be shared among all the hosts in the cluster.

One thing to bear in mind is that clients do not call SETCLIENTID against a server, but rather against an address. There is no reliable mechanism for a client to tell that a server is multi-homed, so it must treat each address as if it were an independent server. By the same token, when the address floats to a new server, we need to move any stateful information that’s associated with that address to the new server as well.

So, how do we fix this…

The first step toward fixing this is to remove the code that does this tracking today and replace it with a more standard upcall + daemon to track this information. Since we have a large installed base of servers using the old mechanism, we will also need a transition mechanism. Currently, I’m considering just moving the older code inside of something like this:
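
A hedged sketch of that wrapper (the config symbol and function names here are purely illustrative, not the actual kernel code):

```c
#include <assert.h>

/* Illustrative stand-ins for the two tracking mechanisms. */
static int nfsd4_legacy_tracking_init(void) { return 1; } /* old in-kernel code */
static int nfsd4_cld_tracking_init(void)    { return 2; } /* new upcall + daemon */

/* The legacy code stays compiled in only when a (hypothetical)
 * CONFIG_NFSD_LEGACY_CLIENT_TRACKING option is set. */
static int client_tracking_init(void)
{
#ifdef CONFIG_NFSD_LEGACY_CLIENT_TRACKING
	return nfsd4_legacy_tracking_init();
#else
	return nfsd4_cld_tracking_init();
#endif
}
```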


…and adding a new Kconfig option to switch between the two.

Once that’s settled, we’ll need to add the upcall and daemon. I have some prototype code in progress for this, and it’s fairly straightforward. The upcall uses a new pipe in rpc_pipefs, and the daemon tracks the info using a sqlite DB in a standard location.

Doing this in the single-host case is relatively easy, and most of the functions are identical in the clustered case as well. The kicker is the last element of the tracking scheme, where we clean up at the end of the grace period. Because this involves the grace period (which must be coordinated among all the cluster hosts), we have to do something more elaborate for a clustered setup. We’ll need to ensure that only the first (or last?) host to come out of grace does the “purge” at the end of the grace period in this situation.

This may mean that some sort of quorum/election cluster-fu will be involved here. Ideally, we’ll just be able to extend the single-host daemon to be cluster-aware when needed. Worst case, we’ll need to design an entirely separate daemon for it, but I hope that won’t be necessary.

Active/Active NFS on Clustered Filesystems: Lock Recovery and Grace Period

In an earlier blog post, I described an Active/Active NFSv4 cluster from a high-level standpoint. The difficulty, though, comes into play when we consider failover scenarios. In this post, I’ll try to describe the requirements for state handling and recovery.

When an NFSv4 server reboots, the clients must reclaim their state once it comes back up. First a client must re-establish its clientid, and then reclaim any locks and open files that it previously held. The server grants a particular period after rebooting for this to occur — this period is called the grace period, and its length is generally a function of the client lease time. While we want to eliminate the grace period on failover, we will still need to deal with it for cluster-wide restarts.

Typically, we consider the grace period to be a property of the server. With a clustered filesystem, we must consider the grace period to be a property of the cluster as a whole. Why? The basic rule we must adhere to is that once any server in the cluster grants a non-recovery lock or open, the cluster can allow no more recovery. To do anything else might mean granting a lock that conflicts with one that has not yet been reclaimed. One way to implement this would be to mandate that once one server exits the grace period, the other servers must not allow any more recovery either.
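
A tiny sketch of that rule (names hypothetical): the cluster remains in grace, and reclaims remain allowed, only while every node is still in its grace period.

```c
#include <assert.h>
#include <stdbool.h>

/* The cluster-wide grace rule: recovery is allowed only while no node
 * has exited grace, i.e. no node has granted a non-recovery lock or
 * open yet. */
static bool cluster_in_grace(const bool node_in_grace[], int n)
{
	for (int i = 0; i < n; i++)
		if (!node_in_grace[i])
			return false;	/* someone exited: no more recovery */
	return true;
}
```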

Also, it’s necessary for servers to track some information about the clients on stable storage to guard against edge cases that can occur when reboots and network partitions coincide. Once the first node exits the grace period, it must ensure that any client id records that were not reclaimed during the current grace period are invalidated.

So, those are our basic design constraints for the state recovery subsystem. We will need some sort of cluster-aware userspace component to do this as it can’t reasonably be done in the kernel alone. It will need to talk to the kernel however, so we need to determine what sort of kernel/userspace interface makes the most sense for this.

For a first pass, it may be simpler to just have the cluster return NFS4ERR_NO_GRACE on any reclaim that doesn’t match an existing lock that DLM is aware of. The upshot is that there would be no lock reclaim on a cluster-wide restart. Eventually, though, we’ll need to deal with the cluster-wide restart case, so we need to ensure that any initial design doesn’t hamstring that ability.