Monthly Archives: January 2012

Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery (cold-start case)

In my last post, I talked about POSIX lock recovery in this configuration once the cluster is up and running. The assumption there is that the lock manager (DLM in the case of GFS2) has a record of the lock that we want the client to be able to reclaim.

We do however, also need to consider the case where the cluster is being cold-started. At that point, no node in the cluster will have any active locks. How do we know whether a client is allowed to reclaim locks at that point? To illustrate the problem, consider the following (pathological) case:

A client has mounted a filesystem from two different cluster nodes, for instance:

node1:/export mounted on /mnt1
node2:/export mounted on /mnt2

The client takes locks under both mountpoints, and then loses contact with node2. Eventually the client’s lease expires on node2 and that node releases all of the locks that the client held on it. The entire cluster is then rebooted, and when it comes back up, the client mounts both filesystems from node1 and attempts to reclaim all the locks under /mnt1 and /mnt2.

How does node1 know which locks are allowed to be reclaimed? Clearly, everything under /mnt1 should be allowed and nothing under /mnt2, but the cluster doesn’t currently store reboot recovery info at this level of detail. Right now, it only stores information about clients themselves, but nothing about the resources they hold.

After spending some time working through different scenarios, I’ve come up with the following “rule”:

The flexibility of client configurations is directly proportional to the granularity of lock information tracked on stable storage.

In other words: the more information that we keep about active locks on stable storage, the better able we are to cope with different client configurations. In real terms, that means that we have some choices, and implications for those choices:

  1. We could avoid tracking any information at all on stable storage. We’d just return NFS4ERR_NO_GRACE on any lock reclaim request that didn’t match an existing lock. On cold-boot of the cluster, there would be no lock reclaim. This is very easy to implement, but clearly non-optimal. I don’t think we’ll find this acceptable.
  2. We could just track information about the clients themselves on stable storage. In particular, we’d just track the client name “strings” that are sent in the SETCLIENTID or EXCHANGE_ID call. This is basically what knfsd does today and what my proposed nfsdcld daemon does today. The problem here is that it can’t handle the above situation I outlined. In order for this to work, we’d have to insist that clients can only mount from one node in the cluster at a time.
  3. We could track client name strings along with the destination IP address for each call. If a SETCLIENTID call comes in then we’d track a tuple of information that includes the client name string and the server address where the call came in. With this, we could reasonably support a “floating address” cluster where clients can be moved between servers by moving the server’s address to another cluster node. This might involve a hack of some sort to fix the server’s ID string to be destination address specific for v4.1. The downside here is that this would prevent clients from being able to reclaim locks against a different address from which they were acquired. That means that NFSv4 replication and migration using fs_locations info would not work and that reduces flexibility. It might also be difficult or impossible to integrate this configuration with pNFS serving if that ever becomes desirable.
  4. We could track client name strings along with the fsid of the filesystem where the resources are held. This would involve changing when we upcall/downcall to store this information. It would have to happen whenever we acquire our first resource on a fs and when the last resource is released from a fs. With this, we could reasonably allow for floating addresses and v4 fs_locations based migration/replication provided that the clients don’t mount the same filesystem off of more than one node at a time. In other words, the scenario described earlier would be unsupportable. Worse, nothing would prevent someone from doing that (other than documentation), so there is the potential for people to shoot themselves in the foot. They may not even realize it since it would basically work until they hit a network partition + cold boot scenario.
  5. We could track information about every lock on stable storage. This has the most performance impact, but it allows for complete flexibility in client configurations. Clients could migrate freely between hosts using any mechanism. There would also be no NFS4ERR_GRACE returns to clients. Clients would be able to make new lock requests as soon as the cluster starts. For this, we’d probably do something a bit different than the previous mechanisms. We’d probably want to integrate this tracking with the userspace lock manager (DLM or equivalent). What I’d envision is that DLM would keep info on stable storage about what locks it holds. When the cluster is started, we’d use that info to “preload” the reclaimable locks into DLM. Then as the clients come up, they’d take ownership of them. Any that aren’t reclaimed during an initial grace period would be released. Then we’d just enable NFS serving and start letting clients reclaim those locks. This simplifies the knfsd piece, but adds quite a bit of complexity to DLM.

Those are the possibilities I see for doing this. While it sounds a bit tougher to do, I’m starting to warm up to #5. It seems like the best option for future flexibility. It will make locking performance even slower than it is today however, so we’ll need to weigh carefully whether that’s reasonable.


Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery

In the previous blog posts, we’ve talked about clientid tracking. I think we’ve more or less got a scheme that will work there and that is extendible for the clustered case. I’ve proposed some patches for the Linux kernel and nfs-utils to move that along. It’ll probably need a little more work, but the concept is basically sound.

The next thing that we have to tackle is how to manage locking in the cluster, in particular lock recovery. A key goal here is to avoid the need to put nodes into grace on failover events. In order to do that, we have to ensure that DLM can recognize locks acquired by knfsd, and have it not immediately release them if the node that acquired them failed. Here’s what I envision:

When knfsd acquires a lock, it will set an FL_RECLAIMABLE flag in the lock request. DLM will then set a corresponding flag on the plock — DLM_PLOCK_FL_RECLAIMABLE. On an actual lock reclaim from knfsd, it will set a FL_RECLAIM flag on the lock request. DLM can use that to determine if this is a reclaim request and handle it differently than a new lock request. If it’s reclaim request, then it can find the matching lock in need of reclaim, and then transfer the ownership of that lock to the new node. This may involve splitting the existing lock.

If a cluster node goes down, DLM will not immediately release all of the node’s plocks. It will instead wait for a period of time on the order of the grace period for them to be reclaimed by the clients. At the end of that period, any unreclaimed locks will be released (the mechanism for this is still TBD). This is a fairly simple scheme, but should allow the cluster to continue granting normal locks and opens without worrying about whether we’re handing out a lock that might conflict with a reclaim request.

There are still potential problems here, however relating to network partitions. Consider a case where client1 has lost connectivity to the server. The server waits for a bit for client1 to reclaim the lock, but eventually gives up and purges the state from client1, and hands the lock to client2. Eventually connectivity is reestablished and the client1 reclaims the lock, effectively “stealing” it from client2.

In order to prevent this sort of problem, I believe we’ll need to be a bit more strict on reclaim. We need to ensure that the lock being reclaimed is in fact being held by the correct client. What I think we’ll need to do in practice is have the new client ID tracking daemon (nfsdcld) pass back an “index” of the client name being tracked. We can put that index in the op->info.owner field and use it to verify the the correct client is actually doing the reclaim. Assuming that they are not recycled too quickly, that should be sufficient to identify that the correct client is reclaiming the lock.

There are probably other solutions as well, but this one seems relatively simple to implement.