Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery

In the previous blog posts, we’ve talked about clientid tracking. I think we’ve more or less got a scheme that will work there and that is extensible to the clustered case. I’ve proposed some patches for the Linux kernel and nfs-utils to move that along. It’ll probably need a little more work, but the concept is basically sound.

The next thing that we have to tackle is how to manage locking in the cluster, in particular lock recovery. A key goal here is to avoid the need to put nodes into grace on failover events. In order to do that, we have to ensure that DLM can recognize locks acquired by knfsd, and have it not immediately release them if the node that acquired them failed. Here’s what I envision:

When knfsd acquires a lock, it will set an FL_RECLAIMABLE flag in the lock request. DLM will then set a corresponding flag on the plock — DLM_PLOCK_FL_RECLAIMABLE. On an actual lock reclaim from knfsd, it will set an FL_RECLAIM flag on the lock request. DLM can use that to determine whether this is a reclaim request and handle it differently than a new lock request. If it’s a reclaim request, DLM can find the matching lock in need of reclaim and transfer ownership of that lock to the new node. This may involve splitting the existing lock.
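
A very rough sketch of how the kernel side might translate those flags while building the plock request that gets handed up to dlm_controld. The flag names are the ones proposed above (plus a hypothetical DLM_PLOCK_FL_RECLAIM for the reclaim case), and the values are placeholders — none of this exists in the current headers:

    /* Placeholder values for the proposed flags -- illustrative only */
    #define FL_RECLAIMABLE           0x10000 /* knfsd: lock may be reclaimed later */
    #define FL_RECLAIM               0x20000 /* knfsd: this request is a reclaim */
    #define DLM_PLOCK_FL_RECLAIMABLE 0x02    /* plock survives owner-node death */
    #define DLM_PLOCK_FL_RECLAIM     0x04    /* plock request is a reclaim */

    /* Hypothetical helper called from dlm_posix_lock() while filling in
     * the dlm_plock_info that is passed up to userspace. */
    static void plock_set_reclaim_flags(struct dlm_plock_info *info,
                                        struct file_lock *fl)
    {
            if (fl->fl_flags & FL_RECLAIMABLE)
                    info->flags |= DLM_PLOCK_FL_RECLAIMABLE;

            if (fl->fl_flags & FL_RECLAIM)
                    info->flags |= DLM_PLOCK_FL_RECLAIM;
    }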

If a cluster node goes down, DLM will not immediately release all of the node’s plocks. It will instead wait for a period of time on the order of the grace period for them to be reclaimed by the clients. At the end of that period, any unreclaimed locks will be released (the mechanism for this is still TBD). This is a fairly simple scheme, but should allow the cluster to continue granting normal locks and opens without worrying about whether we’re handing out a lock that might conflict with a reclaim request.
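
One possible shape for that TBD mechanism (purely a sketch; the saved_plock type and all of the helpers below are made up) is a delayed work item that fires after a grace-period-length timeout and drops whatever was never reclaimed:

    /* Sketch only: struct saved_plock, mark_plocks_reclaimable(),
     * for_each_saved_plock() and release_saved_plock() are made-up
     * helpers, and the timeout is just illustrative. */
    #define RECLAIM_TIMEOUT (90 * HZ)

    static void reclaim_purge_fn(struct work_struct *work)
    {
            struct saved_plock *p, *tmp;

            /* drop anything from the dead node that was never reclaimed */
            for_each_saved_plock(p, tmp) {
                    if (p->info.flags & DLM_PLOCK_FL_RECLAIMABLE)
                            release_saved_plock(p);
            }
    }

    static DECLARE_DELAYED_WORK(reclaim_purge_work, reclaim_purge_fn);

    /* called when a cluster node is declared dead */
    static void plock_node_down(int nodeid)
    {
            mark_plocks_reclaimable(nodeid);
            schedule_delayed_work(&reclaim_purge_work, RECLAIM_TIMEOUT);
    }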

There are still potential problems here, however, relating to network partitions. Consider a case where client1 has lost connectivity to the server. The server waits for a bit for client1 to reclaim the lock, but eventually gives up, purges the state from client1, and hands the lock to client2. Eventually connectivity is reestablished and client1 reclaims the lock, effectively “stealing” it from client2.

In order to prevent this sort of problem, I believe we’ll need to be a bit more strict on reclaim. We need to ensure that the lock being reclaimed is in fact held by the correct client. What I think we’ll need to do in practice is have the new client ID tracking daemon (nfsdcld) pass back an “index” of the client name being tracked. We can put that index in the op->info.owner field and use it to verify that the correct client is actually doing the reclaim. Assuming that these indexes are not recycled too quickly, that should be sufficient to identify that the correct client is reclaiming the lock.
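
The reclaim-side check might then look something like this (a sketch only; find_reclaimable_plock() and transfer_plock() are made-up helpers, but info->owner is the field mentioned above):

    /* Sketch: the index that nfsdcld handed back at lock time was stashed
     * in info->owner; a reclaim is honored only if the reclaiming
     * client's index matches it. */
    static int plock_reclaim(struct dlm_plock_info *info, s64 client_index)
    {
            struct saved_plock *p;

            p = find_reclaimable_plock(info);   /* match by file/range */
            if (!p)
                    return -ENOENT;             /* nothing to reclaim */

            if (p->info.owner != client_index)
                    return -EPERM;              /* wrong client */

            /* hand the lock to the reclaiming node; may split the range */
            return transfer_plock(p, info);
    }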

There are probably other solutions as well, but this one seems relatively simple to implement.

8 thoughts on “Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery”

  1. Bruce Fields

    I’m not so sure about the possibility of stealing:

    In your example, the server shouldn’t give up on client 1 without also ending its grace period (hence refusing any more reclaims).

    The protocol assumes two things:
    1. That the server can keep track of which clients are allowed to perform reclaims at a given moment. (They must be clients of a node that is in grace, and must be in that node’s nfsdcld’s database.)
    2. That clients are well-behaved, in the sense that they will only attempt to reclaim locks that they successfully acquired before, didn’t drop, and reclaimed whenever they were aware it was necessary.

    Given both of those assumptions:

    Suppose two clients both reclaim the same lock.

    By #2, both must have successfully acquired that lock before, and both believe they still have it.

    The only way that can happen is if one of them missed a chance to renew its state or to reclaim the lock when it needed to (perhaps because it lost contact with the server for a while). In which case by #1 the server should have purged that client from nfsdcld’s database and/or left the grace period.

    That said: it would probably still be a good idea to keep track of which client owns a given lock, to aid debugging, and to improve robustness against broken clients.

  2. jtlayton Post author

    Stealing is a possibility since I’m planning to do the absolute bare minimum here for lock reclaims.

    The cluster would not track whether individual clients were in grace at any particular time. Doing that would be very difficult. The client could hold state on multiple cluster nodes at a given time. Trying to do that also brings up the sticky problem of how you track whether a client is in grace or not on stable storage, and how you deal with that in the face of reboots of individual cluster nodes or the entire cluster as a whole.

    What I envision is more “dumb”. Once the cluster as a whole is out of grace, clients would basically be allowed to reclaim locks at any time iff there is a record in nfsdcld for it at that time. In order for that to be OK however, we must ensure that the lock that the client is reclaiming is actually being held by the cluster on its behalf.

    This means that we need yet another “client id”. That client id must be invariant across server node reboots while the cluster holds any locks for that client. It does not however need to be persistent if the server nodes purge all state for the client and remove its record from the DB.

    As it turns out, I think this will be fairly simple to implement. I’m planning to add an autoincrement integer field to the SQL table that tracks client names, and have the “create” and “check” upcalls hand that value back to the kernel. SQLite’s autoincrement rowids are 64 bits, and that looks like it should fit neatly in DLM’s op->info.owner field.
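
    For illustration only (the table and column names here are hypothetical), the daemon side could be as simple as something like:

        /* Hypothetical nfsdcld-side sketch: table/column names are made up. */
        #include <sqlite3.h>
        #include <stdint.h>

        static const char *create_sql =
                "CREATE TABLE IF NOT EXISTS clients ("
                " id INTEGER PRIMARY KEY AUTOINCREMENT," /* 64-bit rowid alias */
                " clientid BLOB NOT NULL UNIQUE,"
                " time INTEGER);";

        static int create_client_table(sqlite3 *db)
        {
                return sqlite3_exec(db, create_sql, NULL, NULL, NULL);
        }

        /* The "create"/"check" upcall handlers would hand this value back
         * to the kernel, which stuffs it into op->info.owner. */
        static int64_t lookup_client_index(sqlite3 *db, const void *name, int len)
        {
                sqlite3_stmt *stmt;
                int64_t id = -1;

                if (sqlite3_prepare_v2(db,
                        "SELECT id FROM clients WHERE clientid = ?;",
                        -1, &stmt, NULL) != SQLITE_OK)
                        return -1;

                sqlite3_bind_blob(stmt, 1, name, len, SQLITE_STATIC);
                if (sqlite3_step(stmt) == SQLITE_ROW)
                        id = sqlite3_column_int64(stmt, 0);
                sqlite3_finalize(stmt);
                return id;
        }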

  3. Bruce Fields

    Instead of allowing clients with records in nfsdcld to reclaim at any time, you could allow them to reclaim only within that nfsd’s grace period, and I think that does the job.

    So, something like:

    Each “virtual” nfsd has a grace period. All grace periods are the same length. There is no cluster-wide idea of a grace period, it’s just something each vnfsd enforces on its own.

    During the grace period, reclaims (for recorded clients) are allowed if they match an existing lock, and do whatever’s needed to migrate the lock to the new node. Non-reclaims are allowed if they conflict with no lock.

    Outside the grace period, no reclaims are allowed. Non-reclaims are allowed if they conflict with no lock, or if they conflict with a lock that is “orphaned”.

    That does require knowing when a lock is “orphaned”.
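
    Roughly, each vnfsd’s per-request decision would then be something like the following (just a restatement of the rules above; every type and helper here is made up):

        /* Sketch only -- every type and helper here is hypothetical. */
        static bool allow_lock(struct vnfsd *v, struct lock_req *req)
        {
                if (req->reclaim)
                        /* reclaims: only inside this vnfsd's grace period,
                         * only for recorded clients, and only if they match
                         * an existing lock */
                        return !grace_expired(v) &&
                               client_is_recorded(req) &&
                               matches_existing_lock(req);

                /* non-reclaims: OK if no conflict, or (after grace) if the
                 * only conflict is with an orphaned lock */
                if (!conflicts_with_lock(req))
                        return true;
                return grace_expired(v) && conflicting_lock_is_orphaned(req);
        }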

  4. jtlayton Post author

    That seems like it would work, but I’m not sure that makes anything simpler…

    In addition to defining what an orphaned lock is, it involves the concept of a virtual nfsd. I’m a little leery of basing this on containers. Containerization is still vaporware at this point, and I think trying to use that will add a lot of complexity to the implementation.

    That scheme will also make it difficult to do migration-based load balancing later. The only way you’ll be able to shift clients to another server is to fail over the address. If we (hypothetically) came up with a way to issue NFS4ERR_MOVED selectively to certain clients, then when those clients failed over they wouldn’t be able to recover any locks.

    I think we might be able to coordinate grace periods by having the kernel upcall when it enters grace. The daemon would store a timestamp in the DB with the grace period start time; nodes that enter grace later would update it with a later one. On an end_grace upcall, each daemon would check the start time in the DB. If it is less than the lease period in the past, then it could wait until the lease period is up and then do the post-grace-period activities.
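
    The end_grace side of that might look roughly like this (a sketch; the DB helper and the lease length are placeholders):

        /* Sketch of the daemon's end_grace handling; db_get_grace_start()
         * is a hypothetical DB query and LEASE_PERIOD is illustrative. */
        #include <time.h>
        #include <unistd.h>

        #define LEASE_PERIOD 90        /* seconds */

        extern time_t db_get_grace_start(void);

        static void handle_end_grace(void)
        {
                time_t start = db_get_grace_start();
                time_t elapsed = time(NULL) - start;

                /* another node entered grace recently; wait out its lease
                 * before doing the post-grace-period cleanup */
                if (elapsed < LEASE_PERIOD)
                        sleep(LEASE_PERIOD - elapsed);

                /* ...post-grace-period activities here... */
        }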

    That does still require some mechanism to deal with nodes that enter the cluster after the grace period has expired however…

  5. Bruce Fields

    “Containerization is still vaporware at this point, and I think trying to use that will add a lot of complexity to the implementation.”

    Yes, I’d still like to be able to depend on containerization, but it looks harder than I hoped–bad judgement on my part.

    If we take the floating-ip approach, I think we end up doing something a lot like containerization regardless.

    OK, so you’re talking about using v4 protocol without floating ip’s, and (once the cold-start grace period is passed) allowing locks to be “stolen” whenever there’s a reclaim from the correct client. Got it (I think), sorry for being slow.

    That still does require some cleanup of “orphan” locks, in case a client dies on failover and never sends the reclaims. (And I think there are also some scenarios where a failed-over client could legitimately reclaim some of its locks, but not all, even without being buggy.)

  6. jtlayton Post author

    > “If we take the floating-ip approach, I think we end up doing something a lot like
    > containerization regardless.”

    Not necessarily. I think the scheme I’ve been outlining allows us to do a floating-IP-based implementation without needing any containerization. Granted, I’m still working out the details of the design, but I think that’s possible.

    > “OK, so you’re talking about using v4 protocol without floating ip’s, and (once the
    > cold-start grace period is passed) allowing locks to be “stolen” whenever there’s a reclaim
    > from the correct client.”

    That’s basically correct, but I think this scheme should work regardless of whether we use floating IPs or migration-based failover. I plan to use floating IPs initially, and then we can consider using migration-based failover later. Basically, I want to allow clients to migrate to another node whenever they like. How we force that migration to occur doesn’t really matter too much.

    > That still does require some cleanup of “orphan” locks, in case a client dies on failover and
    > never sends the reclaims.

    Correct. There are other scenarios too — the virtual IP might go down and just not come back, precluding any reclaim.

    This would be part of the “reengineering” of DLM to handle this. Today, when a node in the cluster dies, DLM releases all locks for that node. One of the things we’ll change is to have DLM hold on to those locks for a period of time so they can be reclaimed. Probably what we’ll do is have one of the DLM daemons clean up unreclaimed locks after a timer that’s roughly the length of the grace period pops.

    That’s all well and good when the node actually dies. On a planned failover of any sort that doesn’t take down the node, we’ll need some mechanism to clean up unreclaimed locks. I’m not yet sure what that mechanism will look like.

  7. jtlayton Post author

    Actually, now that I think about it, we may not need to do anything special to clean up unreclaimed locks in this case…

    When a client fails to renew its clientid after a certain period (2 or 3 lease periods?), we could just release all state associated with it and expire it. DLM would need to be smart enough to know which resources (plocks in particular) have been taken over by other hosts and not actually release them when the unlock comes in from the wrong cluster node. That should be sufficient though to prevent problems with orphaned locks…
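
    The check DLM would need when such an unlock arrives could be as simple as something like this (sketch only; find_saved_plock() and do_unlock() are made-up helpers):

        /* Sketch: if the plock was already reclaimed by (and transferred
         * to) another node, ignore the unlock that arrives from the node
         * that is expiring the client's state. */
        static int plock_unlock(struct dlm_plock_info *info, int from_nodeid)
        {
                struct saved_plock *p = find_saved_plock(info);

                if (!p)
                        return 0;       /* nothing to do */

                if (p->info.nodeid != from_nodeid)
                        return 0;       /* lock now owned elsewhere; ignore */

                return do_unlock(p, info);
        }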

