Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery (cold-start case)

In my last post, I talked about POSIX lock recovery in this configuration once the cluster is up and running. The assumption there is that the lock manager (DLM in the case of GFS2) has a record of the lock that we want the client to be able to reclaim.

We do, however, also need to consider the case where the cluster is being cold-started. At that point, no node in the cluster will have any active locks, so how do we know whether a client is allowed to reclaim locks? To illustrate the problem, consider the following (pathological) case:

A client has mounted a filesystem from two different cluster nodes, for instance:

node1:/export mounted on /mnt1
node2:/export mounted on /mnt2

The client takes locks under both mountpoints, and then loses contact with node2. Eventually the client’s lease expires on node2 and that node releases all of the locks that the client held on it. The entire cluster is then rebooted, and when it comes back up, the client mounts both filesystems from node1 and attempts to reclaim all the locks under /mnt1 and /mnt2.

How does node1 know which locks are allowed to be reclaimed? Clearly, everything under /mnt1 should be allowed and nothing under /mnt2, but the cluster doesn’t currently store reboot recovery info at this level of detail. Right now, it only stores information about clients themselves, but nothing about the resources they hold.

After spending some time working through different scenarios, I’ve come up with the following “rule”:

The flexibility of client configurations is directly proportional to the granularity of lock information tracked on stable storage.

In other words: the more information we keep about active locks on stable storage, the better we can cope with different client configurations. In practical terms, that gives us several choices, each with its own implications (a rough sketch of the corresponding stable-storage records follows the list):

  1. We could avoid tracking any information at all on stable storage. We’d just return NFS4ERR_NO_GRACE on any lock reclaim request that didn’t match an existing lock. On cold-boot of the cluster, there would be no lock reclaim. This is very easy to implement, but clearly non-optimal. I don’t think we’ll find this acceptable.
  2. We could just track information about the clients themselves on stable storage. In particular, we’d just track the client name “strings” that are sent in the SETCLIENTID or EXCHANGE_ID call. This is basically what knfsd does today, and what my proposed nfsdcld daemon does as well. The problem is that it can’t handle the situation I outlined above. In order for this to work, we’d have to insist that clients can only mount from one node in the cluster at a time.
  3. We could track client name strings along with the destination IP address for each call. When a SETCLIENTID call comes in, we’d record a tuple of the client name string and the server address on which the call arrived. With this, we could reasonably support a “floating address” cluster where clients can be moved between servers by moving the server’s address to another cluster node. This might involve a hack of some sort to make the server’s ID string destination-address-specific for v4.1. The downside is that clients would not be able to reclaim locks against a different address from the one on which they were acquired. That means NFSv4 replication and migration using fs_locations info would not work, which reduces flexibility. It might also be difficult or impossible to integrate this configuration with pNFS serving if that ever becomes desirable.
  4. We could track client name strings along with the fsid of the filesystem where the resources are held. This would involve changing when we upcall/downcall to store the information: it would have to happen whenever we acquire our first resource on a fs and whenever the last resource on that fs is released. With this, we could reasonably allow for floating addresses and v4 fs_locations-based migration/replication, provided that clients don’t mount the same filesystem off of more than one node at a time. In other words, the scenario described earlier would be unsupportable. Worse, nothing would prevent someone from doing that (other than documentation), so there is the potential for people to shoot themselves in the foot. They may not even realize it, since it would basically work until they hit a network partition + cold boot scenario.
  5. We could track information about every lock on stable storage. This has the largest performance impact, but it allows for complete flexibility in client configurations. Clients could migrate freely between hosts using any mechanism, and there would be no NFS4ERR_GRACE returns to clients: they would be able to make new lock requests as soon as the cluster starts. For this, we’d probably do something a bit different from the previous mechanisms and integrate the tracking with the userspace lock manager (DLM or equivalent). What I’d envision is that DLM would keep info on stable storage about what locks it holds. When the cluster is started, we’d use that info to “preload” the reclaimable locks into DLM, then enable NFS serving and let clients reclaim and take ownership of those locks as they come up; any that aren’t reclaimed during an initial grace period would be released. This simplifies the knfsd piece, but adds quite a bit of complexity to DLM.
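
To make that trade-off concrete, here is a rough sketch in C of what the stable-storage records for options 2 through 4 might look like. This is not the actual nfsdcld or DLM on-disk format; the struct names, field sizes, and the fsid representation are just assumptions for illustration:

    /* Illustrative record layouts only -- not the real nfsdcld/DLM formats */
    #include <stdint.h>
    #include <sys/socket.h>           /* struct sockaddr_storage */

    #define CLIENT_NAME_MAX 1024      /* assumed cap on the client name string */

    /* Option 2: just the client name string from SETCLIENTID/EXCHANGE_ID */
    struct rec_client {
            uint16_t namelen;
            char     name[CLIENT_NAME_MAX];
    };

    /* Option 3: also record the server (destination) address the call arrived
     * on, so reclaims are only honored against that same address */
    struct rec_client_addr {
            struct rec_client       client;
            struct sockaddr_storage dst;
    };

    /* Option 4: one record per (client, fs) pair, written when the client
     * acquires its first resource on the fs and removed when its last
     * resource there is released */
    struct rec_client_fs {
            struct rec_client client;
            uint64_t          fsid_major;
            uint64_t          fsid_minor;
    };

Option 5 would go a step further and record each lock itself (owner, file, byte range, type), most likely under DLM’s control rather than knfsd’s. The further down the list you go, the more has to be written to stable storage on each state change, which is exactly the trade-off the “rule” above describes.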

Those are the possibilities I see for doing this. While it sounds a bit tougher to do, I’m starting to warm up to #5; it seems like the best option for future flexibility. It will make locking performance even slower than it is today, however, so we’ll need to weigh carefully whether that’s reasonable.

4 thoughts on “Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery (cold-start case)”

  1. Bruce Fields

    “It might also be difficult or impossible to integrate this configuration with pNFS serving if that ever becomes desirable.”

    pNFS shouldn’t make much difference: with pNFS you still mount a single server, and basically everything except READ and WRITE goes to that server (so it’s responsible for all your state, including locks).

    I’m still skeptical about the state of the v4 migration/replication protocol, so I’m a little down on #5.

    1. jtlayton Post author

      #5 does add some simplicity to the design though…

      It means we don’t really need to worry about the grace period at all, other than to time out a set of unreclaimed locks. That’s something we’ll need to be able to do anyway when a node fails. In this case, we’d just preload all of the locks as “orphans” and then set a timer that releases any that weren’t reclaimed.
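
      Just to sketch that timeout piece (purely illustrative C; the structures and the release_orphan() helper are made up, not real DLM interfaces):

          /* Illustrative only -- not real DLM code */
          #include <stdbool.h>
          #include <stdlib.h>
          #include <time.h>

          struct orphan_lock {
                  struct orphan_lock *next;
                  bool reclaimed;     /* set when a client reclaims this lock */
                  /* ...owner, file, byte range, lock type... */
          };

          /* Hypothetical helper: a real implementation would tell the lock
           * manager to drop the lock; here we just free the record */
          static void release_orphan(struct orphan_lock *ol)
          {
                  free(ol);
          }

          /* Once the grace period has expired, walk the preloaded orphan list
           * and drop anything that no client came back to reclaim */
          static void reap_unreclaimed(struct orphan_lock **head,
                                       time_t grace_start, time_t grace_len)
          {
                  struct orphan_lock **pp = head;

                  if (time(NULL) < grace_start + grace_len)
                          return;     /* still inside the grace period */

                  while (*pp) {
                          struct orphan_lock *ol = *pp;

                          if (ol->reclaimed) {
                                  pp = &ol->next;
                          } else {
                                  *pp = ol->next;      /* unlink... */
                                  release_orphan(ol);  /* ...and release it */
                          }
                  }
          }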

      It is bound to be slower for lock-heavy workloads, however.

  2. jujjuriJV

    If we have clientids on stable storage, your #2 must work, right? Node2 released all of its locks, and when the cluster comes back up it is a cold start. If the client comes back to node1, the clientid on stable storage allows it in during grace; all lock requests come with the ‘reclaim’ flag, and during grace these lock requests are allowed by the server and sent to the FS, and the FS grants those locks.

  3. Pingback: Some useful links for NFS Clustering and Failover | Bits and Bytes
