In the previous blog posts, we’ve talked about client ID tracking. I think we’ve more or less got a scheme that will work there and that is extensible to the clustered case. I’ve proposed some patches for the Linux kernel and nfs-utils to move that along. It’ll probably need a little more work, but the concept is basically sound.
The next thing that we have to tackle is how to manage locking in the cluster, in particular lock recovery. A key goal here is to avoid the need to put nodes into grace on failover events. In order to do that, we have to ensure that DLM can recognize locks acquired by knfsd, and have it not immediately release them if the node that acquired them fails. Here’s what I envision:
When knfsd acquires a lock, it will set an FL_RECLAIMABLE flag in the lock request. DLM will then set a corresponding flag on the plock: DLM_PLOCK_FL_RECLAIMABLE. On an actual lock reclaim, knfsd will set an FL_RECLAIM flag on the lock request. DLM can use that to determine whether this is a reclaim request and handle it differently from a new lock request. If it’s a reclaim request, then DLM can find the matching lock in need of reclaim and transfer ownership of that lock to the new node. This may involve splitting the existing lock.
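To make the reclaim path a bit more concrete, here is a rough toy model in Python. It is only a sketch of the idea, not DLM code: the `Plock` class, its fields, and the `reclaim` function are all hypothetical names invented for illustration, and the `reclaimable` field merely stands in for DLM_PLOCK_FL_RECLAIMABLE. It assumes byte ranges are inclusive and that a reclaim must fall entirely within one existing lock.

```python
from dataclasses import dataclass


@dataclass
class Plock:
    """Hypothetical, simplified view of a plock (not the real DLM structure)."""
    start: int
    end: int            # inclusive end of the byte range
    node: int           # cluster node that currently holds the lock
    owner: int          # lock owner on whose behalf knfsd took the lock
    reclaimable: bool   # stands in for DLM_PLOCK_FL_RECLAIMABLE


def reclaim(locks, start, end, owner, new_node):
    """Handle a request that carries the (assumed) FL_RECLAIM flag.

    Find a reclaimable lock with the same owner that covers [start, end]
    and move that range to new_node, splitting the old lock if the
    reclaimed range is smaller. Returns True if a lock was transferred.
    """
    for i, lk in enumerate(locks):
        covers = lk.start <= start and end <= lk.end
        if lk.reclaimable and lk.owner == owner and lk.node != new_node and covers:
            pieces = []
            if lk.start < start:
                # Leading remainder stays with the old node, awaiting reclaim.
                pieces.append(Plock(lk.start, start - 1, lk.node, owner, True))
            pieces.append(Plock(start, end, new_node, owner, True))
            if end < lk.end:
                # Trailing remainder stays with the old node as well.
                pieces.append(Plock(end + 1, lk.end, lk.node, owner, True))
            locks[i:i + 1] = pieces
            return True
    return False
```

For example, reclaiming bytes 10 through 19 out of a lock on bytes 0 through 99 leaves three locks: the middle piece on the new node and the two remainders still attributed to the failed node.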
If a cluster node goes down, DLM will not immediately release all of the node’s plocks. It will instead wait for a period of time on the order of the grace period for them to be reclaimed by the clients. At the end of that period, any unreclaimed locks will be released (the mechanism for this is still TBD). This is a fairly simple scheme, but should allow the cluster to continue granting normal locks and opens without worrying about whether we’re handing out a lock that might conflict with a reclaim request.
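Since the release mechanism is still undecided, the following is just one possible shape for it, modeled in Python. Everything here is an assumption made for illustration: the `PlockTable` class, the `node_down` and `expire` methods, the 90 second `RECLAIM_WINDOW`, and the choice to drop non-reclaimable locks right away while keeping reclaimable ones until a per-node deadline passes.

```python
from dataclasses import dataclass

RECLAIM_WINDOW = 90.0  # seconds; an assumed value on the order of a grace period


@dataclass
class Plock:
    """Hypothetical, simplified plock record."""
    node: int
    owner: int
    reclaimable: bool


class PlockTable:
    def __init__(self):
        self.locks = []
        self.deadlines = {}  # failed node id -> time when its locks are dropped

    def node_down(self, node, now):
        # Assumption: locks not flagged reclaimable are released at once,
        # while reclaimable ones are kept until the deadline.
        self.locks = [l for l in self.locks if l.node != node or l.reclaimable]
        self.deadlines[node] = now + RECLAIM_WINDOW

    def expire(self, now):
        """Release unreclaimed locks of nodes whose window has passed."""
        expired = {n for n, t in self.deadlines.items() if now >= t}
        # Reclaimed locks have already moved to another node, so anything
        # still attributed to an expired node was never reclaimed.
        self.locks = [l for l in self.locks if l.node not in expired]
        for n in expired:
            del self.deadlines[n]
        return expired
```

The point of the model is that nothing else in the cluster has to stop while the window is open: other nodes keep granting locks, and only the failed node's leftover locks are affected when `expire` runs.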
There are still potential problems here, however, relating to network partitions. Consider a case where client1 has lost connectivity to the server. The server waits for a bit for client1 to reclaim the lock, but eventually gives up, purges client1’s state, and hands the lock to client2. Eventually connectivity is reestablished and client1 reclaims the lock, effectively “stealing” it from client2.
In order to prevent this sort of problem, I believe we’ll need to be a bit more strict on reclaim. We need to ensure that the lock being reclaimed is in fact being held by the correct client. What I think we’ll need to do in practice is have the new client ID tracking daemon (nfsdcld) pass back an “index” of the client name being tracked. We can put that index in the op->info.owner field and use it to verify that the correct client is actually doing the reclaim. Assuming that the indexes are not recycled too quickly, that should be sufficient to identify that the correct client is reclaiming the lock.
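Here is a small Python model of that check, replaying the partition scenario described above. It is a sketch under stated assumptions, not how nfsdcld works: the `ClientTracker` class, its `track` and `purge` methods, and `try_reclaim` are hypothetical, and the model assumes indexes come from a counter that never reuses a value.

```python
import itertools


class ClientTracker:
    """Toy stand-in for the client tracking daemon (hypothetical API)."""

    def __init__(self):
        self._next = itertools.count(1)  # assumed: indexes are never recycled
        self._index = {}                 # client name -> current index

    def track(self, name):
        """Start tracking a client name and return its fresh index."""
        idx = next(self._next)
        self._index[name] = idx
        return idx

    def purge(self, name):
        """Forget a client whose state has been purged."""
        self._index.pop(name, None)


def try_reclaim(lock_owner_index, claimed_index):
    """Allow a reclaim only if the lock is still recorded against this index."""
    return lock_owner_index == claimed_index


tracker = ClientTracker()
c1 = tracker.track("client1")
lock_owner = c1                      # client1 holds the lock

tracker.purge("client1")             # client1 is partitioned and times out
c2 = tracker.track("client2")
lock_owner = c2                      # the lock is handed to client2

stolen = try_reclaim(lock_owner, c1)  # client1 returns with its old index
```

In this model `stolen` ends up `False`, because the lock is now recorded against client2's index. If client1 is tracked again later it gets a new index, so its stale one can never match.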
There are probably other solutions as well, but this one seems relatively simple to implement.