Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking (part 2)

In an earlier post, I outlined a proposed overhaul of the scheme that knfsd uses to track client names on stable storage. There’s another aspect we need to consider as well. First, some background…

There are actually two sets of client “id’s” that we need to consider:

  • The long, client-generated string that (supposedly) uniquely identifies the client. This object is referred to in RFC3530 as an nfs_client_id4. This is the thing that the client tracking daemon will be tracking. Henceforth, I’ll call this the “client name”. This value is generally just used to acquire an ID of the other type.
  • The shorthand (64-bit) value that the server hands out to the client. This is referred to as a clientid4 in the RFC. This is the value that’s generally used for “day to day” stateful operations. We’ll call this the “client id”.

Currently, the server generates client id’s by combining a 32-bit “boot time” value with a 32-bit “counter”. This is generally fine in a single-node configuration, where it’s highly unlikely you’d ever get a client id collision. That matters because when a call comes in with a particular client id, we need to be able to use it to (quickly) find the actual client tracking object in the server.
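
As a rough illustration of that scheme, here is a minimal sketch that packs a 32-bit boot time and a 32-bit counter into a single 64-bit value and uses it as a lookup key. The names (make_clientid, split_clientid, Server) and the choice to put the boot time in the upper half are assumptions made for the example, not knfsd's actual code or layout.

```python
import itertools

MASK32 = 0xFFFFFFFF


def make_clientid(boot_time, counter):
    """Pack a 32-bit boot time and a 32-bit counter into one 64-bit id."""
    return ((boot_time & MASK32) << 32) | (counter & MASK32)


def split_clientid(clientid):
    """Return the (boot_time, counter) halves of a 64-bit id."""
    return (clientid >> 32) & MASK32, clientid & MASK32


class Server:
    """Toy single-node server: hands out ids and finds clients by id."""

    def __init__(self, boot_time):
        self.boot_time = boot_time
        self._counter = itertools.count(1)
        self.clients = {}  # client id -> client name

    def new_client(self, client_name):
        clientid = make_clientid(self.boot_time, next(self._counter))
        self.clients[clientid] = client_name
        return clientid


server = Server(boot_time=1_300_000_000)  # made-up boot time, in seconds
id_a = server.new_client("client-A")
id_b = server.new_client("client-B")
```

On a single node the counter alone keeps ids distinct within one boot, and the boot time keeps them distinct across reboots.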

In a cluster, however, this scheme is probably not going to be sufficient. If two cluster hosts boot in exactly the same second, then they could hand out the same client id. This is a big problem. Consider the following example:

Cluster hosts foo and bar are clustered NFS servers. They boot at the same time and hand out the same client id to two different clients. Now, an address floats from foo to bar. The client that was originally talking to foo then does an operation against bar that requires the clientid. bar sees that clientid and confuses it with the one that it originally handed out.
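
The same scenario as a small sketch. The boot time, the client names, and the per-node dictionaries are invented for illustration, and the id layout (boot time in the upper 32 bits) is an assumption.

```python
def make_clientid(boot_time, counter):
    """Pack a 32-bit boot time and a 32-bit counter into one 64-bit id."""
    return ((boot_time & 0xFFFFFFFF) << 32) | (counter & 0xFFFFFFFF)


boot_time = 1_300_000_000  # foo and bar both boot in this second (made up)

# Each node keeps its own table of client id -> client name, and each
# starts its counter at 1, so the first client on each node gets the
# same 64-bit value.
foo_clients = {make_clientid(boot_time, 1): "client-A"}
bar_clients = {make_clientid(boot_time, 1): "client-B"}

# An address floats from foo to bar, and client-A presents the id that
# foo gave it. bar finds an entry, but it belongs to a different client.
presented = make_clientid(boot_time, 1)
found = bar_clients.get(presented)
collided = found is not None and found != "client-A"
```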

There are a few different ways to avoid this:

  1. We can try to ensure that clientid collisions like this never happen. Sounds simple, but is difficult in practice.
  2. Instead of identifying clients just by client id, we can use a client id + server address tuple.
  3. We can containerize nfsd and ensure that each address is started within its own container. Presumably a containerized nfsd will have its own list of clients, and a collision is no big deal.
  4. We can consider abandoning the floating address model altogether and rely on clients using RR DNS and fs_locations info to find the servers. If the client knows that it must reclaim state before it ever talks to the failover host, then collisions aren’t a problem.
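
To make option 2 a bit more concrete, here is a minimal sketch of a client table keyed on the (server address, client id) pair. The ClientTable class and the addresses (taken from the documentation range) are hypothetical. The point is only that an id arriving via the floating address no longer matches an entry created via bar's own address.

```python
FLOATING_ADDR = "192.0.2.10"  # address that moved from foo to bar (example)
BAR_ADDR = "192.0.2.20"       # address that always lived on bar (example)


class ClientTable:
    """Toy table that identifies clients by (server address, client id)."""

    def __init__(self):
        self._clients = {}

    def add(self, server_addr, clientid, client_name):
        self._clients[(server_addr, clientid)] = client_name

    def lookup(self, server_addr, clientid):
        return self._clients.get((server_addr, clientid))


# The 64-bit value that both foo and bar happened to hand out.
same_id = (1_300_000_000 << 32) | 1

bar = ClientTable()
bar.add(BAR_ADDR, same_id, "client-B")

# client-A arrives on the floating address with the colliding id.
via_floating = bar.lookup(FLOATING_ADDR, same_id)  # no entry: unknown client
via_bar = bar.lookup(BAR_ADDR, same_id)            # still finds client-B
```

With the lookup failing cleanly, bar can treat the id as unknown instead of confusing the two clients.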

Containerizing sounds like the best approach, but perhaps abandoning floating addresses for this would be easier?


6 thoughts on “Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking (part 2)”

  1. Tigran Mkrtchyan

    Hi Jeff,

    I am working on a similar issue and have some ideas as well:

    I guess it’s common to use BOOT_TIME + COUNTER as the client id. At least this is the case in our server as well. Nevertheless, if we use 28 bits to store the boot time (just keeping the last 8 years) and a 28-bit counter (still big enough), then we get another 8 bits to encode a node id. Well, of course one can argue that a 255-node cluster is a limitation today.


    1. jtlayton Post author

      I think there’s a simpler scheme. The nodeid will need to come from userspace somehow anyway, so we might as well allow more of the clientid4 to come from there.

      Rather than using the boot time, we’ll keep a 32 bit counter on stable storage that’s shared amongst all nodes (the new nfsdcld daemon could track this). On each boot, the kernel will upcall to get the next value (and increment the shared counter). That should offer enough uniqueness without needing to deal with boot times or any sort of nodeid.

      Then we can use a regular old counter like we do today for the other 32 bits of the clientid4.

  2. Bruce Fields

    I like the global counter too.

    I’m pretty sure we have to do your #2 or #3 anyway. But we’ll eventually want cluster-wide unique clientid’s when we implement v4-based migration/failover.

    I’m not sure I understand #4–you’re thinking about v4-based migration/failover, I guess? In that case there are two possible clustered v4 server implementations: a fancy implementation may share the v4 state between nodes, so the new server already knows all of the client’s state; or a dumb implementation may require the client to reclaim all of its state. An implementation might even do both, depending on the circumstances: e.g. it might be “fancy” on planned migrations but “dumb” on failover. To determine which is the case, I expect the client will probe the new server using a RENEW with the clientid the old server gave it, and assume the “fancy” case iff that succeeds.

    That’s why I think we’ll want cluster-wide unique clientid’s for v4-based migration/failover.

  3. jtlayton Post author

    For #4 I was considering migration-based failover. The idea was that if the client is aware that it’s being relocated, then it could throw out the old clientid (and state). As you point out though, that might break other clustered server implementations that expect you to be able to reuse those clientids on other nodes in the cluster, so that’s not really a solution at all.

    We absolutely must have client IDs that are unique during a single cold start of the cluster as a whole. Technically, we could get away with collisions between different cold starts of the cluster, but I don’t think we’ll need to do that. With an on-stable-storage 32-bit counter, we’re probably OK without having to do anything else to ensure uniqueness. If we’re paranoid about that counter wrapping, we could even come up with a scheme to reset it back to 0 on a cold boot of the cluster.

    For this project, I’m planning to do the absolute bare minimum of state sharing that I can get away with to meet the goals. I want the cluster to be as dumb as possible in order to simplify the implementation.

  4. Bruce Fields

    “We absolutely must have client IDs that are unique during a single cold start of the cluster as a whole.”

    I don’t understand–why? (I think it is required in certain cases, I’m just not sure which assumptions you’re making when you say that….)

  5. jtlayton Post author

    I’m mostly concerned about collisions. If two different nodes hand out the same client ID value to two different clients, and then one of those clients migrates to another server, then all sorts of “bad things” might happen.

    I think we ought to avoid that. It turns out that I don’t think it’ll be too hard to do as long as we can upcall for a “boot generation” value from stable storage.
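
The shared “boot generation” counter discussed in the comments could be sketched roughly like this. It is only a model: a plain file stands in for the stable storage that nfsdcld might manage, flock() stands in for whatever cluster-wide locking would really be needed, and next_boot_generation and Node are hypothetical names. Nothing here reflects an actual kernel or nfsdcld interface.

```python
import fcntl
import itertools
import os
import struct
import tempfile


def next_boot_generation(path):
    """Return the current shared 32-bit counter and persist counter + 1."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # serialize nodes that boot together
        raw = os.pread(fd, 4, 0)
        current = struct.unpack("<I", raw)[0] if len(raw) == 4 else 0
        os.pwrite(fd, struct.pack("<I", (current + 1) & 0xFFFFFFFF), 0)
        os.fsync(fd)  # make the increment stable before the value is used
        return current
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


class Node:
    """Toy cluster node: fetches one generation per boot, then counts."""

    def __init__(self, counter_path):
        self.generation = next_boot_generation(counter_path)
        self._counter = itertools.count(1)

    def new_clientid(self):
        return (self.generation << 32) | (next(self._counter) & 0xFFFFFFFF)


# Two nodes "booting" against the same counter file get different
# generations, so their ids differ even though both counters start at 1.
counter_path = os.path.join(tempfile.mkdtemp(), "boot_generation")
foo = Node(counter_path)
bar = Node(counter_path)
foo_id = foo.new_clientid()
bar_id = bar.new_clientid()
```

Because every boot of every node consumes a fresh generation value, the upper 32 bits differ between any two boots until the shared counter wraps.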

