Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking

There are a number of problems that can occur when server reboots coincide with a loss of communication between the client and server. A couple of possible cases are outlined in Section 8.6.3 of RFC 3530. In order to prevent these problems, the server must track some information about its clients on stable storage. The Linux server currently implements the minimal scheme needed to prevent them (a rough sketch of these operations in code follows the list):

  1. Whenever the client does its first non-reclaim OPEN, OPEN_CONFIRM, or (for v4.1 clients) RECLAIM_COMPLETE, the server must store information about that client’s opaque ID string.
  2. When the client needs to reclaim during the grace period, the server must verify that there is a record of the client on stable storage.
  3. When the clientid expires, we MAY remove that record.
  4. When the grace period ends, any client records that were not reclaimed during the current grace period must be removed.
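
To make that division of labor concrete, here is a rough sketch of the four operations as a C interface. The names are purely hypothetical (nothing here comes from the actual knfsd code); they just mirror the steps above:

        #include <stddef.h>

        /* Hypothetical tracking interface; the names just mirror the list above. */

        /* 1. First non-reclaim OPEN/OPEN_CONFIRM (or RECLAIM_COMPLETE for v4.1):
         *    record the client's opaque ID string on stable storage. */
        int track_create(const char *id, size_t len);

        /* 2. Reclaim during the grace period: succeed only if a record exists. */
        int track_check(const char *id, size_t len);

        /* 3. The clientid expires: the record may be removed. */
        int track_remove(const char *id, size_t len);

        /* 4. Grace period ends: purge every record that was not reclaimed
         *    during the current grace period. */
        int track_purge_unreclaimed(void);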

Linux knfsd currently does this by tracking the information on disk, usually under /var/lib/nfs/v4recovery. Note that there is no daemon involved here: knfsd creates and removes directories in that directory itself, using vfs_* calls from within the kernel. This seems a bit odd and somewhat alarming until you realize that knfsd manipulates filesystems itself all the time in the course of its normal duties. Still, changing this has been on the to-do list for a long time.
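
For illustration, the userspace equivalent of what knfsd is doing looks roughly like the snippet below. Assume the directory name is derived somehow from the client’s ID string (I’m glossing over the exact format, so treat that part as an assumption); the kernel does the same sort of thing with vfs_mkdir() and friends instead of syscalls:

        #include <limits.h>
        #include <stdio.h>
        #include <sys/stat.h>

        #define RECDIR "/var/lib/nfs/v4recovery"

        /* Userspace approximation of the legacy scheme: one directory per
         * client under the recovery directory.  knfsd does this in-kernel
         * via vfs_* calls. */
        static int legacy_track_create(const char *dirname)
        {
                char path[PATH_MAX];

                snprintf(path, sizeof(path), RECDIR "/%s", dirname);
                return mkdir(path, 0700);
        }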

Now, though, we have a clear need to change this. If an address floats to a different server (or there’s a migration event, or something along those lines), we still need to track the client in case it needs to reclaim its state. The clientid database must be shared among all the hosts in the cluster.

One thing to bear in mind is that clients do not call SETCLIENTID against a server, but rather against an address. There is no reliable mechanism for a client to tell that a server is multi-homed, so it must treat each address as if it were an independent server. By the same token, when the address floats to a new server, we need to move any stateful information that’s associated with that address to the new server as well.

So, how do we fix this…

The first step toward fixing this is to remove the code that does this tracking today and replace it with a more standard upcall + daemon scheme for tracking this information. Since we have a large installed base of servers using the old mechanism, we’ll also need a transition mechanism. Currently, I’m considering just moving the older code inside something like this:

        #ifdef CONFIG_NFSD_V4_LEGACY_STATE_TRACKING

…and adding a new Kconfig option to switch between the two.
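
In practice that just means the common entry points dispatch to one implementation or the other at compile time. Something like the following, where the function names are placeholders I made up rather than actual knfsd symbols:

        #ifdef CONFIG_NFSD_V4_LEGACY_STATE_TRACKING
        /* Old scheme: manage directories under /var/lib/nfs/v4recovery. */
        #define client_track_create(clp)  legacy_recdir_create(clp)
        #define client_track_remove(clp)  legacy_recdir_remove(clp)
        #else
        /* New scheme: upcall to the userspace tracking daemon. */
        #define client_track_create(clp)  tracking_upcall_create(clp)
        #define client_track_remove(clp)  tracking_upcall_remove(clp)
        #endif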

Once that’s settled, we’ll need to add the upcall and daemon. I have some prototype code in progress for this, and it’s fairly straightforward. The upcall uses a new pipe in rpc_pipefs, and the daemon tracks the info using a sqlite DB in a standard location.
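
To give a feel for the daemon side, here is a minimal sketch of how it might initialize its sqlite store. The database path, table name, and schema are all placeholders rather than settled decisions:

        #include <sqlite3.h>

        /* Minimal sketch: one table keyed on the client's opaque ID string,
         * plus a timestamp we can use at end-of-grace cleanup time.  The
         * path and schema here are illustrative only. */
        static int clientdb_init(sqlite3 **db)
        {
                int ret;

                ret = sqlite3_open("/var/lib/nfs/clientdb.sqlite", db);
                if (ret != SQLITE_OK)
                        return ret;

                return sqlite3_exec(*db,
                        "CREATE TABLE IF NOT EXISTS clients ("
                        "  id   BLOB PRIMARY KEY,"
                        "  time INTEGER NOT NULL);",
                        NULL, NULL, NULL);
        }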

Doing this in the single-host case is relatively easy, and most of the operations are identical in the clustered case as well. The kicker is the last element of the tracking scheme, where we clean up at the end of the grace period. Because the grace period must be coordinated among all of the cluster hosts, we have to do something more elaborate for a clustered setup: we’ll need to ensure that only the first (or last?) host to come out of grace does the “purge” at the end of it.
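
As a purely illustrative example, building on the hypothetical schema above: if each record’s timestamp is refreshed when the client reclaims, the end-of-grace purge boils down to a single query. The hard part in a cluster isn’t the query, it’s deciding which node runs it, and only after every node has finished its grace period:

        #include <sqlite3.h>

        /* Purge every client that did not reclaim, i.e. whose record was
         * last touched before the current grace period began.  In the
         * clustered case exactly one node should run this. */
        static int clientdb_purge_unreclaimed(sqlite3 *db, long long grace_start)
        {
                sqlite3_stmt *stmt;
                int ret;

                ret = sqlite3_prepare_v2(db,
                        "DELETE FROM clients WHERE time < ?;",
                        -1, &stmt, NULL);
                if (ret != SQLITE_OK)
                        return ret;

                sqlite3_bind_int64(stmt, 1, grace_start);
                ret = sqlite3_step(stmt);
                sqlite3_finalize(stmt);
                return ret == SQLITE_DONE ? SQLITE_OK : ret;
        }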

This may mean that some sort of quorum/election cluster-fu will be involved here. Ideally, we’ll just be able to extend the single-host daemon to be cluster-aware when needed. Worst case, we’ll need to design an entirely separate daemon for it, but I hope that won’t be necessary.

One thought on “Active/Active NFSv4 on Clustered Filesystems: Client ID Tracking”

  1. J. Bruce Fields

    Nit:

    “There is no reliable mechanism for a client to tell that a server is multi-homed”

    NFSv4.1 does actually have such a mechanism. And people are experimenting with tricks to do this with the NFSv4.0 protocol, but I’m not sure that’s completely figured out yet.

    I’m not actually sure how v2/v3 (well, NSM would be the interesting part I think) deal with multi-homed servers.
