Active/Active NFSv4 on Clustered Filesystems: avoiding the VFS layer

So far, most of my posts about this proposed design have been about teaching the VFS and underlying filesystems how to handle the state properly for NFSv4 semantics. We could, however, consider another design: a clustered daemon that runs and tracks the state independently of the VFS. In other words, we could go with a more Samba/CTDB type design here.

knfsd would upcall to the daemon, and the daemon would be the final arbiter of state. Such a daemon could be integrated with Samba/CTDB so that knfsd and Samba are aware of each other’s state. In principle, that would allow you to export the same filesystem via both knfsd and Samba/CTDB and get reasonable semantics for both. Oplocks and delegations would get broken appropriately, share mode reservations should work, and (possibly) we could make locking work in a more consistent fashion than we do today.

Of course, there’s a catch: nothing else would be aware of that state. It would all be tied up inside the CTDB database (or whatever clustered DB infrastructure we’d end up using). The result would be an NFS+SMB “appliance”. I think that’s a less desirable design than one that’s more general purpose, but it might be easier to achieve, and we might be able to hammer it out more quickly since we’d avoid a lot of the VFS-layer work.

In the near term, we don’t really need to make this decision. Either way, we’ll still need to be able to swap in the correct set of operations to handle it, so the focus for now can be on simply abstracting out the parts of the server code that we’ll need to swap out to handle this later. It should even be possible to do this sort of design as an interim step, and then add what the VFS would need for a more general solution later.
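To make “swapping in the correct set of operations” a bit more concrete, here is a minimal sketch in Python. It is not kernel code and not how knfsd or CTDB actually work. `StateOps`, `NativeStateOps`, `UpcallStateOps` and `FakeClusterDaemon` are all hypothetical names, and the “daemon” is just an in-process object standing in for whatever clustered state tracker would sit behind the upcall.

```python
from abc import ABC, abstractmethod


class StateOps(ABC):
    """Hypothetical interface for the stateful operations the server swaps out."""

    @abstractmethod
    def open_share(self, path, client, exclusive):
        """Record an open; return False if it conflicts with an existing one."""


class NativeStateOps(StateOps):
    """Tracks open state locally, the way a single server would."""

    def __init__(self):
        self._opens = {}  # path -> list of (client, exclusive) holders

    def open_share(self, path, client, exclusive):
        holders = self._opens.setdefault(path, [])
        for who, held_exclusive in holders:
            # Conflict if another client is involved and either side is exclusive.
            if who != client and (held_exclusive or exclusive):
                return False
        holders.append((client, exclusive))
        return True


class FakeClusterDaemon:
    """Stand-in for the clustered daemon; a real one would be reached by upcall."""

    def __init__(self):
        self._state = NativeStateOps()

    def request_open(self, path, client, exclusive):
        return self._state.open_share(path, client, exclusive)


class UpcallStateOps(StateOps):
    """Keeps no state of its own; the daemon is the final arbiter."""

    def __init__(self, daemon):
        self._daemon = daemon

    def open_share(self, path, client, exclusive):
        return self._daemon.request_open(path, client, exclusive)
```

With this shape, two server nodes that share one daemon see each other’s opens, while two nodes that each use `NativeStateOps` do not. That difference is exactly the part the server code would need to be able to swap out.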


3 thoughts on “Active/Active NFSv4 on Clustered Filesystems: avoiding the VFS layer”

  1. Bruce Fields

    I’d still rather push the common mechanisms into the kernel than userspace, but, maybe so….

    We might also be able to cooperate with Ganesha people on that (if it’s true that there’s a Ganesha/CTDB project, then they may already be adding CTDB support for NFS state).

    I’d be a little afraid of having too many knobs here, with people left losing data if they choose the wrong settings for their usage. So we need to limit the cases to something we can easily document, I suppose.

  2. Jeff Layton Post author

    My thinking (still quite hand-wavy at this point) is that we’d allow the user to select the state tracking to use via export options. Maybe a “state={native|ctdb|…}” export option. So, you’d do something like this in /etc/exports:

    /foo *(rw,state=native)
    /bar *(rw,state=ctdb)

    …we could eventually have other values for state= as well. Once we ID’ed the export that a particular RPC is intended for, we’d just swap in the correct operation to handle the stateful op.

    1. Jeff Layton Post author

      …or maybe: state={single|cluster|ctdb}

      single = single server access — track all state internally

      cluster = clustered access — upcall to track stuff like client lock state, etc., but issue locks to the vfs itself

      ctdb = upcall to ctdb for all stateful objects, don’t worry about trying to track them in the vfs

      …of course, the names for the option and its potential values are quite open for debate.

      I guess the point is that we’re going to need to abstract all of this out to some degree anyway, so accommodating a use case like ctdb isn’t significantly more difficult…
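As a rough illustration of the proposed option, the sketch below pulls a `state=` value out of an export option string and validates it against the three values floated above. Note that `state=` is only a proposal in this thread, not an existing /etc/exports option. The function name `parse_state_option` and the default of `single` are assumptions made for the example.

```python
# Proposed (hypothetical) values for state=, with the meanings given above.
STATE_MODES = {
    "single": "track all state inside the server",
    "cluster": "upcall for client lock state, but issue locks to the vfs",
    "ctdb": "upcall to ctdb for all stateful objects",
}


def parse_state_option(options, default="single"):
    """Return the state= value from a comma-separated export option string."""
    mode = default
    for opt in options.split(","):
        key, sep, value = opt.strip().partition("=")
        if key == "state" and sep:
            if value not in STATE_MODES:
                raise ValueError(f"unknown state tracking mode: {value!r}")
            mode = value
    return mode


# Example: the option strings from two imagined export lines.
for opts in ("rw,state=single", "rw,state=ctdb"):
    mode = parse_state_option(opts)
    print(f"{opts}: {STATE_MODES[mode]}")
```

Once the export for an incoming RPC is identified, the returned mode would be the key used to pick which set of operations handles the stateful op.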

