Active/Active NFSv4 on Clustered Filesystems: abstracting out fs operations from nfsd

Regardless of which method we ultimately choose, a more immediate engineering task is to abstract out the operations that we want to delegate to the lower filesystems. Since we eventually want to allow filesystems to override the default grace-period handling, we’ll need to provide some way to do that. Most likely, we’ll just add some new export_ops to allow for this (or maybe a new struct that hangs off the export ops).
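
As a rough illustration of the override idea, here is a minimal userspace sketch in C. It is not kernel code: struct model_export, struct model_grace_ops, and every function name are hypothetical stand-ins, and the "cluster-wide grace" rule is only an assumed example of what a clustered filesystem might want. The point is just the shape: nfsd asks one helper, and the helper falls back to the default when the filesystem supplies no hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an export; NOT the kernel's struct export_operations. */
struct model_export;

/* Hypothetical table of hooks a filesystem may override. */
struct model_grace_ops {
    /* Returns true if opens/locks on this export are still subject to grace. */
    bool (*in_grace)(const struct model_export *exp);
};

struct model_export {
    const struct model_grace_ops *grace_ops; /* NULL means "use the default" */
    bool server_in_grace;   /* stand-in for nfsd's own grace-period state */
    bool cluster_in_grace;  /* stand-in for state a clustered fs might track */
};

/* Default behaviour: only the local server's grace period matters. */
static bool default_in_grace(const struct model_export *exp)
{
    return exp->server_in_grace;
}

/* Assumed override: stay in grace until the whole cluster has left it. */
static bool cluster_in_grace(const struct model_export *exp)
{
    return exp->server_in_grace || exp->cluster_in_grace;
}

static const struct model_grace_ops cluster_grace_ops = {
    .in_grace = cluster_in_grace,
};

/* The single entry point nfsd would call instead of checking grace directly. */
static bool export_in_grace(const struct model_export *exp)
{
    assert(exp != NULL);
    if (exp->grace_ops && exp->grace_ops->in_grace)
        return exp->grace_ops->in_grace(exp);
    return default_in_grace(exp);
}
```

An export that leaves grace_ops NULL gets the default answer, while one that points at cluster_grace_ops stays in grace for as long as either flag is set.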

Here’s a first stab at what I envision for this, mostly based on looking at where nfsd currently checks the grace period:

  • an nfsd_open operation: this will be the most complex of the set. It’ll need to do several things semi-atomically (at least with respect to other nfsd threads):
    • check the grace period and determine whether the file is allowed to be opened
    • (optionally) check share reservations that are being requested and that are already set, and determine whether the file is allowed to be opened
    • open the file (assuming the above checks passed)
    • set the share reservation, if any (we may eventually combine that with the actual open if we ever get VFS support for share mode reservations)
    • get a lease/delegation (if needed)
    • …and hand the result back to nfsd
  • an nfsd_lock operation (for F_SETLK/F_GETLK):
    • check the grace period and see if we’re allowed to set a new lock or reclaim an old one
    • acquire or release the lock
  • an overridable grace_disallows_io to handle mandatory locks. This will prevent stateid-based ops from performing I/O during the grace period if necessary.
  • nfsd_remove and nfsd_rename ops: to allow the blocking of metadata ops during the grace period
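
To make the ordering inside the open operation concrete, the steps listed above can be sketched as a small userspace model in C. Everything here is an assumption made for illustration: the struct and function names, the status codes, and the simplistic lease rule (grant one only if none is held) are all hypothetical, and the actual file open is modelled as a no-op. The grace check follows the usual NFSv4 rule that only reclaims are allowed during grace and reclaims are refused afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for this model only. */
enum model_status { MODEL_OK, MODEL_ERR_GRACE, MODEL_ERR_NO_GRACE,
                    MODEL_ERR_SHARE_DENIED };

#define SHARE_READ  1u
#define SHARE_WRITE 2u

struct model_file {
    bool in_grace;         /* is the filesystem still in its grace period? */
    unsigned access_held;  /* union of access modes of existing opens */
    unsigned deny_held;    /* union of deny modes of existing opens */
    bool lease_held;       /* has a lease/delegation already been given out? */
};

struct model_open_args {
    bool reclaim;          /* is this a reclaim of pre-restart state? */
    unsigned access, deny; /* requested share reservation */
    bool want_lease;
};

struct model_open_result {
    enum model_status status;
    bool got_lease;
};

/* Step 1: normal opens wait out grace; reclaims are only valid during it. */
static enum model_status check_grace(const struct model_file *f, bool reclaim)
{
    if (f->in_grace && !reclaim)
        return MODEL_ERR_GRACE;
    if (!f->in_grace && reclaim)
        return MODEL_ERR_NO_GRACE;
    return MODEL_OK;
}

/* Step 2: requested modes must not collide with reservations already set. */
static bool share_conflict(const struct model_file *f,
                           const struct model_open_args *a)
{
    return (a->access & f->deny_held) || (a->deny & f->access_held);
}

/*
 * A real implementation would hold some lock across this function so that
 * other nfsd threads see the steps as one unit; this model is single-threaded.
 */
static struct model_open_result model_nfsd_open(struct model_file *f,
                                                const struct model_open_args *a)
{
    struct model_open_result res = { MODEL_OK, false };

    assert(f != NULL && a != NULL);
    res.status = check_grace(f, a->reclaim);
    if (res.status != MODEL_OK)
        return res;
    if (share_conflict(f, a)) {
        res.status = MODEL_ERR_SHARE_DENIED;
        return res;
    }
    /* Step 3: open the file (a no-op in this model). */
    /* Step 4: record the share reservation. */
    f->access_held |= a->access;
    f->deny_held |= a->deny;
    /* Step 5: the lease is optional, so failing to get one is not an error. */
    if (a->want_lease && !f->lease_held) {
        f->lease_held = true;
        res.got_lease = true;
    }
    return res;
}
```

Keeping the checks ahead of any state change means a refused open leaves the model untouched, which is the property the "semi-atomically" requirement is after.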

That’s a start anyway. The actual API still needs to be determined. In the near term, I plan to do some reshuffling of the code to make it easier to break this stuff out later. Once those pieces have a more sharply defined box around them, it should just be a matter of turning them into a set of operations that filesystems can override in some fashion.


4 thoughts on “Active/Active NFSv4 on Clustered Filesystems: abstracting out fs operations from nfsd”

  1. Bruce Fields

    “nfsd_open operation: this will be the most complex of the set.”

    Yes, ugh.

    I have some other work either planned or ongoing there:

    – Lease/delegations: I’m hoping to get some changes here merged for 3.4. Fortunately I think they don’t have to be too closely tied with the rest of the open: what we have to guarantee is that the file still exists under the same name before we return a delegation (which is broken on unlink or rename), but we can do that by acquiring the lease late in the process and then re-checking the lookup. Delegations are optional, so if it fails at that point we can just return without a delegation.

    – open locking: I don’t have any patches here yet. I believe open-create is the one place where we hold v4’s global state lock over a file operation. That limits the number of creates per second to 1/(disk latency). This can show up as a severe regression from v3.

    – vfs share locks: longer term, we would like to enforce share locks properly against other users. Samba currently does this with flock(fd, LOCK_MAND|LOCK_WRITE) and flock(fd, LOCK_MAND|LOCK_READ). Which struck me as an ugly hack the first time I saw it, but perhaps it could do the job for us too….

    In general, the v4 open code is a mess. It would be nice if we can do all this without making it more so.

  2. Bruce Fields

    “a nfsd_remove and nfsd_rename op: to allow the blocking of metadata ops during the grace period”

    By the way, the real reason we need this, I think, is for delegations: if a client holds a read delegation, then it’s supposed to be synchronously notified of renames or unlinks so that, for example, it can continue to handle opens locally, and guarantee to applications that at least the last component is still correct without waiting for a server round trip. This could be broken over reboot if, for example, another client unlinks a file before our client notices the reboot (and the lost delegation).

    But then I think it would be nicer to move the grace-check calls into nfsd_break_lease() (which is called from nfsd rename and unlink code).

    And longer term (if we want to some day move grace enforcement to the vfs) we’d move those checks into break_lease() itself.

  3. jtlayton Post author

    Thanks for the input. Yes, we really want to avoid making this code even more convoluted than it already is. I think that’s doable by trying (as much as possible) to separate out the protocol-level code from the vfs-layer code. The grace period sort of straddles both parts, but I think delegating that to the filesystem for the most part should be the right approach.

    My thinking is to move whatever we think a filesystem might want to override behavior-wise into these operations. In particular, we probably want to get rid of any direct references to the grace period in the nfsd code and move those into these operations. Moving delegation handling and share-level locking under that allows us to handle that in an fs-specific manner as well.

    Since we’re still not 100% clear on how best to handle this at the fs layer, we really want to aim to give filesystems flexibility in how these stateful objects are handled. We may even want to experiment with an approach that’s more samba/ctdb-like, where we track these objects outside of the VFS via an upcall of some sort. If we move to a set of operations like what I’m suggesting, then we can play with that idea and, if it turns out to be unworkable, later go back and retrofit how it works at the fs layer without disturbing nfsd too much.

    That said, I’m still very much at the hand-wavy stage with this so far. I don’t really know what it’ll eventually look like, so the initial pass at this will be an attempt to clean up and draw better-defined boxes around these parts of the existing code.

