Monthly Archives: February 2012

Active/Active NFSv4 on Clustered Filesystems: abstracting out fs operations from nfsd

Regardless of what method we ultimately choose, a more immediate engineering task is to abstract out the operations that we want to delegate to the lower filesystems. Since we eventually want to allow filesystems to override the default grace-period handling, we’ll need to provide some way to do that. Most likely, we’ll just add some new export_ops to allow for this (or maybe a new struct that hangs off the export ops).
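To make the "new struct that hangs off the export ops" option a bit more concrete, here is a minimal userspace sketch of the override-with-fallback pattern. Every name in it (nfsd_fs_ops, export_ops_sketch, export_in_grace and friends) is made up for illustration and is not an existing kernel interface; the real export operations struct is only stood in for by a two-field placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: hooks a filesystem could supply to nfsd. */
struct nfsd_fs_ops {
    /* Returns true if the filesystem considers itself in grace. */
    bool (*in_grace)(void *fs_private);
};

/* Placeholder for the export operations; only the assumed new field
 * and a private pointer are shown. */
struct export_ops_sketch {
    const struct nfsd_fs_ops *nfsd_ops; /* NULL means "use nfsd defaults" */
    void *fs_private;
};

/* Stand-in for nfsd's existing node-local grace-period state. */
static bool default_grace_active = true;

static bool default_in_grace(void)
{
    return default_grace_active;
}

/* Prefer the filesystem's override, otherwise fall back to the default. */
static bool export_in_grace(const struct export_ops_sketch *exp)
{
    assert(exp != NULL);
    if (exp->nfsd_ops && exp->nfsd_ops->in_grace)
        return exp->nfsd_ops->in_grace(exp->fs_private);
    return default_in_grace();
}

/* Example override: a clustered fs that tracks grace cluster-wide. */
struct cluster_state {
    bool cluster_grace;
};

static bool cluster_in_grace(void *fs_private)
{
    const struct cluster_state *cs = fs_private;
    return cs->cluster_grace;
}
```

A filesystem that leaves nfsd_ops as NULL keeps today's behaviour, while a clustered filesystem can point in_grace at something like cluster_in_grace and answer from cluster-wide state instead.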

Here’s a first stab at what I envision for this, mostly based on looking at where nfsd currently checks the grace period:

  • an nfsd_open operation: this will be the most complex of the set. It’ll need to do several things semi-atomically (at least with respect to other nfsd threads):
    • check the grace period and determine whether the file is allowed to be opened
    • (optionally) check share reservations that are being requested and that are already set, and determine whether the file is allowed to be opened
    • open the file (assuming the above checks passed)
    • set the share reservation, if any (we may eventually combine that with the actual open if we ever get VFS support for share mode reservations)
    • get a lease/delegation (if needed)
    • …and hand the result back to nfsd
  • an nfsd_lock operation (for F_SETLK/F_GETLK):
    • check the grace period and see if we’re allowed to set a new lock or reclaim an old one
    • acquire or release the lock
  • an overridable grace_disallows_io to handle mandatory locks. This will prevent stateid-based ops from performing I/O during the grace period if necessary.
  • nfsd_remove and nfsd_rename ops: to allow the blocking of metadata ops during the grace period
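As a rough illustration of the nfsd_open sequence in the list above, the following userspace sketch runs the grace check, the share reservation check, and the state update under one mutex, so that other threads never see a half-completed open. It is a toy model under stated assumptions: the types, the function sketch_nfsd_open, and the status codes are all invented here, the "open" is just a counter, and the lease/delegation step is left out. The grace rules follow the usual NFSv4 behaviour (non-reclaim opens are refused during grace, reclaims are refused outside it).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SHARE_READ  0x1u
#define SHARE_WRITE 0x2u

/* Hypothetical status codes, loosely modelled on NFSv4 errors. */
enum open_status {
    OPEN_OK,
    OPEN_ERR_GRACE,        /* non-reclaim open during grace */
    OPEN_ERR_NO_GRACE,     /* reclaim attempted outside grace */
    OPEN_ERR_SHARE_DENIED  /* conflicts with an existing reservation */
};

/* Hypothetical per-file state guarded by a single lock. */
struct file_state {
    pthread_mutex_t lock;
    bool in_grace;
    unsigned access;  /* union of access bits held by current opens */
    unsigned deny;    /* union of deny bits held by current opens */
    int open_count;
};

struct open_request {
    bool reclaim;
    unsigned access;
    unsigned deny;
};

static void file_state_init(struct file_state *f, bool in_grace)
{
    pthread_mutex_init(&f->lock, NULL);
    f->in_grace = in_grace;
    f->access = 0;
    f->deny = 0;
    f->open_count = 0;
}

static enum open_status sketch_nfsd_open(struct file_state *f,
                                         const struct open_request *req)
{
    enum open_status st = OPEN_OK;

    assert(f != NULL && req != NULL);
    pthread_mutex_lock(&f->lock);

    if (f->in_grace && !req->reclaim) {
        st = OPEN_ERR_GRACE;                 /* grace period check */
    } else if (!f->in_grace && req->reclaim) {
        st = OPEN_ERR_NO_GRACE;
    } else if ((req->access & f->deny) || (req->deny & f->access)) {
        st = OPEN_ERR_SHARE_DENIED;          /* share reservation check */
    } else {
        f->open_count++;                     /* "open the file" */
        f->access |= req->access;            /* set share reservation */
        f->deny |= req->deny;
        /* acquiring a lease/delegation would go here; omitted */
    }

    pthread_mutex_unlock(&f->lock);
    return st;
}
```

Holding one lock across all of the steps is the simplest way to get the "semi-atomic" behaviour. A clustered filesystem overriding this op would presumably need a cluster-wide equivalent, which is exactly why the whole sequence, rather than each individual check, is the unit worth abstracting.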

That’s a start anyway. The actual API still needs to be determined. In the near term, I plan to do some reshuffling of the code to make it easier to break this stuff out later. Once those pieces have a more sharply defined box around them, it should just be a matter of turning them into a set of operations that filesystems can override in some fashion.