Author Archives: Jeff Layton

About Jeff Layton

I'm a Linux filesystem developer for Red Hat, primarily focusing on NFS and CIFS. I'm also a member of the worldwide Samba team and the maintainer of the cifs-utils project.

The inode->i_version counter in Linux

I’ve written previously about the NFSv4 change attribute and CephFS. The Linux kernel has also supported it for a long time via the i_version field in the inode structure.

Most kernel subsystems just treat this field as an opaque value. Some filesystems use it internally, primarily to track changes to directories (mostly for handling directory changes during readdir). The kernel NFS client uses that field to store the NFSv4 change attribute.

The exception is when the filesystem sets the MS_I_VERSION flag in the superblock. That flag tells the kernel that it should increment i_version on regular files when it goes to change the timestamps during a data or metadata change. Only a few filesystems in the kernel do this: xfs and btrfs always have it enabled. ext4 has it off by default but you can turn it on with the i_version mount option.
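
For illustration, enabling it on ext4 just means passing that option at mount time; from C that boils down to something like the snippet below (the device and mountpoint are placeholders, and this is just the programmatic equivalent of mount -o i_version):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* placeholder device and mountpoint; equivalent to running
     *   mount -o i_version /dev/sdXN /mnt
     * from the shell */
    if (mount("/dev/sdXN", "/mnt", "ext4", 0, "i_version") == -1)
        perror("mount");
    return 0;
}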

With that enabled, the kernel NFS server will also populate the NFSv4 change attribute from that value as well. This is what allows us to (more or less) eliminate timestamp granularity problems for NFSv4 clients when exporting those filesystems.

The important bit to note here is that there is a lot of inconsistency on how this field is handled in the kernel. Some filesystems have very different semantics from others. AFS for instance uses this as the data version counter which is not incremented on metadata changes. For the most part though, these differences haven’t really mattered as the users of this field are limited and are confined to the kernel.

But, that may change in the not too distant future. There is (hopefully!) a new syscall on the horizon: statx(). With that, exposing the i_version field to userland could be very useful. There are a lot of applications that would like to have a cheap and reliable indicator that something changed on an inode. Before we can expose that field however, we’ll need to sort out the inconsistencies between the implementations, and clearly define what this field should mean for applications.

Now we circle back to Ceph. The Ceph change attribute implementation exploits the fact that we don’t necessarily need to bump the counter atomically on every change. When userland ceph clients are able to cache a change to the metadata, they bump the counter in their local copy. When the change counter is queried by an application (which uses an interface very similar to statx), all of the cached copies are synched up before handing the value back to the caller.

Given that we have at least one filesystem that already benefits from looser semantics for the i_version counter, can we also relax this requirement in in-kernel filesystems to reduce the number of metadata updates that are required on a write-heavy workload?

Can we make it go faster?

A few months ago, a colleague of mine who primarily works on local filesystems asked about the feasibility of turning off the i_version counter. Apparently some of the Red Hat QA folks were concerned about an XFS performance regression when moving from RHEL6 to RHEL7. They had tracked it down to the fact that XFS had revved its on-disk metadata format, and the i_version counter had been switched on universally.

The workload in this case was very small I/Os in iozone which is clearly synthetic, but it got me thinking that it might be possible to reduce the number of situations where we need to dirty the inode metadata.

Basically, the idea is to set a flag in the inode whenever we query the i_version counter. When we go to bump it, we only increment the counter if that flag is set, and then we clear the flag. Finally, we fix the filesystems to skip logging the change to disk when the counter wasn’t incremented and the times didn’t change.
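
Conceptually (ignoring locking and the real kernel data structures, and with names made up purely for illustration), the logic looks something like this:

#include <stdbool.h>

struct iversion {
    unsigned long long counter;
    bool queried;    /* set when the value is read, cleared on increment */
};

/* called when something (knfsd, a future statx(), ...) reads the counter */
static unsigned long long iversion_query(struct iversion *iv)
{
    iv->queried = true;
    return iv->counter;
}

/*
 * Called when the timestamps are being updated for a data or metadata
 * change. Returns true if the counter was bumped and the inode must
 * therefore be marked dirty on its account.
 */
static bool iversion_maybe_bump(struct iversion *iv)
{
    if (!iv->queried)
        return false;    /* nobody has looked since the last bump: skip it */
    iv->counter++;
    iv->queried = false;
    return true;
}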

For the overwhelmingly common case where no one has queried it, this reduces the number of times that the inode metadata is dirtied down to a frequency on par with the timestamp granularity.

That’s neat, and it may really help small buffered I/O workloads where fsync is only infrequently called. As a pessimal case, we can play with time and dd doing 1-byte writes:

$ time dd if=/dev/zero of=/mnt/scratch/foo bs=1 count=10000000

Which is a silly workload of course, but it should give us the most dramatic results. On XFS on my fairly crappy test rig on an unpatched kernel:

real 0m13.840s
user 0m0.964s
sys 0m12.824s

After my draft patch series:

real 0m6.753s
user 0m0.617s
sys 0m6.114s

…about half the time. Again though, this is not at all a real-world workload, so I don’t think it’s sufficient to merge this based on that alone. I see some small speedup on kernel builds but nothing to get too excited about.

That said, ext4 has the i_version counter off by default and you have to specify a mount option to turn it on. I don’t think they’d have done that unless they considered the counter costly, so there may well be a use case for this optimization.

So, I pushed this into my kernel tree on samba.org and few days later, I got this from the Intel kernel test robot:

Greeting,

FYI, we noticed a 43.5% improvement of vm-scalability.throughput due to
commit:

commit: fa629d46c4da556a77c7b8c7760e734dd88d1f3e ("fs: only set S_VERSION
when updating times if it has been queried")

43%! That certainly sounds significant. The docs for the vm-scalability testsuite say this about the test that showed the improvement:

case-msync-mt:
Create a sparse file with size of $MemTotal, before creating N threads, it
will preallocate and prefault 1/2 memory space with mmap using this sparse
file as backing store and then the N threads will all write data there
using the preallocated space. When this is done, use msync to flush
change back to the file.

Is that representative of any sort of real-world workload?

CephFS and the NFSv4 Change Attribute

Since I’ve rejoined Red Hat, I’ve been doing some work within cephfs to better understand the code. One of the projects I’ve been working on is to add a statx-like interface to libcephfs (the userland ceph client code). As part of that, I’m adding in support for file create times (also known as the “btime”), and a NFSv4 change attribute with the aim of adding support for it to nfs-ganesha.

The NFSv4 Change Attribute

Good NFS performance relies on being able to cache file data aggressively. Clients use inode metadata to tell whether they should invalidate their caches. Basically they look at certain metadata in the inode (usually via a GETATTR operation) and then invalidate the cache if it looks like it has changed.

When doing something like a write() syscall however, POSIX says only that you must restamp the modification time (mtime) of the inode with the current time. Since that action changes the metadata, the inode change time (ctime) must also be stamped with the current time.

Older versions of NFS relied heavily on watching for timestamp changes in order to tell if the cache was valid. This was always problematic though. Timestamp granularity on Linux is generally 1 millisecond, and a lot can happen within that time, which can cause the clients to end up with stale data in the file when they don’t see an apparent change to the times.

So, when NFSv4 was drafted, it mandated a new metadata attribute — the change attribute. The general idea is that it is an unsigned 64-bit number that must increase in value whenever there has been either a data or metadata change to the file. When a client queries it and the file has changed in some fashion, the value must be larger than when the client last queried for it.

Servers have a bit of latitude on how they implement the change attribute, but the end result is always semi-opaque to the client. It can only look at the change attribute and see whether it has increased since the last time it was checked. If so, then it can infer that something has changed in the file (either data or metadata).

Note that clients are not allowed to inspect the value to see how much it has changed. The only thing they can do is look and see if it’s larger than the previous value. At that point, any data or metadata cached on that inode is suspect.
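
A minimal sketch of that client-side logic (the structure and the invalidate_cache() helper are hypothetical stand-ins for whatever a real client actually uses):

#include <stdint.h>

struct cached_inode {
    uint64_t change_attr;    /* value seen at the last GETATTR */
    /* ... cached data and metadata would live here ... */
};

/* hypothetical helper: drop all cached data and attributes for this inode */
static void invalidate_cache(struct cached_inode *ci)
{
    (void)ci;
}

static void revalidate(struct cached_inode *ci, uint64_t new_change)
{
    /* the only legal question is "is it larger than what I had before?" */
    if (new_change > ci->change_attr) {
        invalidate_cache(ci);
        ci->change_attr = new_change;
    }
}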

CephFS capabilities

First a bit of background on CephFS. CephFS is quite different from other network filesystems in that parts of the data and metadata can be delegated to different clients in a very granular way. These are called capabilities (aka caps) in CephFS parlance. For instance, the owner, group and mode are all under the AUTH capability, whereas the mtime and size are under the FILE capability.

These generally come in two flavors: shared and exclusive. Shared caps allow clients to cache the existing values, and exclusive caps allow them to be changed by the client and buffer that change until the capability is revoked.

Furthermore, the CephFS metadata server can grant different exclusive caps to different clients on the same inode, allowing them to cache changes to the metadata and data. The MDS will recall these capabilities in order to ensure that other clients see the changes that those clients may be caching, prior to satisfying a request that would provide those attributes. At that point, the clients will return their capabilities and update the server with the new metadata.

Thus it’s possible that one client can have the AUTH exclusive cap (allowing it to cache a change to the mode, for instance), and then another client may hold the FILE exclusive cap which allows it to buffer up writes to the file and the associated changes to the mtime.

By way of shorthand, we often refer to the capabilities with a capital letter, and then a lowercase letter denoting whether it’s shared or exclusive.

Adding a NFSv4 Change Attribute to CephFS

The implementation adds the change attribute to the FILE capability since writes are more common than explicit metadata changes, and it often changes in conjunction with the mtime.

But! Suppose we have one client that holds AUTH exclusive caps (Ax) and another with the necessary caps to  buffer writes to the file and cache mtime changes (we’ll denote this Fx for now, but the FILE cap is actually a bit more complex).

Both clients start out seeing the same change attribute (call it 1), and the same mode on the file. The Ax client does a change of the mode, and bumps his change attribute to 2. The Fx client does a write, bumps his change attribute to 2, and then does a GETATTR to fetch the mode and change attribute. What happens now?

The MDS then recalls the Ax cap from the first client, sees that the change attribute is now at 2 and changes the value in the MDS cache. It then sends updated inode attributes to the other client (the one holding Fx caps). That client sees the change attr at 2.

But wait — he already has a value of 2. Is that a violation of the semantics?

The answer (I believe) is no. Recall that the client is not allowed to infer anything from the actual value of the change attribute. It can only check to see if it’s newer than what it had before. So, while the Fx client did have the same change attribute cached after its write, its cached inode attributes (including the mode) became invalid on the transition from 1 to 2.

IOW, it’s perfectly legitimate to “bundle” several changes under one change attribute value change, as long as you ensure that no one has queried for the change attribute in the meantime.

Note that there are even more complex situations that can involve NFSv4 delegations, but the key points here are that all cached metadata becomes invalid once the change attribute changes, and that the MDS will recall (and sync up) changes to the change attribute when there are exclusive caps outstanding to another client that cover the metadata being queried.

Current Progress

The patchset has been through several iterations and there is currently a pull request up to add the first pile of patches. nfs-ganesha will require adding several other ceph_ll_* interfaces, but the heavy lifting is already done here. We’re hoping this will make the next CephFS release (Kraken).

File-private POSIX locks (aka un-POSIX locks)

The Problems with POSIX Locking

At last year’s Linux Storage and Filesystem summit, we had a lively discussion about what Linux needs in order to make it easier to implement fileservers on top of it. One of the bullet points brought up by Boaz Harrosh was the brokenness of the POSIX file locking standard. There are really two different problems:

POSIX Locks are tracked on a per-process basis such that lock requests by the same process cannot conflict with one another. Thus, they are useless for synchronization between threads within the same process. This also means that locks are not inheritable across a fork() since they are intimately tied to the process.

The bigger issue is that all POSIX locks held by a process are dropped any time the process closes any file descriptor on the inode that the process holds open. This is the bit that catches most programmers by surprise. It requires that you carefully delay closing any file descriptors held open on a file until you’re sure that all locks on that file can be released. As Jeremy Allison points out in his blog “this is not behavior that you would ever want!” (he also has a good description of how this standard came about — see the section entitled “First Implementation Past the Post”).

It also leads to the “library problem”. Occasionally libraries may need to open a file in order to read or write to it and then close it again. If a library does this while the application has the file open with locks held on it, those locks are then lost. In many cases, this is not obvious to the calling program.
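
Here’s a small, self-contained illustration of the trap (error handling omitted); the second open() stands in for something a library might do behind the application’s back:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0644);
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

    fcntl(fd1, F_SETLK, &fl);    /* whole-file write lock, taken on fd1 */

    int fd2 = open("/tmp/lockdemo", O_RDONLY);    /* e.g. inside a library */
    close(fd2);                  /* ...and the lock taken on fd1 is gone */

    /* fd1 is still open here, but the process no longer holds the lock */
    close(fd1);
    return 0;
}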

BSD locks (via the flock() mechanism) have much saner semantics with respect to inheritance and behavior on close. Locks are tied to the open file table entry, not the process. If a process opens a file twice and tries to set exclusive locks on both, the second one will be denied. BSD locks are only released automatically when the last reference to the open file on which they were acquired is closed. The only problem is that BSD locks are whole-file locks only. They aren’t byte-range aware.

What we need is a new lock type that is a hybrid of the two. We want BSD behavior for inheritance and close, but the locks need to be byte-range locks. Furthermore, since we have a large legacy codebase of programs that use “classic” POSIX locks, these new locks need to conflict with them so that programs using the new locks will interoperate correctly with them.

A Possible Solution

I have a preliminary patchset to add a lock that behaves like this. I’m calling them “file-private” locks for now, but I’m amenable to better names if anyone has one. The current patchset adds a new set of cmd values to fcntl(2). The interface is very similar to the traditional POSIX lock interface, even down to the struct flock argument that is passed in. The main difference is that these locks behave more like BSD locks with respect to inheritance and close. The new cmd values are (in a nutshell):

F_SETLKP:  Acquire or release a file-private lock.
F_SETLKPW: Acquire or release a file-private lock. If the lock isn't available then wait for it.
F_GETLKP:  Test whether a file-private lock would be able to be acquired.

Whereas classic POSIX locks are “owned” by the process, these are owned by the open file. Thus, these locks will conflict with “classic” POSIX locks, even if they are taken on the same file descriptor by the same process. If however, a fd is inherited by a child process via fork then locks acquired via that same fd in both the parent and child will not conflict (since the open file table is the same).

Also, the locks would only be automatically released on the last close of that file descriptor. So, if the parent takes a lock on a fd and forks, and the child then closes that fd, then the lock will stay around until the parent closes it. By the same token, if the parent closes the fd first, then the lock will stay around until the child closes it.
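
For illustration, usage looks almost identical to classic POSIX locking; the only new piece is the cmd value (F_SETLKP comes from the patchset, so building this assumes headers from a patched kernel):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/testfile", O_RDWR | O_CREAT, 0644);
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 4096,    /* byte-range: lock just the first 4k */
    };

    /* F_SETLKP is not in stock headers; it comes from the patchset */
    if (fcntl(fd, F_SETLKP, &fl) == -1)
        perror("fcntl(F_SETLKP)");

    /*
     * Unlike a classic POSIX lock, this lock is owned by the open file:
     * closing some other descriptor on the same inode won't drop it, and
     * it only goes away when the last reference to this open file is
     * closed.
     */
    close(fd);
    return 0;
}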

With this, someone implementing a fileserver for Linux that needs to manage byte-range locks on behalf of a client can simply keep an open file for each “lockowner”.

I have an initial implementation of this code for Linux, and have opened an RFE with the Austin Group to have them consider adopting this as part of the formal POSIX standard. So far they seem amenable to the idea.

Active/Active NFSv4 Serving — Userspace NFS Servers

Much of the difficulty in creating a Linux-based active/active NFS server comes from how much we’ll need to upcall, since so many things are too difficult to handle in the kernel.

The existing (single-node) NFSv4 server in the kernel already has to upcall for:

  • mountd/exportfs information
  • ID mapping
  • GSSAPI
  • Client ID tracking (technically, this isn’t an upcall currently. The kernel manages files in a well-known directory via VFS interfaces. That’s pretty gross though, and we plan to switch that to an upcall soon)

Now, consider all of the information that will need to flow between nodes with a clustered NFS server. That’s at least another upcall of some sort, and probably more than one.

At some point, we need to ask ourselves: “Why do any of this in the kernel at all?”

Doing this in userspace smooths over a lot of the hassles. krb5 auth and ID mapping, for instance, become much simpler to handle, and we cut down on the round trips between kernel and userspace for some things. Also, if we want to do something like plug in Gluster on the backend, we no longer have to deal with multiple round trips through the kernel and back. We can just plug the userspace NFS server into the Gluster userspace libs and let ‘er rip.

Userspace NFS servers do have their own sorts of hassles as well:

  • filehandles: traditionally the main reason for an in-kernel NFS server was to eliminate the need to track filehandle-to-path information. When a server gets a NFS filehandle it has to figure out which inode that refers to. Most userspace servers have relied on tracking that info in a separate database, but that approach has many problems if something else is manipulating the filesystem too. With recent Linux kernels, we have open-by-filehandle support, which obviates much of the need for this. There are apparently still some issues with this support, but I think we can probably fix those as they are identified.
  • ownership/permissions: Most current userspace NFS servers always just run as root and try to fix up and enforce permissions after the fact. This is less than ideal since it’s racy and means that you have to enforce permissions in userspace. The main reason for this is that the POSIX setuid() family of syscalls works on a per-process basis. If you have a multithreaded process, then changing the uid will change it on every thread. This makes it really difficult to handle the situation where you want to do something like open or create a file as a particular user. An open() is a synchronous syscall, so ideally you’d like to spawn a thread to handle that call so we can keep processing other calls arriving in parallel. Unfortunately, you may need to change uids to handle the next call coming in. So, there is also a need to be able to set per-thread credentials. glibc has no support for this currently. The Samba server guys are calling the setfsuid() syscall directly to handle this (see the sketch after this list), but that’s less than ideal. It would be nice to have some sort of officially blessed interface for this.
  • Asynchronous IO: buffered AIO is non-existent under Linux. There have been many attempts to make that work, but none have come to fruition. The current fashion is to spawn threads that do standard synchronous I/O calls. It’s kludgey but as long as you have per-thread credentials, that approach works properly.
  • recvfile(): ideally, we’d like to be able to move pages of data straight from a socket to a file. We have the reverse currently — sendfile() works and can inline pages from the pagecache directly into a skbuf fraglist. This allows a userspace server to implement zero-copy read calls. It would be nice to be able to do zero-copy writes as well by yanking pages out of a receive buffer and dropping them directly into the pagecache. Some OS’ have recvfile() for that purpose. Linux has splice() which is supposed to be a way to implement recvfile(). Unfortunately, splice() is broken in this regard — it always copies the data to the pagecache instead of replacing/moving pages into it. This would offer up a nice performance boost if it could be fixed.
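
As a rough sketch of the per-thread credential juggling described in the ownership/permissions item above (the helper is hypothetical and error handling is omitted; setfsuid()/setfsgid() affect only the calling task, which is what makes this workable from a thread pool, and calling them via syscall() is essentially the trick the Samba folks use):

#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* open a file on behalf of a particular client user from a worker thread */
static int open_as_user(const char *path, int flags, uid_t uid, gid_t gid)
{
    int fd;

    /* switch this thread's filesystem credentials to the client's */
    syscall(SYS_setfsgid, gid);
    syscall(SYS_setfsuid, uid);

    fd = open(path, flags);

    /* switch back to the server's own credentials for this thread */
    syscall(SYS_setfsuid, 0);
    syscall(SYS_setfsgid, 0);
    return fd;
}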

There are other problems too, of course. The TI-RPC userspace libraries are notoriously buggy when you use threads. The NFS Ganesha folks have done a bunch of work on them, but they haven’t pushed their changes upstream yet. There was talk about that at the Bake-a-Thon recently, so hopefully it’ll come to fruition soon.

Active/Active NFSv4 on Clustered Filesystems: avoiding the VFS layer

So far, most of my posts about this proposed design have been about teaching the VFS and underlying filesystems how to handle the state properly for NFSv4 semantics. We could however, consider another design. We could have a clustered daemon that runs and tracks the state independently of the VFS. In other words, we could go with a more Samba/CTDB type design here.

knfsd would upcall to talk to a daemon and it would be the final arbiter. Such a daemon could be integrated with Samba/CTDB such that knfsd and Samba are aware of each other’s state. In principle, that would allow you to export the same filesystem via both knfsd and Samba/CTDB and get reasonable semantics for both. Oplocks and delegations would get broken appropriately, share mode reservations should work, and (possibly) we could make locking work in a more consistent fashion than we do today.

Of course, there’s a catch — nothing else would be aware of that state. It would all be tied up inside the CTDB database (or whatever clustered DB infrastructure we’d end up using). The result would be a NFS+SMB “appliance”. I think that’s a less desirable design than one that’s more general purpose, but it might be easier to achieve and we might be able to hammer that out quicker since we’d be able to avoid a lot of the VFS-layer work.

In the near term, we don’t really need to make this decision. Either way, we’ll still need to be able to swap in the correct set of operations to handle it, so the focus for now can be on simply abstracting out the parts of the server code that we’ll need to swap out to handle this later. It should even be possible to do this sort of design as an interim step, and then add what the VFS would need for a more general solution later.

Active/Active NFSv4 on Clustered Filesystems: abstracting out fs operations from nfsd

Regardless of what method we ultimately choose, a more immediate engineering task is to abstract out the operations that we want to delegate to the lower filesystems. Since we want to eventually allow filesystems to override the default grace-period handling, we’ll need to provide some way to do that. Most likely, we’ll just add some new export_ops to allow for this (or maybe a new struct that hangs off the export ops).

Here’s a first stab at what I envision for this, mostly based on looking at where nfsd currently checks the grace period:

  •  nfsd_open operation: this will be the most complex of the set. It’ll need to do several things semi-atomically (at least with respect to other nfsd threads)
    • check the grace period and determine whether the file is allowed to be opened
    • (optionally) check share reservations that are being requested and that are already set, and determine whether the file is allowed to be opened
    • open the file (assuming the above checks passed)
    • set share reservation (if any, we may eventually combine that with the actual open if we ever get VFS support for share mode reservations)
    • get a lease/delegation (if needed)
    • …and hand the result back to nfsd
  • an nfsd_lock operation (for F_SETLK/F_GETLK):
    • check the grace period and see if we’re allowed to set a new lock or reclaim an old one
    • acquire or release the lock
  • an overridable grace_disallows_io to handle mandatory locks. This will prevent stateid-based ops from performing I/O during the grace period if necessary.
  • an nfsd_remove and nfsd_rename op: to allow the blocking of metadata ops during the grace period

That’s a start anyway. The actual API itself will need to be determined for this. In the near term, I plan to do some reshuffling of the code to make it easier to break this stuff out later. Once those pieces have a more sharply defined box around them it should just be a matter of turning them into a set of operations that filesystems can override in some fashion.
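
To make that a little more concrete, here’s a purely illustrative, kernel-flavored sketch of what such a set of overridable operations might look like; none of these names or signatures exist today, and the real API will fall out of the reshuffling described above:

/* forward declarations standing in for the real kernel types */
struct svc_fh;
struct file;
struct file_lock;
struct net;

struct nfsd_cluster_ops {
    /*
     * Open the file semi-atomically with respect to other nfsd threads:
     * check the grace period and any share reservations, open the file,
     * set the requested share reservation, acquire a lease or delegation
     * if needed, and hand the result back to nfsd.
     */
    int (*nfsd_open)(struct svc_fh *fhp, int may_flags,
                     unsigned int share_access, struct file **filp);

    /* check the grace period, then acquire, test, or release a lock */
    int (*nfsd_lock)(struct file *filp, int cmd, struct file_lock *fl);

    /* should stateid-based I/O be refused during the grace period? */
    int (*grace_disallows_io)(struct net *net, struct svc_fh *fhp);

    /* allow metadata ops to be blocked during the grace period */
    int (*nfsd_remove)(struct svc_fh *fhp, const char *name, int namelen);
    int (*nfsd_rename)(struct svc_fh *ffhp, const char *fname, int flen,
                       struct svc_fh *tfhp, const char *tname, int tlen);
};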

Active/Active NFSv4 on Clustered Filesystems: POSIX Lock Recovery (cold-start case)

In my last post, I talked about POSIX lock recovery in this configuration once the cluster is up and running. The assumption there is that the lock manager (DLM in the case of GFS2) has a record of the lock that we want the client to be able to reclaim.

We do however, also need to consider the case where the cluster is being cold-started. At that point, no node in the cluster will have any active locks. How do we know whether a client is allowed to reclaim locks at that point? To illustrate the problem, consider the following (pathological) case:

A client has mounted a filesystem from two different cluster nodes, for instance:

node1:/export mounted on /mnt1
node2:/export mounted on /mnt2

The client takes locks under both mountpoints, and then loses contact with node2. Eventually the client’s lease expires on node2 and that node releases all of the locks that the client held on it. The entire cluster is then rebooted, and when it comes back up, the client mounts both filesystems from node1 and attempts to reclaim all the locks under /mnt1 and /mnt2.

How does node1 know which locks are allowed to be reclaimed? Clearly, everything under /mnt1 should be allowed and nothing under /mnt2, but the cluster doesn’t currently store reboot recovery info at this level of detail. Right now, it only stores information about clients themselves, but nothing about the resources they hold.

After spending some time working through different scenarios, I’ve come up with the following “rule”:

The flexibility of client configurations is directly proportional to the granularity of lock information tracked on stable storage.

In other words: the more information that we keep about active locks on stable storage, the better able we are to cope with different client configurations. In real terms, that means that we have some choices, and implications for those choices:

  1. We could avoid tracking any information at all on stable storage. We’d just return NFS4ERR_NO_GRACE on any lock reclaim request that didn’t match an existing lock. On cold-boot of the cluster, there would be no lock reclaim. This is very easy to implement, but clearly non-optimal. I don’t think we’ll find this acceptable.
  2. We could just track information about the clients themselves on stable storage. In particular, we’d just track the client name “strings” that are sent in the SETCLIENTID or EXCHANGE_ID call. This is basically what knfsd does today, and what my proposed nfsdcld daemon does. The problem here is that it can’t handle the situation I outlined above. In order for this to work, we’d have to insist that clients only mount from one node in the cluster at a time.
  3. We could track client name strings along with the destination IP address for each call. If a SETCLIENTID call comes in then we’d track a tuple of information that includes the client name string and the server address where the call came in. With this, we could reasonably support a “floating address” cluster where clients can be moved between servers by moving the server’s address to another cluster node. This might involve a hack of some sort to fix the server’s ID string to be destination address specific for v4.1. The downside here is that this would prevent clients from being able to reclaim locks against a different address from which they were acquired. That means that NFSv4 replication and migration using fs_locations info would not work and that reduces flexibility. It might also be difficult or impossible to integrate this configuration with pNFS serving if that ever becomes desirable.
  4. We could track client name strings along with the fsid of the filesystem where the resources are held. This would involve changing when we upcall/downcall to store this information. It would have to happen whenever we acquire our first resource on a fs and when the last resource is released from a fs. With this, we could reasonably allow for floating addresses and v4 fs_locations based migration/replication provided that the clients don’t mount the same filesystem off of more than one node at a time. In other words, the scenario described earlier would be unsupportable. Worse, nothing would prevent someone from doing that (other than documentation), so there is the potential for people to shoot themselves in the foot. They may not even realize it since it would basically work until they hit a network partition + cold boot scenario.
  5. We could track information about every lock on stable storage. This has the most performance impact, but it allows for complete flexibility in client configurations. Clients could migrate freely between hosts using any mechanism. There would also be no NFS4ERR_GRACE returns to clients. Clients would be able to make new lock requests as soon as the cluster starts. For this, we’d probably do something a bit different than the previous mechanisms. We’d probably want to integrate this tracking with the userspace lock manager (DLM or equivalent). What I’d envision is that DLM would keep info on stable storage about what locks it holds. When the cluster is started, we’d use that info to “preload” the reclaimable locks into DLM. Then as the clients come up, they’d take ownership of them. Any that aren’t reclaimed during an initial grace period would be released. Then we’d just enable NFS serving and start letting clients reclaim those locks. This simplifies the knfsd piece, but adds quite a bit of complexity to DLM.

Those are the possibilities I see for doing this. While it sounds a bit tougher to do, I’m starting to warm up to #5. It seems like the best option for future flexibility. It will make locking performance even slower than it is today however, so we’ll need to weigh carefully whether that’s reasonable.