Monthly Archives: December 2016

The inode->i_version counter in Linux

I’ve written previously about the NFSv4 change attribute and CephFS. The Linux kernel has also supported it for a long time via the i_version field in the inode structure.

Most kernel subsystems just treat this field as an opaque value. Some filesystems use it internally, primarily to track changes to directories (mostly for handling directory changes during readdir). The kernel NFS client uses that field to store the NFSv4 change attribute.

The exception is when the filesystem sets the MS_I_VERSION flag in the superblock. That flag tells the kernel that it should increment i_version on regular files when it goes to change the timestamps during a data or metadata change. Only a few filesystems in the kernel do this: xfs and btrfs always have it enabled. ext4 has it off by default but you can turn it on with the i_version mount option.

With that enabled, the kernel NFS server will also populate the NFSv4 change attribute from that value. This is what allows us to (more or less) eliminate timestamp granularity problems for NFSv4 clients when exporting those filesystems.
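To make the granularity problem concrete, here is a toy userspace model (not kernel code). The 10 ms tick and the names ToyFile, mtime_ns and change_attr are made up for illustration. Two writes land inside the same timestamp tick, so a client comparing only mtime misses the second one, while a counter that is bumped on every change catches it.

```python
from dataclasses import dataclass

GRANULARITY_NS = 10_000_000  # hypothetical 10 ms timestamp tick


@dataclass
class ToyFile:
    mtime_ns: int = 0
    change_attr: int = 0

    def write(self, now_ns: int) -> None:
        # Timestamps are truncated to the clock granularity.
        self.mtime_ns = now_ns - (now_ns % GRANULARITY_NS)
        # The change attribute is bumped on every change.
        self.change_attr += 1


f = ToyFile()
f.write(now_ns=20_000_000)
cached = (f.mtime_ns, f.change_attr)  # a client caches both attributes

f.write(now_ns=23_000_000)            # second write lands in the same tick
mtime_changed = f.mtime_ns != cached[0]
counter_changed = f.change_attr != cached[1]
print(mtime_changed, counter_changed)  # prints: False True
```

A client revalidating its cache on mtime alone would keep stale data here; one comparing the change attribute would not.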

The important bit to note here is that there is a lot of inconsistency in how this field is handled in the kernel. Some filesystems have very different semantics from others. AFS, for instance, uses this as the data version counter, which is not incremented on metadata changes. For the most part though, these differences haven’t really mattered, as the users of this field are limited and are confined to the kernel.

But, that may change in the not too distant future. There is (hopefully!) a new syscall on the horizon: statx(). With that, exposing the i_version field to userland could be very useful. There are a lot of applications that would like to have a cheap and reliable indicator that something changed on an inode. Before we can expose that field however, we’ll need to sort out the inconsistencies between the implementations, and clearly define what this field should mean for applications.

Now we circle back to Ceph. The Ceph change attribute implementation exploits the fact that we don’t necessarily need to bump the counter atomically on every change. When userland Ceph clients are able to cache a change to the metadata, they bump the counter in their local copy. When the change counter is queried by an application (which uses an interface very similar to statx), all of the cached copies are synced up before handing the value back to the caller.

Given that we have at least one filesystem that already benefits from looser semantics for the i_version counter, can we also relax this requirement in in-kernel filesystems to reduce the number of metadata updates that are required on a write-heavy workload?

Can we make it go faster?

A few months ago, a colleague of mine who primarily works on local filesystems asked about the feasibility of turning off the i_version counter. Apparently some of the Red Hat QA folks were concerned about an XFS performance regression when moving from RHEL6 to RHEL7. They had tracked it down to the fact that XFS had revved its on-disk metadata format, and the i_version counter had been switched on universally.

The workload in this case was very small I/Os in iozone, which is clearly synthetic, but it got me thinking that it might be possible to reduce the number of situations where we need to dirty the inode metadata.

Basically the idea is to set a flag in the inode whenever we query the i_version counter. Then, when we go to bump it, we only bump the counter if that flag is set, and then we clear the flag. Finally, we fix the filesystems to only log the change to disk if the counter was incremented or one of the times changed.

For the overwhelmingly common case where no one has queried for it, we can reduce the number of times that the inode metadata is dirtied down to a frequency on par with the timestamp granularity.
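Here is a rough sketch of that scheme as a toy userspace model in Python. It is not the actual patch: the names ToyInode, queried and dirtied and the 10 ms tick are all assumptions made for illustration. The counter is bumped only if someone has looked at it since the last bump, and the inode counts as dirtied only when the counter was bumped or the (coarse) timestamp changed.

```python
from dataclasses import dataclass

GRANULARITY_NS = 10_000_000  # hypothetical 10 ms timestamp tick


@dataclass
class ToyInode:
    i_version: int = 0
    mtime_ns: int = -1
    queried: bool = False  # hypothetical "i_version was looked at" flag
    dirtied: int = 0       # times the inode metadata would need logging

    def query_version(self) -> int:
        # Remember that someone has seen the current value.
        self.queried = True
        return self.i_version

    def update_on_write(self, now_ns: int) -> None:
        stamp = now_ns - (now_ns % GRANULARITY_NS)
        times_changed = stamp != self.mtime_ns
        self.mtime_ns = stamp

        bumped = False
        if self.queried:
            # Only bump if the old value was observed, then clear the flag.
            self.i_version += 1
            self.queried = False
            bumped = True

        # Log the inode only if something visible actually changed.
        if bumped or times_changed:
            self.dirtied += 1


def run(writes: int, step_ns: int, query_every: int = 0) -> ToyInode:
    inode = ToyInode()
    for i in range(writes):
        if query_every and i % query_every == 0:
            inode.query_version()
        inode.update_on_write(now_ns=i * step_ns)
    return inode


# 1000 writes, 1 ms apart, so ten writes share each timestamp tick.
unqueried = run(writes=1000, step_ns=1_000_000)
polled = run(writes=1000, step_ns=1_000_000, query_every=1)
print(unqueried.dirtied, unqueried.i_version)  # prints: 100 0
print(polled.dirtied, polled.i_version)        # prints: 1000 1000
```

With nobody querying, the model dirties the inode once per timestamp tick (100 times for 1000 writes). If something queries before every write, it falls back to dirtying on every write, which is the old behaviour.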

Which is neato, and may really help small buffered I/O workloads where fsync is only called infrequently. As a pessimal case, we can play with time and dd doing 1-byte writes:

$ time dd if=/dev/zero of=/mnt/scratch/foo bs=1 count=10000000

Which is a silly workload of course, but it should give us the most dramatic results. On XFS on my fairly crappy test rig on an unpatched kernel:

real 0m13.840s
user 0m0.964s
sys 0m12.824s

After my draft patch series:

real 0m6.753s
user 0m0.617s
sys 0m6.114s

…about half the time. Again though, this is not at all a real-world workload, so I don’t think it’s sufficient to merge this based on that alone. I see some small speedup on kernel builds but nothing to get too excited about.

That said, ext4 has the i_version counter off by default and you have to specify an option to turn it on. I don’t think they’d do that unless they thought the counter were cost-prohibitive. Clearly there may be a use case there?

So, I pushed this into my kernel tree on samba.org and a few days later, I got this from the Intel kernel test robot:

Greeting,

FYI, we noticed a 43.5% improvement of vm-scalability.throughput due to
commit:

commit: fa629d46c4da556a77c7b8c7760e734dd88d1f3e ("fs: only set S_VERSION
when updating times if it has been queried")

43%! That certainly sounds significant. The docs for the vm-scalability testsuite say this about the test that showed the improvement:

case-msync-mt:
Create a sparse file with size of $MemTotal, before creating N threads, it
will preallocate and prefault 1/2 memory space with mmap using this sparse
file as backing store and then the N threads will all write data there
using the preallocated space. When this is done, use msync to flush
change back to the file.

Is that representative of any sort of real-world workload?