Using kdevops to test the Linux kernel NFS server

A few months ago, I transitioned from working on CephFS full time back to helping maintain the Linux kernel NFS server (aka knfsd). Chuck Lever (the primary maintainer) and I have a long-term goal to improve testing of incoming changes. In particular, we’d like to have some sort of continuous integration (CI) to help catch regressions.

After reviewing a few options, I settled on kdevops since it’s being adopted by several filesystem teams, and configuring and running it uses tools familiar to kernel developers. It’s configured using the same Kconfig system that the Linux kernel uses, and provisioning and running machines is done using make targets.

Importantly, it is (supposedly) infrastructure-neutral. It can spin up virtual machines locally, using vagrant, or in several major cloud environments using terraform. In principle, it should be possible to run the same set of tests in either environment without any modifications.

The purpose of kdevops is to bring up one or more virtual machines in the environment you specify, configure them to run the kernel you want, and then run tests on them. This post assumes some familiarity with Linux kernel development.

Caveats and disclaimers

Under the hood, kdevops is a giant hairball of ansible playbooks, shell and python scripts, configured using Kconfig and run by GNU Makefiles. When things break in the automation, it can be difficult to track down what happened.

The automation can make changes to services and such on the host, so you may want to consider using a dedicated host that can be rebuilt at will. I've also never run this project in the cloud using the terraform support (something that is on my to-do list), so there may be unanticipated bugs if you use a terraform-based setup instead of vagrant-based local VMs.

Where to run it?

My first thought was to just run this on my main workstation, but some of the Kconfig options make configuration changes to the host. I decided to put together a dedicated test machine using some spare parts I had lying around.

If you’re building a machine to run kdevops on, it requires a very recent distro. I originally considered using CentOS Stream 9 for this, figuring it might be more stable as a virtualization platform, but even that wasn’t new enough. I ended up installing Fedora 37. Maybe those requirements will be relaxed in the future.

Installing kdevops

Once you have a machine and have installed the basic requirements listed in the docs, you can pull down the kdevops git tree. Follow the README for the prerequisites (currently just ansible and make).

Configuring kdevops

The first step is to run:

$ make menuconfig

This should look reasonably familiar to experienced kernel developers, but the configuration in this case is for kdevops itself. There are many options, and some of them affect others. My advice is to walk carefully through the different options, at least the first time, and resign yourself to some trial and error. The goal here is to get something that works for your test environment.

The first choice you have to make is the “Node bring up method”. Choose either Vagrant to run the tests locally or terraform to run them in the cloud. From there you’ll need to make a choice of what distro to use for the guests, and other options for setting it up.

Next, you have to configure the test suite to use. This is called a workflow in kdevops parlance. In this case, you’ll want to choose the pynfs workflow. Also choose CONFIG_WORKFLOWS_DEDICATED_WORKFLOW since some of the higher-level automation doesn’t work reliably with multiple workflows.
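After making those choices, the relevant part of the generated .config ends up looking something like the fragment below. CONFIG_WORKFLOWS_DEDICATED_WORKFLOW is the option mentioned above; the other symbol names are from my checkout of kdevops and may have changed since, so treat them as illustrative:

```text
CONFIG_WORKFLOWS=y
CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
CONFIG_KDEVOPS_WORKFLOW_ENABLE_PYNFS=y
```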

Once you’ve gone through the configuration, save it and exit, and then run:

$ make

That will ensure that the host and automation are set up properly for the Kconfig options you’ve chosen.

Bring up the guest virtual machines

To kick off VMs, run:

$ make bringup

This brings up the VMs that have been configured. In this configuration, I have it starting a VM in the local hypervisor using vagrant boxes, but if you configure it to use a cloud environment instead, it’ll bring the machines up in the cloud using terraform.

Aside: ssh’ing into the VMs

At least in a basic vagrant-based setup, the VMs created are on a NAT’ed network, and it’s not always simple to figure out their address. From the top level kdevops directory, you can run this to ssh into a vagrant guest named “kdevops” as the user “vagrant”:

$ cd ./vagrant
$ vagrant ssh kdevops

This is helpful for troubleshooting guest problems. I assume terraform has something similar.
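If you'd rather use plain ssh or scp, `vagrant ssh-config kdevops` (run from the vagrant subdirectory) prints the connection details, which are easy to pick apart. Here's a small sketch of that; the captured output below is made up for illustration, and in practice you'd pipe the real command's output in:

```python
# Parse `vagrant ssh-config` output (captured here as a literal sample)
# to get a target usable with plain ssh/scp.
import re

sample = """\
Host kdevops
  HostName 192.168.121.50
  User vagrant
  Port 22
"""

def ssh_target(conf: str) -> str:
    # ssh-config options are indented "Key value" pairs
    opts = dict(re.findall(r"^\s+(\w+)\s+(\S+)", conf, re.M))
    return "{}@{} -p {}".format(opts["User"], opts["HostName"], opts["Port"])

print("ssh", ssh_target(sample))   # ssh vagrant@192.168.121.50 -p 22
```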

Build and install a kernel

At this point we can run pynfs, but it would run on whatever distro kernel happens to be in the cloud image you’re using. kdevops can pull down a kernel git tree, build and install it on the guests, and reboot into it.

Most of the config options that govern this are under the “Target workflows” section of the Kconfig. Once you have it configured, you can run:

$ make linux

If successful, it should get the new kernel, build it, install it and boot to it.

Install pynfs

At this point, we can pull down and build pynfs in the guests.

$ make pynfs

…and then run it against the NFS server on the same machine:

$ make pynfs-baseline

That will generate two JSON files under the workflows/pynfs/results directory with names that contain the kernel version on which they were run.

Running it in a loop

Running them once is useful, but what we really want to do is run pynfs in a loop for a number of iterations, and then compare the results against a “gold standard” (that may change over time). kdevops has support for this with pynfs:

$ make pynfs-baseline-loop

That will run pynfs against the server in a loop, and compare the results of each iteration against workflows/pynfs/baseline/current-v4.{0,1}.json. If any new failures are seen that are not failures in those master files, the run aborts and (optionally) email is sent to a predefined address.
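The core of that comparison is simple: a failure only matters if it isn't already a known failure in the baseline. Here's a toy version of that logic; the result-file schema is deliberately simplified (a dict mapping test code to outcome string), not the actual pynfs JSON layout:

```python
# Toy "new failures vs. baseline" comparison, in the spirit of what the
# kdevops pynfs loop does. The schema here is assumed/simplified.

def new_failures(baseline: dict, run: dict) -> set:
    """Tests failing in this run that were not already failing in baseline."""
    known_bad = {t for t, r in baseline.items() if r != "pass"}
    return {t for t, r in run.items() if r != "pass"} - known_bad

baseline = {"OPEN2": "pass", "LOCK24": "failure"}
run = {"OPEN2": "failure", "LOCK24": "failure"}
print(sorted(new_failures(baseline, run)))   # ['OPEN2']
```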

Shutting down the guest VMs

To bring down the guests, run:

$ make destroy

Current status and long term plans

Eventually, I want to expand this so that we can run other tests, such as NFStest. That adds a new level of complexity since we’ll probably need to set up multiple hosts per test or some sort of containerization. That would also help test the client in addition to the server.

While my Frankenbox was OK while I was developing pynfs support in kdevops, it’s not really ideal for iterative testing, since it’s just a machine next to my desk, and doubles as a space heater. Long term, I’d like to run this on a regular basis in a cloud environment.

Some projects that I’ve worked on in the past (e.g. Ceph and Samba) have community foundations that they can fall back on to cover expenses like this. The Linux NFS project doesn’t have anything of that nature.

Are there cloud providers that offer free credits to open source developers?

Testing the Linux Kernel CephFS Client with xfstests

I do a lot of testing with the kernel cephfs client these days, and have had a number of people ask about how I test it. For now, I’ll gloss over the cluster setup since there are other tutorials for that.

Test Environment

For the cluster, I have a separate machine dedicated to running a set of 3 KVMs (8G each, running centos-stream8). I use cephadm to build a cluster that uses those machines as cluster hosts. Each KVM has a dedicated SSD so I get OK-ish performance (but not stellar).

Occasionally, I’ll also need to test against a vstart cluster, usually when I need to work with some bleeding-edge userland changes, but for the most part I rely on my 3 node KVM setup.

The machines are connected via 1Gb Ethernet.

Cluster Setup

The cephadm cluster has 3 KVM hosts that act as cluster nodes. I run a mon on each, and each gets an OSD daemon.

From there, I usually create two separate cephfs filesystems, one named “test” and one named “scratch”. I then bump up the MDS count in the orchestrator and max_mds on each filesystem to give each fs a set of 3 active MDSs and one standby MDS.

I also enable random pinning on both fs’s with a 0.1% frequency, just to thrash things a bit more.

Client Configuration

I run a KVM on my main workstation that acts as a client (with 16G of memory). The client VM is Fedora 34 (but I’ll probably upgrade it soon). Make sure the ceph-common package is installed (so you have the mount.ceph binary).

Next, you’ll need to set up the configuration. Here’s the script I use:


#!/bin/bash
# Fetch a minimal ceph.conf and admin keyring from a cluster node.
# Temp files, and the location of the cephadm script on the remote
# host (adjust to suit):
CONFTMP=$(mktemp)
KEYTMP=$(mktemp)
CEPHADM=/usr/sbin/cephadm

# final locations
CONF=/etc/ceph/ceph.conf
KEYRING=/etc/ceph/keyring

ssh $1 "sudo $CEPHADM shell ceph config generate-minimal-conf" > $CONFTMP
sudo chown root:ceph $CONFTMP
sudo chmod 0644 $CONFTMP
sudo cp -p $CONFTMP $CONF

ssh $1 "sudo $CEPHADM shell ceph auth get-or-create client.admin" > $KEYTMP
sudo chown root:ceph $KEYTMP
sudo chmod 0640 $KEYTMP
sudo cp -p $KEYTMP $KEYRING

Run it with the hostname of a cluster node where you have an account as the first argument. Be sure to set $CEPHADM to the right location for the cephadm script on that machine.

Once you run that, you’ll have a minimal config on the client. You may want to test it by running “ceph -s” or something similar.

At this point, I usually mount up both the test and scratch filesystems and create a directory under there with the hostname of the client (client1 in this example), and set it up to do random pinning. This way I can run multiple clients and let them test in their own areas of each fs.
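That per-client setup can be sketched as follows. The fs names, mountpoints, client name, and the 0.1% pin frequency are from my environment; the commands are built as lists so you can review them (or hand them to subprocess.run()) rather than executing anything blindly:

```python
# Sketch of the per-client test-area setup described above. All paths
# and names are assumptions from my setup; mds_namespace selects the
# named cephfs, and the ceph.dir.pin.random xattr enables random
# ephemeral pinning on the directory.
client = "client1"
cmds = []
for fs in ("test", "scratch"):
    cmds += [
        ["mount", "-t", "ceph", ":/", f"/mnt/{fs}",
         "-o", f"name=admin,mds_namespace={fs}"],
        ["mkdir", "-p", f"/mnt/{fs}/{client}"],
        ["setfattr", "-n", "ceph.dir.pin.random", "-v", "0.001",
         f"/mnt/{fs}/{client}"],
    ]
for c in cmds:
    print(" ".join(c))
```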


So you now have a cluster and a client, and the client’s userland code can talk to the cluster.

On the client, you’ll need to pull down the xfstests tree, and build it (e.g. run “make”). You may need to install some prerequisite packages (see the README file in the xfstests sources).

To run xfstests, you’ll need an appropriate config file. Here’s the main one I use (hopefully with some helpful comments). I usually replace local.config in the xfstests tree with this, just to make test running easy. You’ll need to adjust this for your own environment, of course:

# This config file is for running xfstests on kcephfs. You'll require
# an existing cluster to test against (a vstart cluster is fine).
# To understand the mount syntax and the available options, see
# mount.ceph(8).
export TEST_DIR=/mnt/test
export SCRATCH_MNT=/mnt/scratch

# "check" can't automatically detect ceph device strings, so we must
# explicitly declare that we want to use "-t ceph".
export FSTYP=ceph

# In this example, we've created two different named cephfs filesystems:
# "test" and "scratch". They must be pre-created on the ceph cluster before
# the test is run.
# The check script gets very confused when two different mounts use
# the same device string. There is a project to change how the mount device
# syntax works, but it's not yet merged.
# For now, we must declare the location of the mons explicitly. Note that we're
# using two different monaddrs here on different hosts, though these are using
# the same cluster.  The monaddrs must also match the type of ms_mode option
# that is in effect (i.e.  ms_mode=legacy requires v1 monaddrs).
export TEST_DEV=

# TEST_FS_MOUNT_OPTS is for /mnt/test, and MOUNT_OPTIONS is for /mnt/scratch
# Here, we're using the admin account for both mounts. The credentials
# should be in a standard keyring location somewhere. See mount.ceph(8).
COMMON_OPTIONS="name=admin"

# if you want to use an explicit secret instead of finding it in a ceph keyring
# COMMON_OPTIONS+=",secret=AQAkaM5g7+GuIRAAM3xLNwSQc8953uo3/1QkLw=="

# use msgr2 in crc mode
COMMON_OPTIONS+=",ms_mode=crc"

# asynchronous directory ops
COMMON_OPTIONS+=",nowsync"

# enable copy offload
COMMON_OPTIONS+=",copyfrom"

# now for the per-mount options
export TEST_FS_MOUNT_OPTS="-o $COMMON_OPTIONS"
export MOUNT_OPTIONS="-o $COMMON_OPTIONS"

# select the correct cephfs
TEST_FS_MOUNT_OPTS+=",mds_namespace=test"
MOUNT_OPTIONS+=",mds_namespace=scratch"

# fscache -- each fs needs its own fsc= tag
TEST_FS_MOUNT_OPTS+=",fsc=test"
MOUNT_OPTIONS+=",fsc=scratch"


Finally, you just need to run the tests. Some xfstests can take a very long time to run on cephfs. I often run the quick test group. It still takes a couple of hours on ceph, but it covers a variety of things.

Some tests always fail on cephfs. generic/003, for example, complains about atime handling, and ceph really can’t (easily) offer the semantics it wants. I have a file called ceph.exclude in the root of the xfstests tree with a single line in it so I can skip that one:

generic/003
Now we can run the tests!

$ sudo ./check -g quick -E ./ceph.exclude

If there are failures, please report them to the ceph-devel mailing list and we’ll try to help troubleshoot what happened.

Deploying an Active/Active NFS Cluster over CephFS

A little over a year ago, I made a post about a new design for active/active NFS clustering. Since then, an implementation of it has been merged into nfs-ganesha.


The aim in this article is to deploy as simple a configuration as possible. We’ll be deploying two nfs-ganesha servers, layered on top of an existing CephFS cluster. The servers in this case are simple daemons running on regular hosts (VMs in my case, but this could be done on bare metal as well).

We will not be using any sort of traditional clustering software (e.g. pacemaker). This deployment is two independent NFS servers that are serving out the same backend filesystem. They will have separate addresses, and clients will need to be distributed between them manually.

“Failover” in this configuration will mean manually unmounting and remounting the other server on the client. If the server dies, we’ll just restart the whole thing.


For this, you’ll need an existing CephFS cluster. Setting this up is outside the scope of this article, but I deployed one under kubernetes with rook. For this configuration, you really want to be running a Nautilus (v14) cluster, as there are many bugfixes and some new interfaces tailored for this use-case that make things much more reliable. At the time of this writing, nautilus is not yet released but should be in spring(-ish) 2019.

You’ll also need two hosts on which to run nfs-ganesha. Any sort of host will do. It’s even possible to run nfs-ganesha in an unprivileged docker or kubernetes container as it’s just a userland application that has no real need for local storage.

You can add more hosts if you like, but there is no need for quorum to handle split-brain scenarios here. Everything is coordinated via RADOS objects, so as long as the necessary objects are accessible by all of the cluster hosts they should be fine.

The nfs-ganesha packages need to be v2.7 or above. Earlier releases did not have the rados_cluster recovery backend driver. I used two virtual machines running Fedora Rawhide (the Fedora 30 prerelease). That way the packages I needed would be available via the standard package repos.


Install the OS on both hosts. I installed Fedora 29, and then updated to Fedora Rawhide from there. Install the necessary packages to run ganesha. For convenience, I also installed the ceph-common package, which allowed me to test the ability for the host to contact the ceph cluster:

# dnf install nfs-ganesha nfs-ganesha-ceph nfs-ganesha-rados-grace ceph-common

Configure Ceph:

Configure the hosts to be able to access the ceph cluster. This part will depend on your particular ceph deployment. You’ll need the mon addresses at a minimum, and if you’re running with encryption enabled, the keyring file.

# cat /etc/ceph/ceph.conf 
[global]
mon_host =,,

keyring = /etc/ceph/keyring

Here’s my keyring file. This should be readable only by root (or whatever user the ganesha daemon will be running as):

# cat /etc/ceph/keyring 
[client.admin]
	key = AQCafw5cEiphIBAAkmlazVh1O72yH+mXVrO6Tw==

Note that for this tutorial I’m not being at all careful about the cephx principals in use, and am just allowing the ganesha servers to use client.admin. For a real production deployment, you’d want to have the ganesha daemons use dedicated principals that just allow them the minimal access needed to the cluster.

Once you have this set up, you should be able to run commands like this to verify that the cluster is operational:

$ rados lspools

Here we see the data and metadata pools for our CephFS.

Configure Ganesha:

Add a ganesha.conf file on both hosts. The nfs-ganesha-ceph package ships with a sample ceph.conf file that we can tweak for our purposes. Here’s mine with the comments stripped out:

grep -v '^\s*#' /etc/ganesha/ganesha.conf | grep -v '^\s*$'
NFS_CORE_PARAM
{
	Enable_NLM = false;
	Enable_RQUOTA = false;
	Protocols = 4;
}
NFSv4
{
	RecoveryBackend = rados_cluster;
	Minor_Versions = 1,2;
}
MDCACHE
{
	Dir_Chunk = 0;
	NParts = 1;
	Cache_Size = 1;
}
EXPORT
{
	Protocols = 4;
	Transports = TCP;
	Path = /;
	Pseudo = /ceph/;
	Access_Type = RW;
	Attr_Expiration_Time = 0;
	FSAL {
		Name = CEPH;
	}
}
RADOS_KV
{
	pool = "cephfs.a.meta";
	namespace = "ganesha";
	nodeid = "a";
}

Note that we disable anything prior to NFSv4.1 here. This is because enabling legacy versions of NFS precludes us from being able to lift the grace period early, and that is a huge win on a server restart. There is also some tuning in there to disable most of ganesha’s caching. We’re already caching aggressively in the ceph libraries, so there’s no need to duplicate it here.

The rados_cluster recovery backend stores its info in RADOS objects. It’s not generally a lot of data, but we want it to be highly available. We piggyback onto the CephFS metadata pool for this purpose, and declare a new “ganesha” namespace in it to keep it distinct from cephfs objects.

Cluster Nodeids:

Most of the configuration is identical between the hosts, but the nodeid setting in the RADOS_KV block requires special mention. It must be a unique string for each node in the cluster. Here, we’ve set it to “a”. The second host in this cluster has it set to “b”.

It’s possible to not set this, in which case nfs-ganesha will default to using the hostname. That’s a perfectly valid choice for nodeid, as long as it is unique in the environment and is persistent across reboots.

Populate the Cluster Grace Database:

Now, we need to ensure that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. Ganesha uses this object to communicate its current state and needs with respect to the grace period.

The nfs-ganesha-rados-grace package contains a command-line tool for querying and manipulating this database. We’ll use that to create the DB and add both nodeids. On one of the hosts, we run:

$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add a
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add b
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha
cur=1 rec=0
a E
b E

This creates the grace database and adds both “a” and “b” to it. The last command has it dump the current state. Newly added hosts are always considered to be enforcing the grace period so they both have the E flag set.

The “cur” and “rec” values show the current and recovery epochs, which is how we keep track of which hosts are allowed to perform recovery and when.
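Here's a toy model of those semantics, just to show how the flags and epochs fit together. The real logic lives in nfs-ganesha's rados_grace support; this sketch is an interpretation, not the actual implementation:

```python
# Toy model of the shared grace database: each node can have N (needs a
# grace period) and E (enforcing the grace period) flags. Starting a new
# grace period bumps the current epoch and records the old one as the
# recovery epoch; the last node to finish reclaim lifts it.

class GraceDB:
    def __init__(self, nodes):
        self.cur, self.rec = 1, 0          # current and recovery epochs
        # newly-added nodes always start out enforcing the grace period
        self.flags = {n: {"E"} for n in nodes}

    def start_grace(self, node):
        """A (re)started node requests a grace period so it can allow reclaim."""
        if self.rec == 0:                  # first requester starts a new epoch
            self.cur, self.rec = self.cur + 1, self.cur
        self.flags[node] |= {"N", "E"}

    def done_reclaiming(self, node):
        """This node's clients finished reclaim; the last one lifts the grace period."""
        self.flags[node].discard("N")
        if not any("N" in f for f in self.flags.values()):
            self.rec = 0

    def stop_enforcing(self, node):
        """Nodes may only stop enforcing once the grace period is fully lifted."""
        if self.rec == 0:
            self.flags[node].discard("E")

db = GraceDB(["a", "b"])
db.start_grace("a")
print(db.cur, db.rec)   # 2 1
```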

Open firewall ports:

On Fedora, I needed to allow clients to get to the server. On both hosts:

# firewall-cmd --permanent --add-service=nfs
# firewall-cmd --reload

This opens port 2049 so that the server is accessible from other hosts.

Start the Servers:

Start up the server on each host:

# systemctl start nfs-ganesha
# systemctl enable nfs-ganesha
Created symlink /etc/systemd/system/ → /usr/lib/systemd/system/nfs-ganesha.service.

If all went well, the ganesha.nfsd daemon should be running on both hosts. Now check the grace database:

$ ganesha-rados-grace -p cephfs.a.meta -n ganesha 
cur=3 rec=0
a
b

Note that both nodes at this point have cleared their E flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.

From here, you should be able to mount either server from clients and do normal NFS operations against them.


This configuration contains an implicit assumption that if one of these servers goes down, we will restart it within 5 minutes. After 5 minutes, the Ceph MDS may cancel the session that the ganesha client held, along with all of the state associated with it. If the session’s caps (stateful objects) get cancelled before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.

This is one of the reasons Red Hat is focused on containerized deployments for this. Restarting a container is quite a bit cheaper and has a much smaller footprint than an entire VM.


This is just a very quick overview of how to deploy a cluster in as simple a way as possible. I don’t consider this to be a configuration that many people would want to use, but it illustrates what is possible with other, layered technologies to improve resiliency.

Active/Active NFS over CephFS

Since returning to Red Hat, I’ve done a fair bit of work in enabling nfs-ganesha to serve out cephfs. The operating idea here has been to push as much infrastructure into CephFS as we can to enable us to properly implement NFS semantics.

nfs-ganesha then is just another cephfs client. It hands out NFS state to the clients and holds the corresponding CephFS state on their behalf.

With a single NFS server, this works reasonably well. We’ll even have delegation support (coming soon!). The question now is — can we stand up an arbitrary set of servers that only interact with one another via ceph?

The advent of technologies like kubernetes makes it attractive to be able to stand up cookie-cutter NFS servers dynamically as the load requires. How loosely can we couple them without breaking NFS recovery semantics?

Note that in the discussion below, I’m deliberately ignoring other applications besides NFS that are interacting with the cluster during the grace period. That’s potentially a real problem, but for now I’m ignoring it for the sake of simplicity.

The basic problem:

The main issue here is that NFS server state is not persistently stored in a fine-grained way. Most of the state (opens, locks, delegations, layouts) is only stored in memory. That’s not usually a problem on a standalone NFS server.

When a NFS server crashes, it comes up with zero state held by clients. We inform them of the crash, and they then start reclaiming their state within a certain window (usually 2 lease periods — around 90-120s depending on the implementation). During the grace period, no new state can be established — clients can do reclaims only. Once the grace period has ended, the reverse is true.

There are some corner cases here involving network partitions and multiple reboots that require us to maintain a list of clients that hold at least one open on stable storage. That’s also fairly simple to handle with a single server.

Scaling out, the simple way:

There are ways to scale out without needing to do a lot of complex coordination. If you know that an inode can only ever be served out by a single NFS server, then you can really treat it as wholly independent of other servers. There can be no conflicting state acquired via other NFS servers that way.

If you imagine a set of servers that only serve out a single export each via an individual IP address, with no hardlinks that cross export boundaries, then that should work just fine today without any external coordination.

With something like kubernetes, where you could potentially restart a cookie-cutter NFS server very quickly this might be a workable solution. You effectively allow yourself to be able to scale out as the number of exports grows.

Serving the same content in parallel:

That sort of configuration is potentially limiting, though. We really want to be able to scale out according to the number of clients we have, not the number of exports.

That means that we need to allow multiple servers to export the same data in parallel.

Suppose we have a setup that looks like this:

+----------------+      +----------------+
|                |      |                |
|  NFS Client 1  +------+  NFS Server 1  +----------+
|                |      |                |          |
+----------------+      +----------------+    +-----+----+
                                              |          |
                                              |  CephFS  |
                                              |          |
+----------------+      +----------------+    +-----+----+
|                |      |                |          |
|  NFS Client 2  +------+  NFS Server 2  +----------+
|                |      |                |
+----------------+      +----------------+

In this configuration, we don’t float IP addresses between NFS Server 1 and 2 like we would with a traditional clustered configuration. Instead, we assign an address to each NFS server, and reconstitute the server from scratch if it crashes.

All of the stable storage info for the NFS server is stored in RADOS, though each NFS server has its own database for recovery.

Now, suppose NFS Server 1 crashes. At that point, NFS Client 1 knows what NFS state it holds. The CephFS server knows what capabilities and state NFS Server 1 held before it crashed.

Kubernetes or whatever detects that it’s down and creates a new instance of NFS Server 1. NFS Server 1 comes up knowing only the list of clients that are allowed to reclaim their state. What we need to happen at this point is for CephFS to release the state that was held by the previous NFS Server 1 incarnation, and allow NFS Client 1 to reclaim it.

In the window between when the CephFS caps held by the earlier NFS Server 1 incarnation are trashed and when NFS Client 1 reclaims its state, NFS Client 2 (talking to NFS Server 2) could race in and set some state that conflicts with what NFS Client 1 holds.

We must prevent this from occurring. The simplest way to do that is to force all of the other NFS servers into the grace period prior to cephfs revoking the caps that the old client held.

In practical terms, this means that we need to ensure that all of the remaining NFS servers are enforcing the grace period prior to restarting a crashed NFS server in the cluster. They are all only allowed to exit the grace period once all of the servers have stopped allowing reclaim.

That also neatly handles the case where we have a clean start from zero state in the cluster. All servers will go into their grace period until clients are finished reclaiming (or all of the individual grace periods time out).

Handling excessively tardy server restart:

It’s also possible for a NFS server to go down while clients still hold state on it, but then take so long to come back that the MDS decides to give up on it and revokes its caps anyway.

Clients of these servers must not be allowed to reclaim any state, as we cannot guarantee consistency in that situation. The simplest way to do that is to simply trash its persistent client database if the caps that it held previously were revoked outside of the grace period. That will prevent any reclaim. Those surviving clients will likely need to be restarted, but that’s really the best we can do in that situation.

Decommissioning a server instance with active clients:

If we know that we want to decommission one of the NFS servers (scaling down the cluster at times of low-load, for instance), we should be able to use migration to move them to other servers (support for that would need to be added to ganesha).

Alternately we could merge its IP addresses and client databases into another existing server. There are several options here. The main point is that we do need to take proactive steps to “drain” a server before decommissioning it.


So, that more or less lays out the rules for how we need various NFS servers to behave. The question now is what will ensure that we follow them.

Traditional clustering does this with stuff like corosync and pacemaker, but having to use that when we already have clustering infrastructure in ceph itself seems wrong.

The current thinking (credit to Patrick Donnelly) is to add a ceph monitor service and map for the NFS gateways. That map would be consumed by a new daemon that would manage the ganesha daemon, possibly using dbus to handle the grace period of a running instance.

Alternately, we implement this as a new shared library in ceph that ganesha would link in. This part is still being designed/determined.

Overall, I’m pretty excited by the potential of the overall design here though. I think it should offer us the ability to scale nearly linearly, at least with some common usage patterns.

Future directions:

Beyond this design, we may want to look at how to more effectively move clients around the cluster based on their access patterns. For instance, we may want to explore assigning a NFS server to each CephFS MDS in a multimds cluster, and leverage NFS migration to have clients access objects on the NFS server closest to the authoritative MDS for an inode.

That sort of design might be very interesting coupled with pNFS. A flexfiles layout is something of a natural fit over CephFS.

The inode->i_version counter in Linux

I’ve written previously about the NFSv4 change attribute and CephFS. The Linux kernel has also supported it for a long time via an i_version field in the inode structure.

Most kernel subsystems just treat this field as an opaque value. Some filesystems use it internally, primarily to track changes to directories (mostly for handling directory changes during readdir). The kernel NFS client uses that field to store the NFSv4 change attribute.

The exception is when the filesystem sets the MS_I_VERSION flag in the superblock. That flag tells the kernel that it should increment i_version on regular files when it goes to change the timestamps during a data or metadata change. Only a few filesystems in the kernel do this: xfs and btrfs always have it enabled. ext4 has it off by default but you can turn it on with the i_version mount option.

With that enabled, the kernel NFS server will populate the NFSv4 change attribute from that value as well. This is what allows us to (more or less) eliminate timestamp granularity problems for NFSv4 clients when exporting those filesystems.

The important bit to note here is that there is a lot of inconsistency on how this field is handled in the kernel. Some filesystems have very different semantics from others. AFS for instance uses this as the data version counter which is not incremented on metadata changes. For the most part though, these differences haven’t really mattered as the users of this field are limited and are confined to the kernel.

But, that may change in the not too distant future. There is (hopefully!) a new syscall on the horizon: statx(). With that, exposing the i_version field to userland could be very useful. There are a lot of applications that would like to have a cheap and reliable indicator that something changed on an inode. Before we can expose that field however, we’ll need to sort out the inconsistencies between the implementations, and clearly define what this field should mean for applications.

Now we circle back to Ceph. The Ceph change attribute implementation exploits the fact that we don’t necessarily need to bump the counter atomically on every change. When userland ceph clients are able to cache a change to the metadata, they bump the counter in their local copy. When the change counter is queried by an application (which uses an interface very similar to statx), all of the cached copies are synched up before handing the value back to the caller.

Given that we have at least one filesystem that already benefits from looser semantics for the i_version counter, can we also relax this requirement in in-kernel filesystems to reduce the number of metadata updates that are required on a write-heavy workload?

Can we make it go faster?

A few months ago, a colleague of mine who primarily works on local filesystems asked about the feasibility of turning off the i_version counter. Apparently some of the Red Hat QA folks were concerned about an XFS performance regression when moving from RHEL6 to RHEL7. They had tracked it down to the fact that XFS had revved its on-disk metadata format, and the i_version counter had been switched on universally.

The workload in this case was very small I/Os in iozone which is clearly synthetic, but it got me thinking that it might be possible to reduce the number of situations where we need to dirty the inode metadata.

Basically, the idea is to set a flag in the inode whenever the i_version counter is queried. When we next go to bump the counter, we only do so if that flag is set, clearing it in the process. Finally, we fix the filesystems to skip logging the change to disk when the counter was not incremented and the times didn’t change.

For the common case where no one has queried the counter, this reduces the frequency at which the inode metadata is dirtied to something on par with the timestamp granularity.
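
A userspace sketch of that scheme (an illustration, not the kernel’s actual code): steal the low bit of the counter as a “queried” flag, set it on every read, and have writers bump the counter only when the flag is set:

```c
/* Userspace sketch of the "only bump i_version if someone queried it"
 * idea.  The low bit of the 64-bit word is the QUERIED flag; the
 * counter value lives in the upper 63 bits. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define I_VERSION_QUERIED 1ULL   /* low bit: "someone has seen this value" */

/* Reader: fetch the counter and mark it as queried. */
static uint64_t query_iversion(_Atomic uint64_t *iv)
{
    uint64_t cur = atomic_fetch_or(iv, I_VERSION_QUERIED);
    return cur >> 1;             /* strip the flag bit */
}

/* Writer: bump only if the current value has been queried.
 * Returns true if the inode must be marked dirty. */
static bool maybe_inc_iversion(_Atomic uint64_t *iv)
{
    uint64_t cur = atomic_load(iv);
    for (;;) {
        if (!(cur & I_VERSION_QUERIED))
            return false;        /* no one looked: skip the bump */
        /* new value: old counter + 1, flag cleared */
        uint64_t newval = ((cur >> 1) + 1) << 1;
        if (atomic_compare_exchange_weak(iv, &cur, newval))
            return true;
    }
}
```

Since a bump clears the flag, a burst of back-to-back writes after a single query costs only one metadata update.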

Which is neato, and may really help those small buffered I/O workloads where fsync is only infrequently called. As a pessimal case, we can play with time and dd doing 1-byte writes:

$ time dd if=/dev/zero of=/mnt/scratch/foo bs=1 count=10000000

Which is a silly workload of course, but it should give us the most dramatic results. On XFS on my fairly crappy test rig on an unpatched kernel:

real 0m13.840s
user 0m0.964s
sys 0m12.824s

After my draft patch series:

real 0m6.753s
user 0m0.617s
sys 0m6.114s

…about half the time. Again though, this is not at all a real-world workload, so I don’t think it’s sufficient to merge this based on that alone. I see some small speedup on kernel builds but nothing to get too excited about.

That said, ext4 has the i_version counter off by default, and you have to specify a mount option to turn it on. I don’t think they’d do that unless they considered the counter too costly, so there may well be a use case here.

So, I pushed this into my kernel tree, and a few days later I got this from the Intel kernel test robot:


    FYI, we noticed a 43.5% improvement of vm-scalability.throughput due to
    commit: fa629d46c4da556a77c7b8c7760e734dd88d1f3e ("fs: only set S_VERSION
    when updating times if it has been queried")

43%! That certainly sounds significant. The docs for the vm-scalability testsuite say this about the test that showed the improvement:

    Create a sparse file with size of $MemTotal, before creating N threads, it
    will preallocate and prefault 1/2 memory space with mmap using this sparse
    file as backing store and then the N threads will all write data there
    using the preallocated space. When this is done, use msync to flush
    change back to the file.

Is that representative of any sort of real-world workload?

CephFS and the NFSv4 Change Attribute

Since I’ve rejoined Red Hat, I’ve been doing some work within cephfs to better understand the code. One of the projects I’ve been working on is to add a statx-like interface to libcephfs (the userland ceph client code). As part of that, I’m adding in support for file create times (also known as the “btime”), and an NFSv4 change attribute, with the aim of adding support for it to nfs-ganesha.

The NFSv4 Change Attribute

Good NFS performance relies on being able to cache file data aggressively. Clients use inode metadata to tell whether they should invalidate their caches. Basically they look at certain metadata in the inode (usually via a GETATTR operation) and then invalidate the cache if it looks like it has changed.

When doing something like a write() syscall however, POSIX says only that you must restamp the modification time (mtime) of the inode with the current time. Since that action changes the metadata, the inode change time (ctime) must also be stamped with the current time.

Older versions of NFS relied heavily on watching for timestamp changes in order to tell if the cache was valid. This was always problematic though. Timestamp granularity on Linux is generally on the order of a millisecond, and a lot can happen within that window. Clients that see no apparent change to the times can end up with stale data in their caches.

So, when NFSv4 was drafted, it mandated a new metadata attribute — the change attribute. The general idea is that it is an unsigned 64-bit number that must increase in value whenever there has been either a data or metadata change to the file. When a client queries it and the file has changed in some fashion, the value must be larger than it was when the client last queried for it.

Servers have a bit of latitude on how they implement the change attribute, but the end result is always semi-opaque to the client. It can only look at the change attribute and see whether it has increased since the last time it was checked. If so, then it can infer that something has changed in the file (either data or metadata).

Note that clients are not allowed to inspect the value to see how much it has changed. The only thing they can do is look and see if it’s larger than the previous value. At that point, any data or metadata cached on that inode is suspect.
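
In code, the client-side rule boils down to something like this (a sketch, not any particular client’s implementation):

```c
/* Sketch of the only legal client-side use of the change attribute:
 * detect that it grew, never interpret the delta. */
#include <stdbool.h>
#include <stdint.h>

struct cached_inode {
    uint64_t change_attr;    /* value seen at the last GETATTR */
    bool cache_valid;        /* cached data/metadata still usable? */
};

/* Called with a freshly fetched change attribute.  The server
 * guarantees the value only ever grows, so any difference means
 * something changed.  Returns true if the cache was invalidated. */
static bool revalidate(struct cached_inode *ci, uint64_t server_change)
{
    if (server_change == ci->change_attr)
        return false;            /* nothing changed; cache stands */

    /* Any increase invalidates ALL cached data and metadata. */
    ci->change_attr = server_change;
    ci->cache_valid = false;
    return true;
}
```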

CephFS capabilities

First a bit of background on CephFS. CephFS is quite different from other network filesystems in that parts of the data and metadata can be delegated to different clients in a very granular way. These are called capabilities (aka caps) in CephFS parlance. For instance, the owner, group and mode are all under the AUTH capability, whereas the mtime and size are under the FILE capability.

These generally come in two flavors: shared and exclusive. Shared caps allow clients to cache the existing values, while exclusive caps allow the client to change the values and buffer that change until the capability is revoked.

Furthermore, the CephFS metadata server can grant different exclusive caps to different clients on the same inode, allowing them to cache changes to the metadata and data. The MDS will recall these capabilities in order to ensure that other clients see the changes that those clients may be caching, prior to satisfying a request that would provide those attributes. At that point, the clients will return their capabilities and update the server with the new metadata.

Thus it’s possible that one client can have the AUTH exclusive cap (allowing it to cache a change to the mode, for instance), and then another client may hold the FILE exclusive cap which allows it to buffer up writes to the file and the associated changes to the mtime.

By way of shorthand, we often refer to the capabilities with a capital letter for the capability, followed by a lowercase letter denoting whether it’s shared or exclusive: Fs for a shared FILE cap, for instance, or Ax for an exclusive AUTH cap.

Adding a NFSv4 Change Attribute to CephFS

The implementation adds the change attribute to the FILE capability since writes are more common than explicit metadata changes, and it often changes in conjunction with the mtime.

But! Suppose we have one client that holds AUTH exclusive caps (Ax) and another with the necessary caps to buffer writes to the file and cache mtime changes (we’ll denote this Fx for now, but the FILE cap is actually a bit more complex).

Both clients start out seeing the same change attribute (call it 1), and the same mode on the file. The Ax client does a change of the mode, and bumps his change attribute to 2. The Fx client does a write, bumps his change attribute to 2, and then does a GETATTR to fetch the mode and change attribute. What happens now?

The MDS then recalls the Ax cap from the first client, sees that the change attribute is now at 2 and changes the value in the MDS cache. It then sends updated inode attributes to the other client (the one holding Fx caps). That client sees the change attr at 2.

But wait — he already has a value of 2. Is that a violation of the semantics?

The answer (I believe) is no. Recall that the client is not allowed to infer anything from the actual value of the change attribute. It can only check to see if it’s newer than what it had before. So, while the Fx client did have the same change attribute cached after its write, its cached inode attributes (including the mode) became invalid on the transition from 1 to 2.

IOW, it’s perfectly legitimate to “bundle” several changes under a single change attribute bump, as long as you ensure that no one has queried for the change attribute in the meantime.

Note that there are even more complex situations that can involve NFSv4 delegations, but the key points here are that all cached metadata becomes invalid once the change attribute changes, and that the MDS will recall (and sync up) changes to the change attribute when there are exclusive caps outstanding to another client that cover the metadata being queried.

Current Progress

The patchset has been through several iterations and there is currently a pull request up to add the first pile of patches. nfs-ganesha will require adding several other ceph_ll_* interfaces, but the heavy lifting is already done here. We’re hoping this will make the next CephFS release (Kraken).

File-private POSIX locks (aka un-POSIX locks)

The Problems with POSIX Locking

At last year’s Linux Storage and Filesystem summit, we had a lively discussion about what Linux needs in order to make it easier to implement fileservers on top of it. One of the bullet points brought up by Boaz Harrosh was the brokenness of the POSIX file locking standard. There are really two different problems:

POSIX Locks are tracked on a per-process basis such that lock requests by the same process cannot conflict with one another. Thus, they are useless for synchronization between threads within the same process. This also means that locks are not inheritable across a fork() since they are intimately tied to the process.

The bigger issue is that all POSIX locks held by a process are dropped any time the process closes any file descriptor on the inode that the process holds open. This is the bit that catches most programmers by surprise. It requires that you carefully delay closing any file descriptors held open on a file until you’re sure that all locks on that file can be released. As Jeremy Allison points out in his blog “this is not behavior that you would ever want!” (he also has a good description of how this standard came about — see the section entitled “First Implementation Past the Post”).

It also leads to the “library problem”. Occasionally libraries may need to open a file in order to read or write to it and then close it again. If a library does this while the application has the file open with locks held on it, those locks are then lost. In many cases, this is not obvious to the calling program.

BSD locks (via the flock() mechanism) have much saner semantics with respect to inheritance and behavior on close. Locks are tied to the open file table entry, not the process. If a process opens a file twice and tries to set exclusive locks on both, the second one will be denied. BSD locks are only released automatically when the last reference to the open file on which they were acquired is closed. The only problem is that BSD locks are whole-file locks only; they aren’t byte-range aware.
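
Those flock() semantics are easy to demonstrate with a throwaway temp file (a sketch; the file path is arbitrary):

```c
/* Demonstrates the BSD flock() behavior described above: the lock
 * belongs to the open file table entry, so a second open of the same
 * file by the same process is denied the exclusive lock, and closing
 * that second fd does not release the first one's lock. */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

static int flock_demo(void)
{
    char path[] = "/tmp/flockXXXXXX";
    int fd1 = mkstemp(path);
    int fd2 = open(path, O_RDWR);
    if (fd1 < 0 || fd2 < 0)
        return -1;

    if (flock(fd1, LOCK_EX | LOCK_NB) != 0)             /* first open wins */
        return -1;
    if (flock(fd2, LOCK_EX | LOCK_NB) != -1 || errno != EWOULDBLOCK)
        return -1;                                      /* second is denied */

    close(fd2);                     /* does NOT drop fd1's lock... */
    fd2 = open(path, O_RDWR);
    if (flock(fd2, LOCK_EX | LOCK_NB) != -1 || errno != EWOULDBLOCK)
        return -1;                  /* ...it's still held */

    close(fd2);
    close(fd1);                     /* last reference gone: lock released */
    unlink(path);
    return 0;
}
```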

What we need is a new lock type that is a hybrid of the two. We want BSD behavior for inheritance and close, but the locks need to be byte-range locks. Furthermore, since we have a large legacy codebase of programs that use “classic” POSIX locks, these new locks need to conflict with them so that programs using the new locks will interoperate correctly with them.

A Possible Solution

I have a preliminary patchset to add a lock that behaves like this. I’m calling them “file-private” locks for now, but I’m amenable to better names if anyone has one. The current patchset adds a new set of cmd values to fcntl(2). The interface is very similar to the traditional POSIX lock interface, even down to the struct flock argument that is passed in. The main difference is that these locks behave more like BSD locks with respect to inheritance and close. The new cmd values are (in a nutshell):

F_SETLKP:  Acquire or release a file-private lock.
F_SETLKPW: Acquire or release a file-private lock. If the lock isn't available then wait for it.
F_GETLKP:  Test whether a file-private lock would be able to be acquired.

Whereas classic POSIX locks are “owned” by the process, these are owned by the open file. Thus, these locks will conflict with “classic” POSIX locks, even if they are taken on the same file descriptor by the same process. If, however, an fd is inherited by a child process via fork(), then locks acquired via that same fd in both the parent and child will not conflict (since the open file table entry is the same).

Also, the locks are only released automatically when the last file descriptor referring to that open file is closed. So, if the parent takes a lock on an fd and forks, and the child then closes that fd, the lock will stay around until the parent closes its copy. By the same token, if the parent closes the fd first, the lock will stay around until the child closes it.
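
Mainline Linux ultimately merged this work under the name “open file description” (OFD) locks, so the behavior above can be demonstrated today via the F_OFD_SETLK command (a sketch; substitute the F_SETLKP names if you’re running the original patchset):

```c
/* Demonstrates the "owned by the open file" semantics using the OFD
 * lock commands.  Note that l_pid must be zero for OFD locks. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static int ofd_demo(void)
{
    char path[] = "/tmp/ofdXXXXXX";
    int fd1 = mkstemp(path);
    int fd2 = open(path, O_RDWR);  /* a second open file description */
    if (fd1 < 0 || fd2 < 0)
        return -1;

    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 10,               /* byte-range, unlike flock() */
    };

    if (fcntl(fd1, F_OFD_SETLK, &fl) != 0)
        return -1;

    /* Same process, same file, but a different open file description:
     * this conflicts, where classic POSIX locks would silently merge. */
    if (fcntl(fd2, F_OFD_SETLK, &fl) != -1 ||
        (errno != EAGAIN && errno != EACCES))
        return -1;

    close(fd2);                    /* does not drop fd1's lock */
    fd2 = open(path, O_RDWR);
    if (fcntl(fd2, F_OFD_SETLK, &fl) != -1 ||
        (errno != EAGAIN && errno != EACCES))
        return -1;

    close(fd2);
    close(fd1);
    unlink(path);
    return 0;
}
```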

With this, someone implementing a fileserver for Linux that needs to manage byte-range locks on behalf of a client can simply keep a distinct open file for each “lockowner”.

I have an initial implementation of this code for Linux, and have opened an RFE with the Austin Group to have them consider adopting this as part of the formal POSIX standard. So far they seem amenable to the idea.

Active/Active NFSv4 Serving — Userspace NFS Servers

Much of the difficulty in creating a Linux-based active/active NFS server is the fact that we’ll need to upcall so much, since there are so many things that are too difficult to handle in the kernel.

The existing (single-node) NFSv4 server in the kernel already has to upcall for:

  • mountd/exportfs information
  • ID mapping
  • Client ID tracking (technically, this isn’t an upcall currently. The kernel manages files in a well-known directory via VFS interfaces. That’s pretty gross though, and we plan to switch that to an upcall soon)

Now, consider all of the information that will need to flow between nodes with a clustered NFS server. That’s at least one more upcall of some sort, and probably more than one.

At some point, we need to ask ourselves: “Why do any of this in the kernel at all?”

Doing this in userspace smooths over a lot of the hassles. krb5 auth and ID mapping, for instance, become much simpler to handle, and we cut down on round trips between kernel and userspace. Also, if we want to do something like plug in Gluster on the backend, we no longer have to deal with multiple round trips through the kernel and back. We can just plug the userspace NFS server into the Gluster userspace libs and let ’er rip.

Userspace NFS servers do have their own sorts of hassles as well:

  • filehandles: traditionally, the main reason for an in-kernel NFS server was to eliminate the need to track filehandle-to-path information. When a server gets an NFS filehandle, it has to figure out which inode that refers to. Most userspace servers have relied on tracking that info in a separate database, but that approach has many problems if something else is manipulating the filesystem too. With recent Linux kernels, we have open-by-filehandle support, which obviates much of the need for this. There are apparently still some issues with this support, but I think we can probably fix those as they are identified.
  • ownership/permissions: most current userspace NFS servers just run as root and try to fix up and enforce permissions after the fact. This is less than ideal, since it’s racy and means that you have to enforce permissions in userspace. The main reason for this is that the POSIX setuid() family of syscalls works on a per-process basis. If you have a multithreaded process, then changing uids will change them on every thread. This makes it really difficult to handle the situation where you want to do something like open or create a file as a particular user. open() is a synchronous syscall, so ideally you’d spawn a thread to handle that call so that other calls arriving in parallel can still be processed. Unfortunately, you may need to change uids to handle the next call coming in, so there is also a need to be able to set per-thread credentials. glibc has no support for this currently. The Samba server guys are calling the setfsuid() syscall directly to handle this, but that’s less than ideal. It would be nice to have some sort of officially blessed interface for this.
  • Asynchronous IO: buffered AIO is non-existent under Linux. There have been many attempts to make that work, but none have come to fruition. The current fashion is to spawn threads that do standard synchronous I/O calls. It’s kludgey but as long as you have per-thread credentials, that approach works properly.
  • recvfile(): ideally, we’d like to be able to move pages of data straight from a socket to a file. We have the reverse currently — sendfile() works and can inline pages from the pagecache directly into an skbuff frag list. This allows a userspace server to implement zero-copy read calls. It would be nice to be able to do zero-copy writes as well, by yanking pages out of a receive buffer and dropping them directly into the pagecache. Some OSes have recvfile() for that purpose. Linux has splice(), which is supposed to be a way to implement recvfile(). Unfortunately, splice() is broken in this regard — it always copies the data to the pagecache instead of moving pages into it. This would offer up a nice performance boost if it could be fixed.
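
The open-by-filehandle support mentioned in the first bullet is the name_to_handle_at(2)/open_by_handle_at(2) pair. A sketch of the encode side is below (decoding requires CAP_DAC_READ_SEARCH, so it appears only in a comment; the path argument is arbitrary):

```c
/* Sketch: encode a file handle the way a userspace NFS server might,
 * using name_to_handle_at(2).  Not all filesystems support handles,
 * so we treat EOPNOTSUPP as a soft failure. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>          /* struct file_handle, name_to_handle_at() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int handle_demo(const char *path)
{
    struct file_handle probe = { .handle_bytes = 0 };
    int mount_id;

    /* A call with handle_bytes == 0 just reports the needed size. */
    if (name_to_handle_at(AT_FDCWD, path, &probe, &mount_id, 0) == 0)
        return -1;                          /* should fail with EOVERFLOW */
    if (errno == EOPNOTSUPP) {
        puts("filesystem doesn't support file handles");
        return 0;
    }
    if (errno != EOVERFLOW)
        return -1;

    struct file_handle *fh = malloc(sizeof(*fh) + probe.handle_bytes);
    if (!fh)
        return -1;
    fh->handle_bytes = probe.handle_bytes;
    if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) != 0) {
        free(fh);
        return -1;
    }
    printf("handle: %u bytes, type %d\n", fh->handle_bytes, fh->handle_type);

    /* A server would stash fh (handles stay valid across reboots) and
     * later, with CAP_DAC_READ_SEARCH, reopen the inode directly:
     *   int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
     * ...no filehandle-to-path database required. */
    free(fh);
    return 0;
}
```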

There are other problems too, of course. The TI-RPC userspace libraries are notoriously buggy when you use threads. The NFS Ganesha folks have done a bunch of work on them, but they haven’t pushed their changes upstream yet. There was talk about that at the recent Bake-a-thon, so hopefully it’ll come to fruition soon.

Active/Active NFSv4 on Clustered Filesystems: avoiding the VFS layer

So far, most of my posts about this proposed design have been about teaching the VFS and underlying filesystems how to handle the state properly for NFSv4 semantics. We could however, consider another design. We could have a clustered daemon that runs and tracks the state independently of the VFS. In other words, we could go with a more Samba/CTDB type design here.

knfsd would upcall to talk to a daemon, and that daemon would be the final arbiter. Such a daemon could be integrated with Samba/CTDB such that knfsd and Samba are aware of each other’s state. In principle, that would allow you to export the same filesystem via both knfsd and Samba/CTDB and get reasonable semantics for both. Oplocks and delegations would get broken appropriately, share-mode reservations should work, and (possibly) we could make locking work in a more consistent fashion than we do today.

Of course, there’s a catch — nothing else would be aware of that state. It would all be tied up inside the CTDB database (or whatever clustered DB infrastructure we’d end up using). The result would be an NFS+SMB “appliance”. I think that’s a less desirable design than one that’s more general-purpose, but it might be easier to achieve, and we might be able to hammer it out quicker since we’d avoid a lot of the VFS-layer work.

In the near term, we don’t really need to make this decision. Either way, we’ll still need to be able to swap in the correct set of operations to handle it, so the focus for now can be on simply abstracting out the parts of the server code that we’ll need to swap out to handle this later. It should even be possible to do this sort of design as an interim step, and then add what the VFS would need for a more general solution later.