I recently wrote a post for the Ceph community blog on how to use Rook to deploy a CephFS cluster with an NFS gateway on top. It glosses over some important bits (like securing the thing properly) but it shows what is possible with Rook+Ceph+nfs-ganesha.
A little over a year ago, I made a post about a new design for active/active NFS clustering. Since then, an implementation of it has been merged into nfs-ganesha.
The aim in this article is to deploy as simple a configuration as possible. We’ll be deploying two nfs-ganesha servers, layered on top of an existing CephFS cluster. The servers in this case are simple daemons running on regular hosts (VMs in my case, but this could be done on bare metal as well).
We will not be using any sort of traditional clustering software (e.g. pacemaker). This deployment is two independent NFS servers that are serving out the same backend filesystem. They will have separate addresses, and clients will need to be distributed between them manually.
“Failover” in this configuration will mean manually unmounting and remounting the other server on the client. If the server dies, we’ll just restart the whole thing.
For this, you’ll need an existing CephFS cluster. Setting this up is outside the scope of this article, but I deployed one under kubernetes with rook. For this configuration, you really want to be running a Nautilus (v14) cluster as there are many bugfixes and some new interfaces tailored for this use-case that make things much more reliable. At the time of this writing, nautilus is not yet released but should be in spring(-ish) 2019.
You’ll also need two hosts on which to run nfs-ganesha. Any sort of host will do. It’s even possible to run nfs-ganesha in an unprivileged docker or kubernetes container as it’s just a userland application that has no real need for local storage.
You can add more hosts if you like, but there is no need for quorum to handle split-brain scenarios here. Everything is coordinated via RADOS objects, so as long as the necessary objects are accessible by all of the cluster hosts they should be fine.
The nfs-ganesha packages need to be v2.7 or above. Earlier releases did not have the rados_cluster recovery backend driver. I used two virtual machines running Fedora Rawhide (the Fedora 30 prerelease). That way the packages I needed would be available via the standard package repos.
Install the OS on both hosts. I installed Fedora 29, and then updated to Fedora Rawhide from there. Install the necessary packages to run ganesha. For convenience, I also installed the ceph-common package, which allowed me to test the ability for the host to contact the ceph cluster:
# dnf install nfs-ganesha nfs-ganesha-ceph nfs-ganesha-rados-grace ceph-common
Configure the hosts to be able to access the ceph cluster. This part will depend on your particular ceph deployment. You’ll need the mon addresses at a minimum, and if you’re running with encryption enabled, the keyring file.
# cat /etc/ceph/ceph.conf
mon_host = 10.101.90.22:6790,10.110.5.39:6790,10.110.139.207:6790
keyring = /etc/ceph/keyring
Here’s my keyring file. This should be readable only by root (or whatever user the ganesha daemon will be running as):
# cat /etc/ceph/keyring
key = AQCafw5cEiphIBAAkmlazVh1O72yH+mXVrO6Tw==
Note that for this tutorial I’m not being at all careful about the cephx principals in use, and am just allowing the ganesha servers to use client.admin. For a real production deployment, you’d want to have the ganesha daemons use dedicated principals that just allow them the minimal access needed to the cluster.
Once you have this set up, you should be able to run commands like this to verify that the cluster is operational:
$ rados lspools
Here we see the data and metadata pools for our CephFS.
Add a ganesha.conf file on both hosts. The nfs-ganesha-ceph package ships with a sample ceph.conf file that we can tweak for our purposes. Here’s mine with the comments stripped out:
grep -v '^\s*#' /etc/ganesha/ganesha.conf | grep -v '^\s*$'
Enable_NLM = false;
Enable_RQUOTA = false;
Protocols = 4;
RecoveryBackend = rados_cluster;
Minor_Versions = 1,2;
Dir_Chunk = 0;
NParts = 1;
Cache_Size = 1;
Protocols = 4;
Transports = TCP;
Path = /;
Pseudo = /ceph/;
Access_Type = RW;
Attr_Expiration_Time = 0;
Name = CEPH;
pool = "cephfs.a.meta";
namespace = "ganesha";
nodeid = "a";
Note that we disable anything prior to NFSv4.1 here. This is because enabling legacy versions of NFS precludes us from being able to lift the grace period early, and that is a huge win on a server restart. There is also some tuning in there to disable most of ganesha’s caching. We’re already caching aggressively in the ceph libraries, so there’s no need to duplicate it here.
The rados_cluster recovery backend stores its info in RADOS objects. It’s not generally a lot of data, but we want it to be highly available. We piggyback onto the CephFS metadata pool for this purpose, and declare a new “ganesha” namespace in it to keep it distinct from cephfs objects.
Most of the configuration is identical between the hosts, but the nodeid setting in the RADOS_KV block requires special mention. It must be a unique string for each node in the cluster. Here, we’ve set it to “a”. The second host in this cluster has this set to “b”.
It’s possible to not set this, in which case nfs-ganesha will default to using the hostname. That’s a perfectly valid choice for nodeid, as long as it is unique in the environment and is persistent across reboots.
Populate the Cluster Grace Database:
Now, we need to ensure that all of the nodes in the cluster know about each other. This is done via a RADOS object that is shared between the hosts. Ganesha uses this object to communicate its current state and needs with respect to a grace period.
The nfs-ganesha-rados-grace package contains a command-line tool for querying and manipulating this database. We’ll use that to create the DB and add both nodeids. On one of the hosts, we run:
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add a
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha add b
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha
This creates the grace database and adds both “a” and “b” to it. The last command has it dump the current state. Newly added hosts are always considered to be enforcing the grace period so they both have the E flag set.
The “cur” and “rec” values show the current and recovery epochs, which is how we keep track of which hosts are allowed to perform recovery and when.
Open firewall ports:
On Fedora, I needed to allow clients to get to the server. On both hosts:
# firewall-cmd --permanent --add-service=nfs
# firewall-cmd --reload
This opens port 2049 so that the server is accessible from other hosts.
Start the Servers:
Start up the server on each host:
# systemctl start nfs-ganesha
# systemctl enable nfs-ganesha
Created symlink /etc/systemd/system/multi-user.target.wants/nfs-ganesha.service → /usr/lib/systemd/system/nfs-ganesha.service.
If all went well, ganesha.nfsd should be running on both hosts. Now check the grace database:
$ ganesha-rados-grace -p cephfs.a.meta -n ganesha
Note that both nodes at this point have cleared their E flags, indicating that they are no longer enforcing the grace period and are now in normal operation mode.
From here, you should be able to mount either server from clients and do normal NFS operations against them.
This configuration contains an implicit assumption that if one of these servers goes down, we will restart it within 5 minutes. After 5 minutes, the Ceph MDS may cancel the session that the ganesha client held and all of the state associated with it. If the session’s caps (stateful objects) get cancelled before the rest of the cluster goes into the grace period, the server’s clients may not be able to recover all of their state.
This is one of the reasons Red Hat is focused on containerized deployments for this. Restarting a container is quite a bit cheaper and has a much smaller footprint than an entire VM.
This is just a very quick overview of how to deploy a cluster in as simple a way as possible. I don’t consider this to be a configuration that many people would want to use, but it illustrates what is possible with other, layered technologies to improve resiliency.
Since returning to Red Hat, I’ve done a fair bit of work in enabling nfs-ganesha to serve out cephfs. The operating idea here has been to push as much infrastructure into CephFS as we can to enable us to properly implement NFS semantics.
nfs-ganesha then is just another cephfs client. It hands out NFS state to the clients and holds the corresponding CephFS state on their behalf.
With a single NFS server, this works reasonably well. We’ll even have delegation support (coming soon!). The question now is — can we stand up an arbitrary set of servers that only interact with one another via ceph?
The advent of technologies like kubernetes makes it attractive to be able to stand up cookie-cutter NFS servers dynamically as the load requires. How loosely can we couple them without breaking NFS recovery semantics?
Note that in the discussion below, I’m deliberately ignoring other applications besides NFS that are interacting with the cluster during the grace period. That’s potentially a real problem, but for now I’m ignoring it for the sake of simplicity.
The basic problem:
The main issue here is that NFS server state is not persistently stored in a fine-grained way. Most of the state (opens, locks, delegations, layouts) is only stored in memory. That’s not usually a problem on a standalone NFS server.
When a NFS server crashes, it comes up with zero state held by clients. We inform them of the crash, and they then start reclaiming their state within a certain window (usually 2 lease periods — around 90-120s depending on the implementation). During the grace period, no new state can be established — clients can do reclaims only. Once the grace period has ended, the reverse is true.
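Here is a minimal Python sketch of that single-server rule. It is a toy model, not ganesha code: the function name and the "reclaim"/"new" labels are made up for illustration.

```python
def request_allowed(kind: str, in_grace: bool) -> bool:
    """Return True if a state-establishing request may proceed.

    kind is "reclaim" (a client re-establishing state it held before
    the crash) or "new" (any fresh open, lock, and so on).
    """
    if kind not in ("reclaim", "new"):
        raise ValueError(f"unknown request kind: {kind!r}")
    # During grace only reclaims are allowed; afterwards only new state is.
    return (kind == "reclaim") == in_grace
```

So a fresh open during the grace period is refused, and a late reclaim after the grace period has ended is refused as well.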
There are some corner cases here involving network partitions and multiple reboots that require us to maintain a list of clients that hold at least one open on stable storage. That’s also fairly simple to handle with a single server.
Scaling out, the simple way:
There are ways to scale out without needing to do a lot of complex coordination. If you know that an inode can only ever be served out by a single NFS server, then you can really treat it as wholly independent of other servers. There can be no conflicting state acquired via other NFS servers that way.
If you imagine a set of servers that only serve out a single export each via an individual IP address, with no hardlinks that cross export boundaries, then that should work just fine today without any external coordination.
With something like kubernetes, where you could potentially restart a cookie-cutter NFS server very quickly this might be a workable solution. You effectively allow yourself to be able to scale out as the number of exports grows.
Serving the same content in parallel:
Those configurations are potentially limiting though. We really want to be able to scale out according to the number of clients we have, not the number of exports.
That means that we need to allow multiple servers to export the same data in parallel.
Suppose we have a setup that looks like this:
+----------------+      +----------------+
|                |      |                |
|  NFS Client 1  +------+  NFS Server 1  +----------+
|                |      |                |          |
+----------------+      +----------------+    +-----+----+
                                              |          |
                                              |  CephFS  |
                                              |          |
+----------------+      +----------------+    +-----+----+
|                |      |                |          |
|  NFS Client 2  +------+  NFS Server 2  +----------+
|                |      |                |
+----------------+      +----------------+
In this configuration, we don’t float IP addresses between NFS Server 1 and 2 like we would with a traditional clustered configuration. Instead, we assign an address to each NFS server, and reconstitute the server from scratch if it crashes.
All of the stable storage info for the NFS server is stored in RADOS, though each NFS server has its own database for recovery.
Now, suppose NFS Server 1 crashes. At that point, NFS Client 1 knows what NFS state it holds. The CephFS server knows what capabilities and state NFS Server 1 held before it crashed.
Kubernetes or whatever detects that it’s down and creates a new instance of NFS Server 1. NFS Server 1 comes up knowing only the list of clients that are allowed to reclaim their state. What we need to happen at this point is for CephFS to release the state that was held by the previous NFS Server 1 incarnation, and allow NFS Client 1 to reclaim it.
Between the window of time when the CephFS caps held by the earlier NFS Server 1 are trashed, and NFS Client 1 reclaims its state, NFS Client 2 talking to NFS Server 2 could race in and set some state that conflicts with what NFS Client 1 holds.
We must prevent this from occurring. The simplest way to do that is to force all of the other NFS servers into the grace period prior to cephfs revoking the caps that the old client held.
In practical terms, this means that we need to ensure that all of the remaining NFS servers are enforcing the grace period prior to restarting a crashed NFS server in the cluster. They are all only allowed to exit the grace period once all of the servers have stopped allowing reclaim.
That also neatly handles the case where we have a clean start from zero state in the cluster. All servers will go into their grace period until clients are finished reclaiming (or all of the individual grace periods time out).
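To make the ordering concrete, here is a toy Python model of those rules. It is only a sketch: the class, its method names and the two per-node flags are assumptions for illustration, and none of it reflects how ganesha actually stores this state.

```python
class GraceCluster:
    """Toy model of the cluster-wide grace period rules (not ganesha code)."""

    def __init__(self, node_ids):
        # Clean start from zero state: every node enforces the grace
        # period and still allows its clients to reclaim.
        self.enforcing = {n: True for n in node_ids}
        self.allowing_reclaim = {n: True for n in node_ids}

    def begin_restart(self, node_id):
        """Before restarting a crashed node, force everyone into grace."""
        for n in self.enforcing:
            self.enforcing[n] = True
        # The restarted node has clients with state to reclaim.
        self.allowing_reclaim[node_id] = True

    def can_restart(self, node_id):
        """The crashed node may come back only once all others enforce."""
        return all(e for n, e in self.enforcing.items() if n != node_id)

    def reclaim_done(self, node_id):
        """This node's clients finished reclaiming (or its timer expired)."""
        self.allowing_reclaim[node_id] = False
        # Nobody exits grace until every server has stopped allowing reclaim.
        if not any(self.allowing_reclaim.values()):
            for n in self.enforcing:
                self.enforcing[n] = False
```

With two nodes "a" and "b", calling reclaim_done("a") alone leaves both nodes enforcing. The grace period is only lifted once reclaim_done("b") has been called too.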
Handling excessively tardy server restart:
It’s also possible for a NFS server to go down while clients still hold state on it, but then take so long to come back that the MDS decides to give up on it and revokes its caps anyway.
Clients of these servers must not be allowed to reclaim any state, as we cannot guarantee consistency in that situation. The simplest way to do that is to simply trash its persistent client database if the caps that it held previously were revoked outside of the grace period. That will prevent any reclaim. Those surviving clients will likely need to be restarted, but that’s really the best we can do in that situation.
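As a sketch, the decision a restarting server has to make looks something like this in Python. The function and parameter names are hypothetical. How a server would actually learn that its caps were revoked is not modeled here.

```python
def reclaimable_clients(client_db, caps_revoked, revoked_during_grace):
    """Return the set of clients allowed to reclaim after a restart.

    client_db: iterable of client identifiers from the persistent database.
    caps_revoked: True if the MDS gave up on the old instance's caps.
    revoked_during_grace: True if that happened while the rest of the
        cluster was already enforcing the grace period.
    """
    if caps_revoked and not revoked_during_grace:
        # Other clients may have acquired conflicting state in the
        # meantime, so trash the database: nobody gets to reclaim.
        return set()
    return set(client_db)
```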
Decommissioning a server instance with active clients:
If we know that we want to decommission one of the NFS servers (scaling down the cluster at times of low-load, for instance), we should be able to use migration to move its clients to other servers (support for that would need to be added to ganesha).
Alternately we could merge its IP addresses and client databases into another existing server. There are several options here. The main point is that we do need to take proactive steps to “drain” a server before decommissioning it.
So, that more or less lays out the rules for how we need various NFS servers to behave. The question now is what will ensure that we follow them.
Traditional clustering does this with stuff like corosync and pacemaker, but having to use that when we already have clustering infrastructure in ceph itself seems wrong.
The current thinking (credit to Patrick Donnelly) is to add a ceph monitor service and map for the NFS gateways. That map would be consumed by a new daemon that would manage the ganesha daemon, possibly using dbus to handle the grace period of a running instance.
Alternately, we implement this as a new shared library in ceph that ganesha would link in. This part is still being designed/determined.
I’m pretty excited by the potential of the overall design here though. I think it should offer us the ability to scale nearly linearly, at least with some common usage patterns.
Beyond this design, we may want to look at how to more effectively move clients around the cluster based on their access patterns. For instance, we may want to explore assigning a NFS server to each CephFS MDS in a multimds cluster, and leverage NFS migration to have clients access objects on the NFS server closest to the authoritative MDS for an inode.
That sort of design might be very interesting coupled with pNFS. A flexfiles layout is something of a natural fit over CephFS.
I’ve written previously about the NFSv4 change attribute and CephFS. The Linux kernel has also supported it for a long time via the i_version field in the inode structure.
Most kernel subsystems just treat this field as an opaque value. Some filesystems use it internally, primarily to track changes to directories (mostly for handling directory changes during readdir). The kernel NFS client uses that field to store the NFSv4 change attribute.
The exception is when the filesystem sets the MS_I_VERSION flag in the superblock. That flag tells the kernel that it should increment i_version on regular files when it goes to change the timestamps during a data or metadata change. Only a few filesystems in the kernel do this: xfs and btrfs always have it enabled. ext4 has it off by default but you can turn it on with the i_version mount option.
With that enabled, the kernel NFS server will also populate the NFSv4 change attribute from that value as well. This is what allows us to (more or less) eliminate timestamp granularity problems for NFSv4 clients when exporting those filesystems.
The important bit to note here is that there is a lot of inconsistency on how this field is handled in the kernel. Some filesystems have very different semantics from others. AFS for instance uses this as the data version counter which is not incremented on metadata changes. For the most part though, these differences haven’t really mattered as the users of this field are limited and are confined to the kernel.
But, that may change in the not too distant future. There is (hopefully!) a new syscall on the horizon: statx(). With that, exposing the i_version field to userland could be very useful. There are a lot of applications that would like to have a cheap and reliable indicator that something changed on an inode. Before we can expose that field however, we’ll need to sort out the inconsistencies between the implementations, and clearly define what this field should mean for applications.
Now we circle back to Ceph. The Ceph change attribute implementation exploits the fact that we don’t necessarily need to bump the counter atomically on every change. When userland ceph clients are able to cache a change to the metadata, they bump the counter in their local copy. When the change counter is queried by an application (which uses an interface very similar to statx), all of the cached copies are synched up before handing the value back to the caller.
Given that we have at least one filesystem that already benefits from looser semantics for the i_version counter, can we also relax this requirement in in-kernel filesystems to reduce the number of metadata updates that are required on a write-heavy workload?
Can we make it go faster?
A few months ago, a colleague of mine who primarily works on local filesystems asked about the feasibility of turning off the i_version counter. Apparently some of the Red Hat QA folks were concerned about an XFS performance regression when moving from RHEL6 to RHEL7. They had tracked it down to the fact that XFS had revved its on disk metadata format, and the i_version counter had been switched on universally.
The workload in this case was very small I/Os in iozone which is clearly synthetic, but it got me thinking that it might be possible to reduce the number of situations where we need to dirty the inode metadata.
Basically the idea is to set a flag in the inode when we query the i_version counter. Then, when we go to bump it, we only bump the counter if that flag is set and then we clear the flag. Then, we fix the filesystems so that, if the times didn’t change, they only log the change to disk when the counter was actually incremented.
For the vastly common case where no one has queried for it, we can reduce the number of times that the inode metadata is dirtied down to a frequency on par with the timestamp granularity.
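The idea is easy to see in a toy Python model. This is not kernel code, and the class and method names are made up. It only mirrors the "bump only if someone has looked" rule described above.

```python
class Inode:
    """Toy inode with a lazily bumped i_version (not kernel code)."""

    def __init__(self):
        self.i_version = 0
        self.queried = False  # set when someone looks at i_version

    def query_version(self):
        # Remember that the current value has been observed.
        self.queried = True
        return self.i_version

    def write(self):
        """Simulate a data change; return True if i_version was bumped."""
        if not self.queried:
            # Nobody has seen the current value, so a later query can
            # still tell that something changed since the last one.
            return False
        self.i_version += 1
        self.queried = False
        return True
```

In this model, a thousand writes with no query in between never touch the counter. A thousand writes after a single query bump it exactly once, and the next query still returns a larger value than the previous one.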
Which is neato and may really help those really small buffered I/O workloads, where fsync is only infrequently called. As a pessimal case, we can play with time and dd with 1 byte writes:
$ time dd if=/dev/zero of=/mnt/scratch/foo bs=1 count=10000000
Which is a silly workload of course, but it should give us the most dramatic results. On XFS on my fairly crappy test rig on an unpatched kernel:
real 0m13.840s
user 0m0.964s
sys 0m12.824s
After my draft patch series:
real 0m6.753s
user 0m0.617s
sys 0m6.114s
…about half the time. Again though, this is not at all a real-world workload, so I don’t think it’s sufficient to merge this based on that alone. I see some small speedup on kernel builds but nothing to get too excited about.
That said, ext4 has the i_version counter off by default and you have to specify an option to turn it on. I don’t think they’d do that unless they thought the counter were cost-prohibitive. Clearly there may be a use case there?
So, I pushed this into my kernel tree on samba.org and a few days later, I got this from the Intel kernel test robot:
Greeting,
FYI, we noticed a 43.5% improvement of vm-scalability.throughput due to commit:
commit: fa629d46c4da556a77c7b8c7760e734dd88d1f3e ("fs: only set S_VERSION when updating times if it has been queried")
43%! That certainly sounds significant. The docs for the vm-scalability testsuite say this about the test that showed the improvement:
case-msync-mt: Create a sparse file with size of $MemTotal, before creating N threads, it will preallocate and prefault 1/2 memory space with mmap using this sparse file as backing store and then the N threads will all write data there using the preallocated space. When this is done, use msync to flush change back to the file.
Is that representative of any sort of real-world workload?
Since I’ve rejoined Red Hat, I’ve been doing some work within cephfs to better understand the code. One of the projects I’ve been working on is to add a statx-like interface to libcephfs (the userland ceph client code). As part of that, I’m adding in support for file create times (also known as the “btime”), and a NFSv4 change attribute with the aim of adding support for it to nfs-ganesha.
The NFSv4 Change Attribute
Good NFS performance relies on being able to cache file data aggressively. Clients use inode metadata to tell whether they should invalidate their caches. Basically they look at certain metadata in the inode (usually via a GETATTR operation) and then invalidate the cache if it looks like it has changed.
When doing something like a write() syscall however, POSIX says only that you must restamp the modification time (mtime) of the inode with the current time. Since that action changes the metadata, the inode change time (ctime) must also be stamped with the current time.
Older versions of NFS relied heavily on watching for timestamp changes in order to tell if the cache was valid. This was always problematic though. Timestamp granularity on Linux is generally 1 millisecond, and a lot can happen within that time, which can cause the clients to end up with stale data in the file when they don’t see an apparent change to the times.
So, when NFSv4 was drafted, it mandated a new metadata attribute: the change attribute. The general idea is that it is an unsigned 64-bit number that must increase in value whenever there has been either a data or metadata change to the file. When a client queries it, and the file has changed in some fashion, then it must be larger than when the client last queried for it.
Servers have a bit of latitude on how they implement the change attribute, but the end result is always semi-opaque to the client. It can only look at the change attribute and see whether it has increased since the last time it was checked. If so, then it can infer that something has changed in the file (either data or metadata).
Note that clients are not allowed to inspect the value to see how much it has changed. The only thing they can do is look and see if it’s larger than the previous value. At that point, any data or metadata cached on that inode is suspect.
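In Python, a client-side check along those lines might look like the sketch below. The class and its callback argument are hypothetical and only model the comparison described here, not any real NFS client.

```python
class CachedInode:
    """Toy client-side cache entry keyed on the change attribute."""

    def __init__(self, change_attr, attrs):
        self.change_attr = change_attr
        self.attrs = dict(attrs)

    def revalidate(self, server_change_attr, fetch_attrs):
        """Refresh the cache if the server's value is larger.

        Returns True if the cached attributes were thrown away. Note
        that we never look at *how much* the value grew.
        """
        if server_change_attr > self.change_attr:
            self.attrs = dict(fetch_attrs())
            self.change_attr = server_change_attr
            return True
        return False
```

A jump from 1 to 7 is treated exactly like a jump from 1 to 2: the cached attributes are refetched either way.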
First a bit of background on CephFS. CephFS is quite different from other network filesystems in that parts of the data and metadata can be delegated to different clients in a very granular way. These are called capabilities (aka caps) in CephFS parlance. For instance, the owner, group and mode are all under the AUTH capability, whereas the mtime and size are under the FILE capability.
These generally come in two flavors: shared and exclusive. Shared caps allow clients to cache the existing values, and exclusive caps allow the client to change them and buffer that change until the capability is revoked.
Furthermore, the CephFS metadata server can grant different exclusive caps to different clients on the same inode, allowing them to cache changes to the metadata and data. The MDS will recall these capabilities in order to ensure that other clients see the changes that those clients may be caching, prior to satisfying a request that would provide those attributes. At that point, the clients will return their capabilities and update the server with the new metadata.
Thus it’s possible that one client can have the AUTH exclusive cap (allowing it to cache a change to the mode, for instance), and then another client may hold the FILE exclusive cap which allows it to buffer up writes to the file and the associated changes to the mtime.
By way of shorthand, we often refer to the capabilities with a capital letter, and then a lowercase letter denoting whether it’s shared or exclusive.
Adding a NFSv4 Change Attribute to CephFS
The implementation adds the change attribute to the FILE capability since writes are more common than explicit metadata changes, and it often changes in conjunction with the mtime.
But! Suppose we have one client that holds AUTH exclusive caps (Ax) and another with the necessary caps to buffer writes to the file and cache mtime changes (we’ll denote this Fx for now, but the FILE cap is actually a bit more complex).
Both clients start out seeing the same change attribute (call it 1), and the same mode on the file. The Ax client does a change of the mode, and bumps his change attribute to 2. The Fx client does a write, bumps his change attribute to 2, and then does a GETATTR to fetch the mode and change attribute. What happens now?
The MDS then recalls the Ax cap from the first client, sees that the change attribute is now at 2 and changes the value in the MDS cache. It then sends updated inode attributes to the other client (the one holding Fx caps). That client sees the change attr at 2.
But wait — he already has a value of 2. Is that a violation of the semantics?
The answer (I believe) is no. Recall that the client is not allowed to infer anything from the actual value of the change attribute. It can only check to see if it’s newer than what it had before. So, while the Fx client did have the same change attribute cached after its write, its cached inode attributes (including the mode) became invalid on the transition from 1 to 2.
IOW, it’s perfectly legitimate to “bundle” several changes under one change attribute value change, as long as you ensure that no one has queried for the change attribute in the meantime.
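The two-client example can be replayed in a few lines of Python. This is only an illustration: the dictionaries stand in for the MDS and client views of the inode, and taking the larger of the two values when the MDS syncs up is my assumption for the sketch, not a description of the real MDS code.

```python
def ax_fx_scenario():
    """Replay the Ax/Fx example; all names here are illustrative."""
    mds = {"change_attr": 1, "mode": 0o644}

    # Both clients start with the same view of the inode.
    ax = dict(mds)               # holds Ax: may cache mode changes
    fx = dict(mds)               # holds Fx: may buffer writes
    fx_mode_cached_at = fx["change_attr"]

    # The Ax client changes the mode and bumps its local change attribute.
    ax["mode"] = 0o600
    ax["change_attr"] += 1       # 1 -> 2

    # The Fx client does a write and bumps its local change attribute.
    fx["change_attr"] += 1       # 1 -> 2
    # Its cached mode predates that bump, so it can no longer be trusted.
    mode_cache_valid = fx_mode_cached_at == fx["change_attr"]

    # The Fx client issues a GETATTR: the MDS recalls Ax and syncs up.
    # Assumption: the MDS keeps the larger of the two values.
    mds["change_attr"] = max(mds["change_attr"], ax["change_attr"])
    mds["mode"] = ax["mode"]

    # The reply carries change_attr 2, the same number Fx already had.
    same_value = mds["change_attr"] == fx["change_attr"]
    fx["mode"] = mds["mode"]     # refetched because the cache was invalid

    return mode_cache_valid, same_value, fx["mode"]
```

The Fx client ends up with the new mode even though the number it gets back matches the one it already held, because its cached mode stopped being valid at its own 1 to 2 transition.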
Note that there are even more complex situations that can involve NFSv4 delegations, but the key points here are that all cached metadata becomes invalid once the change attribute changes, and that the MDS will recall (and sync up) changes to the change attribute when there are exclusive caps outstanding to another client that cover the metadata being queried.
The patchset has been through several iterations and there is currently a pull request up to add the first pile of patches. nfs-ganesha will require adding several other ceph_ll_* interfaces, but the heavy lifting is already done here. We’re hoping this will make the next CephFS release (Kraken).
The Problems with POSIX Locking
At last year’s Linux Storage and Filesystem summit, we had a lively discussion about what Linux needs in order to make it easier to implement fileservers on top of it. One of the bullet points brought up by Boaz Harrosh was the brokenness of the POSIX file locking standard. There are really two different problems:
POSIX Locks are tracked on a per-process basis such that lock requests by the same process cannot conflict with one another. Thus, they are useless for synchronization between threads within the same process. This also means that locks are not inheritable across a fork() since they are intimately tied to the process.
The bigger issue is that all POSIX locks held by a process are dropped any time the process closes any file descriptor on the inode that the process holds open. This is the bit that catches most programmers by surprise. It requires that you carefully delay closing any file descriptors held open on a file until you’re sure that all locks on that file can be released. As Jeremy Allison points out in his blog “this is not behavior that you would ever want!” (he also has a good description of how this standard came about — see the section entitled “First Implementation Past the Post”).
It also leads to the “library problem”. Occasionally libraries may need to open a file in order to read or write to it and then close it again. If a library does this while the application has the file open with locks held on it, those locks are then lost. In many cases, this is not obvious to the calling program.
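The close behavior is easy to demonstrate. The Python sketch below assumes a Linux host with a filesystem that supports POSIX locks. It uses fcntl.lockf(), which takes classic POSIX locks, and forks a child to probe whether the parent's lock is still held. The helper names are made up for the example.

```python
import fcntl
import os
import tempfile


def child_can_lock(path):
    """Fork a child that tries a non-blocking exclusive POSIX lock."""
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            code = 0  # got the lock: nobody else holds it
        except OSError:
            code = 1  # lock is held by another process
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def demo():
    """Return (lock held before the stray close, lock held after it)."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    fd1 = os.open(path, os.O_RDWR)
    fcntl.lockf(fd1, fcntl.LOCK_EX)
    held_before = not child_can_lock(path)

    # What a library might do behind the application's back:
    fd2 = os.open(path, os.O_RDONLY)
    os.close(fd2)  # drops every POSIX lock this process holds on the file

    held_after = not child_can_lock(path)
    os.close(fd1)
    os.unlink(path)
    return held_before, held_after
```

The application never touched fd1, yet after the unrelated open and close of fd2 the child is able to take the lock.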
BSD locks (via the flock() mechanism) have much saner semantics with respect to inheritance and behavior on close. Locks are tied to the open file table entry, not the process. If a process opens a file twice and tries to set exclusive locks on both, the second one will be denied. BSD locks are only released automatically when the last reference to the open file on which they were acquired is closed. The only problem is that BSD locks are whole-file locks only. They aren’t byte-range aware.
What we need is a new lock type that is a hybrid of the two. We want BSD behavior for inheritance and close, but the locks need to be byte-range locks. Furthermore, since we have a large legacy codebase of programs that use “classic” POSIX locks, these new locks need to conflict with them so that programs using the new locks will interoperate correctly with them.
A Possible Solution
I have a preliminary patchset to add a lock that behaves like this. I’m calling them “file-private” locks for now, but I’m amenable to better names if anyone has one. The current patchset adds a new set of cmd values to fcntl(2). The interface is very similar to the traditional POSIX lock interface, even down to the struct flock argument that is passed in. The main difference is that these locks behave more like BSD locks with respect to inheritance and close. The new cmd values are (in a nutshell):
F_SETLKP: Acquire or release a file-private lock.
F_SETLKPW: Acquire or release a file-private lock. If the lock isn't available then wait for it.
F_GETLKP: Test whether a file-private lock would be able to be acquired.
Whereas classic POSIX locks are “owned” by the process, these are owned by the open file. Thus, these locks will conflict with “classic” POSIX locks, even if they are taken on the same file descriptor by the same process. If however, a fd is inherited by a child process via fork then locks acquired via that same fd in both the parent and child will not conflict (since the open file table is the same).
Also, the locks would only be automatically released on the last close of that file descriptor. So, if the parent takes a lock on a fd and forks, and the child then closes that fd, then the lock will stay around until the parent closes it. By the same token, if the parent closes the fd first, then the lock will stay around until the child closes it.
With this, someone implementing a fileserver for linux that needs to manage byte range locks on behalf of a client can simply ensure that they just keep an open file for each “lockowner”.
I have an initial implementation of this code for Linux, and have opened an RFE with the Austin Group to have them consider adopting this as part of the formal POSIX standard. So far they seem amenable to the idea.
Much of the difficulty in creating a Linux-based active/active NFS server is the fact that we’ll need to upcall so much since there are so many things that are too difficult to handle in the kernel.
The existing (single-node) NFSv4 server in the kernel already has to upcall for:
- mountd/exportfs information
- ID mapping
- Client ID tracking (technically, this isn’t an upcall currently. The kernel manages files in a well-known directory via VFS interfaces. That’s pretty gross though, and we plan to switch that to an upcall soon)
Now, consider all of the information that will need to flow between nodes with a clustered NFS server. That’s at least another upcall of some sort as well, and probably more than one.
At some point, we need to ask ourselves: “Why do any of this in the kernel at all?”
Doing this in userspace smooths over a lot of the hassles. krb5 auth and ID mapping, for instance, become much simpler to handle, and we cut down on the round trips between kernel and userspace for some things. Also, if we want to do something like plug in Gluster on the backend, we no longer have to deal with multiple round trips through the kernel and back. We can just plug the userspace NFS server into the Gluster userspace libs and let ‘er rip.
Userspace NFS servers do have their own sorts of hassles as well:
- filehandles: traditionally the main reason for an in-kernel NFS server was to eliminate the need to track filehandle-to-path information. When a server gets an NFS filehandle it has to figure out which inode that refers to. Most userspace servers have relied on tracking that info in a separate database, but that approach has many problems if something else is manipulating the filesystem too. With recent Linux kernels, we have open-by-filehandle support, which obviates much of the need for this. There are apparently still some issues with this support, but I think we can probably fix those as they are identified.
- ownership/permissions: Most current userspace NFS servers always just run as root and try to fix up and enforce permissions after the fact. This is less than ideal since it’s racy and means that you have to enforce permissions in userspace. The main reason for this is that the POSIX setuid() family of syscalls works on a per-process basis. If you have a multithreaded process, then changing uids will change it on every thread. This makes it really difficult to handle the situation where you want to do something like open/create a file as a particular user. An open() is a synchronous syscall, so ideally you’d like to spawn a thread to handle that call so that the server can process other calls arriving in parallel. Unfortunately, you may need to change uids to handle the next call coming in. So, there is also a need to be able to set per-thread credentials. glibc has no support for this currently. The Samba server guys are calling the setfsuid() syscall directly to handle this, but that’s less than ideal. It would be nice to have some sort of officially blessed interface for this.
- Asynchronous IO: buffered AIO is non-existent under Linux. There have been many attempts to make that work, but none have come to fruition. The current fashion is to spawn threads that do standard synchronous I/O calls. It’s kludgey but as long as you have per-thread credentials, that approach works properly.
- recvfile(): ideally, we’d like to be able to move pages of data straight from a socket to a file. We have the reverse currently: sendfile() works and can inline pages from the pagecache directly into a skbuf fraglist. This allows a userspace server to implement zero-copy read calls. It would be nice to be able to do zero-copy writes as well by yanking pages out of a receive buffer and dropping them directly into the pagecache. Some OSes have recvfile() for that purpose. Linux has splice(), which is supposed to be a way to implement recvfile(). Unfortunately, splice() is broken in this regard: it always copies the data to the pagecache instead of replacing/moving pages into it. Fixing that would offer a nice performance boost.
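To make the sendfile() half of that last item concrete, here is a minimal sketch of how a userspace server might answer a READ-style request. It assumes Linux and Python's os.sendfile() wrapper. The function names and the socketpair standing in for a client connection are illustrative only, and nothing here is taken from an existing NFS server.

```python
import os
import socket
import tempfile


def send_file_range(sock, path, offset, count):
    """Push count bytes of path, starting at offset, into sock via sendfile()."""
    sent = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while sent < count:
            # The kernel moves data from the pagecache to the socket;
            # the bytes never pass through a userspace buffer.
            n = os.sendfile(sock.fileno(), fd, offset + sent, count - sent)
            if n == 0:  # reached end of file
                break
            sent += n
    finally:
        os.close(fd)
    return sent


def demo():
    payload = b"0123456789" * 100
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        # A connected socket pair stands in for the client connection.
        server, client = socket.socketpair()
        with server, client:
            sent = send_file_range(server, tmp.name, 10, 50)
            data = b""
            while len(data) < sent:
                data += client.recv(sent - len(data))
    return sent, data == payload[10:60]
```

There is no equally simple counterpart for the write path, which is the gap the recvfile() item describes.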
There are other problems too, of course. The TI-RPC userspace libraries are notoriously buggy when you use threads. The NFS Ganesha folks have done a bunch of work on them, but they haven’t pushed their changes upstream yet. There was talk about that at the Bake-a-Thon recently, so hopefully it will happen soon.
So far, most of my posts about this proposed design have been about teaching the VFS and underlying filesystems how to handle the state properly for NFSv4 semantics. We could, however, consider another design. We could have a clustered daemon that runs and tracks the state independently of the VFS. In other words, we could go with a more Samba/CTDB type design here.
knfsd would upcall to talk to a daemon and it would be the final arbiter. Such a daemon could be integrated with Samba/CTDB such that knfsd and Samba are aware of each other’s state. In principle, that would allow you to export the same filesystem via both knfsd and Samba/CTDB and get reasonable semantics for both. Oplocks and delegations would get broken appropriately, share mode reservations should work, and (possibly) we could make locking work in a more consistent fashion than we do today.
Of course, there’s a catch: nothing else would be aware of that state. It would all be tied up inside the CTDB database (or whatever clustered DB infrastructure we’d end up using). The result would be an NFS+SMB “appliance”. I think that’s a less desirable design than one that’s more general purpose, but it might be easier to achieve and we might be able to hammer that out quicker since we’d be able to avoid a lot of the VFS-layer work.
In the near term, we don’t really need to make this decision. Either way, we’ll still need to be able to swap in the correct set of operations to handle it, so the focus for now can be on simply abstracting out the parts of the server code that we’ll need to swap out to handle this later. It should even be possible to do this sort of design as an interim step, and then add what the VFS would need for a more general solution later.
Regardless of what method we ultimately choose, a more immediate engineering task is to abstract out the operations that we ultimately want to delegate to the lower filesystems. Since we want to eventually allow filesystems to override the default grace-period handling, we’ll need to provide some way to do that. Most likely, we’ll just add some new export_ops to allow for this (or maybe a new struct that hangs off the export ops).
Here’s a first stab at what I envision for this, mostly based on looking at where nfsd currently checks the grace period:
- nfsd_open operation: this will be the most complex of the set. It’ll need to do several things semi-atomically (at least with respect to other nfsd threads):
  - check the grace period and determine whether the file is allowed to be opened
  - (optionally) check share reservations that are being requested and that are already set, and determine whether the file is allowed to be opened
  - open the file (assuming the above checks passed)
  - set share reservation (if any, we may eventually combine that with the actual open if we ever get VFS support for share mode reservations)
  - get a lease/delegation (if needed)
  - …and hand the result back to nfsd
- an nfsd_lock operation (for F_SETLK/F_GETLK):
  - check the grace period and see if we’re allowed to set a new lock or reclaim an old one
  - acquire or release the lock
- an overridable grace_disallows_io to handle mandatory locks. This will prevent stateid-based ops from performing I/O during the grace period if necessary.
- nfsd_remove and nfsd_rename operations: to allow the blocking of metadata ops during the grace period
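As a rough model of that list (in Python rather than kernel C, purely to show the shape), the operations could be a set of default methods that a filesystem overrides. The class names, the method signatures and the per-export override below are all hypothetical, and the real API is still to be determined. Only the operation names and the NFSv4 status names come from the protocol and the list above.

```python
class NfsdStatus(Exception):
    """Carries an NFSv4 status name back to the caller."""


class DefaultOps:
    """Default behaviour: a single server-wide grace period."""

    def __init__(self, in_grace=True):
        self.in_grace = in_grace
        self.shares = {}  # path -> list of (access, deny) bitmasks granted

    def in_grace_for(self, path):
        return self.in_grace

    def check_grace(self, path, reclaim):
        grace = self.in_grace_for(path)
        if grace and not reclaim:
            raise NfsdStatus("NFS4ERR_GRACE")     # no new state during grace
        if not grace and reclaim:
            raise NfsdStatus("NFS4ERR_NO_GRACE")  # too late to reclaim

    def nfsd_open(self, path, access, deny, reclaim=False):
        self.check_grace(path, reclaim)
        for held_access, held_deny in self.shares.get(path, []):
            if access & held_deny or deny & held_access:
                raise NfsdStatus("NFS4ERR_SHARE_DENIED")
        self.shares.setdefault(path, []).append((access, deny))
        return path  # stands in for the opened file handed back to nfsd

    def nfsd_lock(self, path, start, length, reclaim=False):
        self.check_grace(path, reclaim)
        return (path, start, length)

    def grace_disallows_io(self, path):
        return self.in_grace_for(path)

    def nfsd_remove(self, path):
        self.check_grace(path, reclaim=False)

    def nfsd_rename(self, old, new):
        self.check_grace(old, reclaim=False)


class PerExportGraceOps(DefaultOps):
    """Hypothetical override: grace is decided per exported filesystem."""

    def __init__(self, exports_in_grace):
        super().__init__()
        self.exports_in_grace = set(exports_in_grace)

    def in_grace_for(self, path):
        return any(path.startswith(e + "/") for e in self.exports_in_grace)


def status_of(call, *args, **kwargs):
    """Run one operation and report its NFSv4 status name."""
    try:
        call(*args, **kwargs)
        return "NFS4_OK"
    except NfsdStatus as exc:
        return str(exc)
```

With the defaults, status_of(DefaultOps().nfsd_open, "/export/f", 1, 0) reports NFS4ERR_GRACE, while the same call with reclaim=True succeeds. The override only has to replace in_grace_for() to change the policy for every operation.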
That’s a start anyway. The actual API itself will need to be determined for this. In the near term, I plan to do some reshuffling of the code to make it easier to break this stuff out later. Once those pieces have a more sharply defined box around them it should just be a matter of turning them into a set of operations that filesystems can override in some fashion.
In my last post, I talked about POSIX lock recovery in this configuration once the cluster is up and running. The assumption there is that the lock manager (DLM in the case of GFS2) has a record of the lock that we want the client to be able to reclaim.
We do, however, also need to consider the case where the cluster is being cold-started. At that point, no node in the cluster will have any active locks. How do we know whether a client is allowed to reclaim locks at that point? To illustrate the problem, consider the following (pathological) case:
A client has mounted a filesystem from two different cluster nodes, for instance:
node1:/export mounted on /mnt1
node2:/export mounted on /mnt2
The client takes locks under both mountpoints, and then loses contact with node2. Eventually the client’s lease expires on node2 and that node releases all of the locks that the client held on it. The entire cluster is then rebooted, and when it comes back up, the client mounts both filesystems from node1 and attempts to reclaim all the locks under /mnt1 and /mnt2.
How does node1 know which locks are allowed to be reclaimed? Clearly, everything under /mnt1 should be allowed and nothing under /mnt2, but the cluster doesn’t currently store reboot recovery info at this level of detail. Right now, it only stores information about clients themselves, but nothing about the resources they hold.
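The scenario can be walked through with a toy model. This is a sketch under stated assumptions: the recovery database is a plain set of client names shared by the whole cluster, locks are whole-file, and every name in it (client-A, the file paths, simulate_cold_start) is invented for illustration.

```python
def simulate_cold_start():
    """Replay the pathological case with client-only recovery records."""
    recovery_db = set()                      # stable storage: client names only
    held = {"node1": set(), "node2": set()}  # in-memory lock state per node

    # The client mounts from both nodes and takes a lock under each mount.
    recovery_db.add("client-A")
    held["node1"].add(("client-A", "/mnt1/file"))
    held["node2"].add(("client-A", "/mnt2/file"))

    # The client loses contact with node2; its lease there expires and
    # node2 drops its locks. The client is still live on node1, so its
    # name stays in the recovery database.
    held["node2"].clear()

    # What the client legitimately held when the cluster went down.
    legitimate = held["node1"] | held["node2"]

    # Cold start: all in-memory state is gone, only recovery_db survives.
    outcome = {}
    for lock in [("client-A", "/mnt1/file"), ("client-A", "/mnt2/file")]:
        client, path = lock
        granted = client in recovery_db      # all node1 can check
        outcome[path] = (granted, lock in legitimate)
    return outcome
```

The result maps each path to (granted, legitimate). For /mnt2/file the pair comes out as (True, False): the reclaim is allowed even though the lock had already been released, and possibly handed to someone else.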
After spending some time working through different scenarios, I’ve come up with the following “rule”:
The flexibility of client configurations is directly proportional to the granularity of lock information tracked on stable storage.
In other words: the more information that we keep about active locks on stable storage, the better able we are to cope with different client configurations. In real terms, that means that we have some choices, and implications for those choices:
- We could avoid tracking any information at all on stable storage. We’d just return NFS4ERR_NO_GRACE on any lock reclaim request that didn’t match an existing lock. On cold-boot of the cluster, there would be no lock reclaim. This is very easy to implement, but clearly non-optimal. I don’t think we’ll find this acceptable.
- We could just track information about the clients themselves on stable storage. In particular, we’d just track the client name “strings” that are sent in the SETCLIENTID or EXCHANGE_ID call. This is basically what knfsd does today, and what my proposed nfsdcld daemon does as well. The problem here is that it can’t handle the situation I outlined above. In order for this to work, we’d have to insist that clients can only mount from one node in the cluster at a time.
- We could track client name strings along with the destination IP address for each call. If a SETCLIENTID call comes in then we’d track a tuple of information that includes the client name string and the server address where the call came in. With this, we could reasonably support a “floating address” cluster where clients can be moved between servers by moving the server’s address to another cluster node. This might involve a hack of some sort to fix the server’s ID string to be destination address specific for v4.1. The downside here is that this would prevent clients from being able to reclaim locks against an address other than the one where they were acquired. That means that NFSv4 replication and migration using fs_locations info would not work and that reduces flexibility. It might also be difficult or impossible to integrate this configuration with pNFS serving if that ever becomes desirable.
- We could track client name strings along with the fsid of the filesystem where the resources are held. This would involve changing when we upcall/downcall to store this information. It would have to happen whenever we acquire our first resource on a fs and when the last resource is released from a fs. With this, we could reasonably allow for floating addresses and v4 fs_locations based migration/replication provided that the clients don’t mount the same filesystem off of more than one node at a time. In other words, the scenario described earlier would be unsupportable. Worse, nothing would prevent someone from doing that (other than documentation), so there is the potential for people to shoot themselves in the foot. They may not even realize it since it would basically work until they hit a network partition + cold boot scenario.
- We could track information about every lock on stable storage. This has the most performance impact, but it allows for complete flexibility in client configurations. Clients could migrate freely between hosts using any mechanism. There would also be no NFS4ERR_GRACE returns to clients. Clients would be able to make new lock requests as soon as the cluster starts. For this, we’d probably do something a bit different than the previous mechanisms. We’d probably want to integrate this tracking with the userspace lock manager (DLM or equivalent). What I’d envision is that DLM would keep info on stable storage about what locks it holds. When the cluster is started, we’d use that info to “preload” the reclaimable locks into DLM. Then as the clients come up, they’d take ownership of them. Any that aren’t reclaimed during an initial grace period would be released. Then we’d just enable NFS serving and start letting clients reclaim those locks. This simplifies the knfsd piece, but adds quite a bit of complexity to DLM.
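To give a feel for the last option, here is a toy model of a lock manager that journals every lock and preloads them on cold start. It is only a sketch: the class stands in for DLM, whole-file locks replace byte ranges, a Python set stands in for stable storage, and all names are hypothetical.

```python
class JournalingLockManager:
    """Toy stand-in for a lock manager that records locks on stable storage."""

    def __init__(self, stable):
        self.stable = stable   # set of (client, path); survives a reboot
        self.active = {}       # path -> client currently holding the lock
        self.unclaimed = {}    # path -> client allowed to reclaim it

    @classmethod
    def cold_start(cls, stable):
        mgr = cls(stable)
        # Preload every journaled lock so that nobody else can take it
        # before its previous owner has had a chance to reclaim it.
        mgr.unclaimed = {path: client for client, path in stable}
        return mgr

    def acquire(self, client, path):
        owner = self.active.get(path) or self.unclaimed.get(path)
        if owner is not None and owner != client:
            return "NFS4ERR_DENIED"
        self.active[path] = client
        self.unclaimed.pop(path, None)  # the holder has shown up again
        self.stable.add((client, path))
        return "NFS4_OK"

    def release(self, client, path):
        if self.active.get(path) == client:
            del self.active[path]
            self.stable.discard((client, path))

    def reclaim(self, client, path):
        if self.unclaimed.get(path) != client:
            return "NFS4ERR_NO_GRACE"   # no journaled lock to take over
        del self.unclaimed[path]
        self.active[path] = client
        return "NFS4_OK"

    def end_grace(self):
        # Anything not reclaimed in time is dropped for good.
        for path, client in self.unclaimed.items():
            self.stable.discard((client, path))
        self.unclaimed.clear()
```

Replaying the earlier scenario against this model, the lock under /mnt2 was removed from the journal when the lease expired, so its reclaim fails after the cold start, while an unrelated client can take new locks right away.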
Those are the possibilities I see for doing this. While it sounds a bit tougher to do, I’m starting to warm up to the fifth option. It seems like the best option for future flexibility. It will, however, make locking performance even slower than it is today, so we’ll need to weigh carefully whether that’s reasonable.