Much of the difficulty in creating a Linux-based active/active NFS server comes from how often we’ll need to upcall, since so many things are too difficult to handle in the kernel.
The existing (single-node) NFSv4 server in the kernel already has to upcall for:
- mountd/exportfs information
- ID mapping
- Client ID tracking (technically, this isn’t an upcall currently. The kernel manages files in a well-known directory via VFS interfaces. That’s pretty gross though, and we plan to switch that to an upcall soon)
Now, consider all of the information that will need to flow between nodes with a clustered NFS server. That’s at least another upcall of some sort as well, and probably more than one.
At some point, we need to ask ourselves: “Why do any of this in the kernel at all?”
Doing this in userspace smooths over a lot of the hassles. krb5 auth and ID mapping, for instance, become much simpler to handle, and we cut down on the round trips between kernel and userspace for some things. Also, if we want to do something like plug in Gluster on the backend, we no longer have to deal with multiple round trips through the kernel and back. We can just plug the userspace NFS server into the Gluster userspace libs and let ‘er rip.
Userspace NFS servers do have their own sorts of hassles as well:
- filehandles: traditionally the main reason for an in-kernel NFS server was to eliminate the need to track filehandle-to-path information. When a server gets an NFS filehandle it has to figure out which inode that refers to. Most userspace servers have relied on tracking that info in a separate database, but that approach has many problems if something else is manipulating the filesystem too. With recent Linux kernels, we have open-by-filehandle support, which obviates much of the need for this. There are apparently still some issues with this support, but I think we can probably fix those as they are identified.
- ownership/permissions: Most current userspace NFS servers always just run as root and try to fix up and enforce permissions after the fact. This is less than ideal since it’s racy and means that you have to enforce permissions in userspace. The main reason for this is that the POSIX setuid() family of calls works on a per-process basis. If you have a multithreaded process, then changing uids will change them on every thread. This makes it really difficult to handle the situation where you want to do something like open/create a file as a particular user. An open() is a synchronous syscall, so ideally you’d like to spawn a thread to handle that call so that other calls arriving in parallel can still be processed. Unfortunately, you may need to change uids to handle the next call coming in. So, there is also a need to be able to set per-thread credentials. glibc has no support for this currently. The Samba server guys are calling the setfsuid() syscall directly to handle this, but that’s less than ideal. It would be nice to have some sort of officially blessed interface for this.
- Asynchronous IO: buffered AIO is non-existent under Linux. There have been many attempts to make that work, but none have come to fruition. The current fashion is to spawn threads that do standard synchronous I/O calls. It’s kludgey but as long as you have per-thread credentials, that approach works properly.
- recvfile(): ideally, we’d like to be able to move pages of data straight from a socket to a file. We have the reverse currently: sendfile() works and can inline pages from the pagecache directly into a skbuf fraglist. This allows a userspace server to implement zero-copy read calls. It would be nice to be able to do zero-copy writes as well by yanking pages out of a receive buffer and dropping them directly into the pagecache. Some operating systems have recvfile() for that purpose. Linux has splice(), which is supposed to be a way to implement recvfile(). Unfortunately, splice() is broken in this regard: it always copies the data to the pagecache instead of replacing/moving pages into it. Fixing that would offer up a nice performance boost.
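To make the per-thread credentials point above a bit more concrete, here is a rough sketch of the "call the syscall directly" workaround. It is only an illustration under some assumptions: it targets Linux, it invokes the raw setfsuid syscall through syscall(2), and the names `thread_setfsuid`, `thread_getfsuid`, `cred_job`, `cred_worker` and `run_with_fsuid` are all made up for this example rather than taken from Samba or any existing server.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* 32-bit ABIs expose the full-width uid variant under a different number. */
#ifdef SYS_setfsuid32
#define RAW_SETFSUID SYS_setfsuid32
#else
#define RAW_SETFSUID SYS_setfsuid
#endif

/*
 * Set the filesystem uid of the calling thread only (hypothetical helper).
 * The syscall reports no error; it returns the previous fsuid.
 */
static uid_t thread_setfsuid(uid_t uid)
{
    return (uid_t)syscall(RAW_SETFSUID, (long)uid);
}

/* Passing -1 never changes anything, so it doubles as a "get" operation. */
static uid_t thread_getfsuid(void)
{
    return thread_setfsuid((uid_t)-1);
}

/* One unit of work for a worker thread (hypothetical struct). */
struct cred_job {
    uid_t want_uid;  /* fsuid the worker should assume */
    uid_t seen_uid;  /* fsuid the worker actually ended up with */
};

static void *cred_worker(void *arg)
{
    struct cred_job *job = arg;

    thread_setfsuid(job->want_uid);
    /* Always re-read: the only way to learn whether the change took effect. */
    job->seen_uid = thread_getfsuid();
    /* A real server would do its open()/create on behalf of the client here. */
    return NULL;
}

/* Run one job on a fresh thread. Returns 0 if the thread ran to completion. */
static int run_with_fsuid(struct cred_job *job)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, cred_worker, job) != 0)
        return -1;
    return pthread_join(tid, NULL) == 0 ? 0 : -1;
}
```

A caller would fill in `want_uid`, call `run_with_fsuid()`, and then compare `seen_uid` against what it asked for. Switching to an arbitrary uid needs root (or CAP_SETUID); an unprivileged process can only "switch" to one of its own ids. A real server would also have to deal with the fsgid and supplementary groups, which this sketch ignores.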
There are other problems too, of course. The TI-RPC userspace libraries are notoriously buggy when you use threads. The NFS Ganesha folks have done a bunch of work on them, but they haven’t pushed their changes upstream yet. There was talk about that at the Bake-a-Thon recently, so hopefully it’ll happen soon.