The Problems with POSIX Locking
At last year’s Linux Storage and Filesystem summit, we had a lively discussion about what Linux needs in order to make it easier to implement fileservers on top of it. One of the bullet points brought up by Boaz Harrosh was the brokenness of the POSIX file locking standard. There are really two different problems:
POSIX locks are tracked on a per-process basis, so lock requests by the same process can never conflict with one another. Thus, they are useless for synchronization between threads within the same process. This also means that locks are not inherited across a fork(), since they are intimately tied to the process.
The bigger issue is that all POSIX locks a process holds on a file are dropped any time the process closes any file descriptor that refers to that file's inode, even a descriptor other than the one the locks were taken on. This is the bit that catches most programmers by surprise. It requires that you carefully delay closing any file descriptors held open on a file until you’re sure that all locks on that file can be released. As Jeremy Allison points out in his blog, “this is not behavior that you would ever want!” (He also has a good description of how this standard came about; see the section entitled “First Implementation Past the Post”.)
It also leads to the “library problem”. Occasionally libraries may need to open a file in order to read or write to it and then close it again. If a library does this while the application has the file open with locks held on it, those locks are then lost. In many cases, this is not obvious to the calling program.
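The close behavior and the library problem can be reproduced in a few lines. The following is a minimal sketch, assuming a Linux system; the helper names (locked_by_someone_else, library_peek, posix_close_demo) are hypothetical and exist only for illustration. Because a process never conflicts with its own POSIX locks, the sketch forks a short-lived child to ask whether the lock is still held.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask from a separate process whether anyone else holds a lock on the file.
 * A child is needed because F_GETLK never reports the caller's own locks.
 * Returns 1 if locked, 0 if not, -1 on error. */
static int locked_by_someone_else(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = {0};
        int fd = open(path, O_RDWR);
        if (fd < 0)
            _exit(2);
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;   /* l_start = 0, l_len = 0: whole file */
        if (fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/* Hypothetical stand-in for a library routine that briefly opens the file. */
static void library_peek(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0)
        close(fd);   /* this close drops the caller's POSIX locks on the file */
}

/* Take a whole-file POSIX write lock, call the "library", and record whether
 * another process sees the lock before and after. Returns 0 on success. */
static int posix_close_demo(const char *path, int *before, int *after)
{
    struct flock fl = {0};
    assert(path && before && after);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        close(fd);
        return -1;
    }
    *before = locked_by_someone_else(path);
    library_peek(path);              /* fd itself is still open here */
    *after = locked_by_someone_else(path);
    close(fd);
    return 0;
}
```

Calling posix_close_demo() on a scratch file should report the lock as held before the library call and gone afterwards, even though the descriptor the lock was taken on was never closed.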
BSD locks (via the flock() mechanism) have much saner semantics with respect to inheritance and behavior on close. Locks are tied to the open file table entry, not the process. If a process opens a file twice and tries to set exclusive locks on both, the second one will be denied. BSD locks are only released automatically when the last reference to the open file on which they were acquired is closed. The only problem is that BSD locks are whole-file locks only. They aren’t byte-range aware.
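For comparison, here is a small sketch of the BSD behavior described above, assuming Linux flock(2) semantics. The struct and function names are hypothetical. It opens the same file twice, shows the second exclusive lock being denied, and shows that closing a dup()ed descriptor does not release the lock, while closing the last reference does.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

struct flock_demo {
    int second_open_denied;       /* 1 if the second open could not lock */
    int denied_after_dup_closed;  /* 1 if still denied after closing a dup */
    int granted_after_last_close; /* 1 if granted once the owner is closed */
};

/* Returns 1 if a non-blocking exclusive flock() on fd is refused. */
static int excl_lock_denied(int fd)
{
    return flock(fd, LOCK_EX | LOCK_NB) != 0 && errno == EWOULDBLOCK;
}

static int flock_demo_run(const char *path, struct flock_demo *out)
{
    assert(path && out);
    int fd1 = open(path, O_RDWR | O_CREAT, 0600);
    int fd2 = open(path, O_RDWR);   /* a second, independent open file */
    if (fd1 < 0 || fd2 < 0)
        return -1;
    if (flock(fd1, LOCK_EX | LOCK_NB) != 0)
        return -1;

    /* Same process, different open file table entry: this conflicts. */
    out->second_open_denied = excl_lock_denied(fd2);

    /* A dup shares fd1's open file, so closing it is not the last close. */
    int fd3 = dup(fd1);
    if (fd3 < 0)
        return -1;
    close(fd3);
    out->denied_after_dup_closed = excl_lock_denied(fd2);

    /* Closing the last reference releases the lock automatically. */
    close(fd1);
    out->granted_after_last_close = (flock(fd2, LOCK_EX | LOCK_NB) == 0);
    close(fd2);
    return 0;
}
```

Note that there is no way to pass a byte range to flock(); the lock always covers the whole file, which is the limitation the rest of this post is about.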
What we need is a new lock type that is a hybrid of the two. We want BSD behavior for inheritance and close, but the locks need to be byte-range locks. Furthermore, since we have a large legacy codebase of programs that use “classic” POSIX locks, these new locks need to conflict with them so that programs using the new locks will interoperate correctly with them.
A Possible Solution
I have a preliminary patchset to add a lock that behaves like this. I’m calling them “file-private” locks for now, but I’m amenable to better names if anyone has one. The current patchset adds a new set of cmd values to fcntl(2). The interface is very similar to the traditional POSIX lock interface, even down to the struct flock argument that is passed in. The main difference is that these locks behave more like BSD locks with respect to inheritance and close. The new cmd values are (in a nutshell):
F_SETLKP: Acquire or release a file-private lock. If the lock isn't available, fail immediately rather than wait.
F_SETLKPW: Acquire or release a file-private lock. If the lock isn't available, wait for it.
F_GETLKP: Test whether a file-private lock could be acquired.
Whereas classic POSIX locks are “owned” by the process, these are owned by the open file. Thus, these locks will conflict with “classic” POSIX locks, even if they are taken on the same file descriptor by the same process. If, however, an fd is inherited by a child process via fork(), then locks acquired via that same fd in both the parent and child will not conflict (since the open file table entry is the same).
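The F_SETLKP, F_SETLKPW and F_GETLKP names come from the preliminary patchset and do not appear in released headers. As an illustration only, the sketch below assumes a Linux kernel (3.15 or later) and a glibc that expose the interface under the names it was later merged with, F_OFD_SETLK, F_OFD_SETLKW and F_OFD_GETLK; the semantics shown are the ones described above. The helper and struct names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for the F_OFD_* constants in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try a non-blocking write lock on bytes [start, start + len).
 * cmd is F_SETLK (classic) or F_OFD_SETLK. Returns 0 or the errno. */
static int try_write_lock(int fd, int cmd, off_t start, off_t len)
{
    struct flock fl = {0};   /* l_pid stays 0, as the F_OFD_* commands require */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, cmd, &fl) == 0 ? 0 : errno;
}

struct ofd_demo {
    int first;                /* lock bytes 0-99 through fd1 */
    int overlap_other_open;   /* bytes 50-149 through a second open */
    int disjoint_other_open;  /* bytes 100-199 through a second open */
    int classic_same_fd;      /* classic POSIX lock, bytes 0-99, on fd1 */
};

static int ofd_conflict_demo(const char *path, struct ofd_demo *out)
{
    assert(path && out);
    int fd1 = open(path, O_RDWR | O_CREAT, 0600);
    int fd2 = open(path, O_RDWR);
    if (fd1 < 0 || fd2 < 0)
        return -1;

    out->first = try_write_lock(fd1, F_OFD_SETLK, 0, 100);
    /* Same process, different open file: overlapping ranges conflict. */
    out->overlap_other_open = try_write_lock(fd2, F_OFD_SETLK, 50, 100);
    /* Byte-range aware: a disjoint range is granted. */
    out->disjoint_other_open = try_write_lock(fd2, F_OFD_SETLK, 100, 100);
    /* A classic lock conflicts even on the same fd in the same process. */
    out->classic_same_fd = try_write_lock(fd1, F_SETLK, 0, 100);

    close(fd2);
    close(fd1);
    return 0;
}
```

On a system with this support, the first and third requests should return 0 and the second and fourth should fail with EAGAIN (POSIX also permits EACCES for the classic lock).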
Also, the locks would only be automatically released on the last close of the open file, that is, when the last file descriptor referring to it is closed. So, if the parent takes a lock on a fd and forks, and the child then closes that fd, then the lock will stay around until the parent closes it. By the same token, if the parent closes the fd first, then the lock will stay around until the child closes it.
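Both fork scenarios can be sketched as follows. As before, this assumes a Linux kernel and glibc that provide the later-merged F_OFD_SETLK and F_OFD_GETLK commands in place of the patchset's F_SETLKP and F_GETLKP, and the function and struct names are hypothetical. A pipe is used only to keep the second child alive until the parent has closed its descriptor.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for the F_OFD_* constants in glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Probe through a fresh open whether bytes 0-9 are locked by another
 * open file. Returns 1 if locked, 0 if not, -1 on error. */
static int range_is_locked(const char *path)
{
    struct flock fl = {0};
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_len = 10;
    int rc = fcntl(fd, F_OFD_GETLK, &fl);
    close(fd);
    return rc < 0 ? -1 : fl.l_type != F_UNLCK;
}

struct fork_demo {
    int after_child_close;   /* child closed its copy, parent still open */
    int after_parent_close;  /* parent closed, a child still holds a copy */
    int after_last_close;    /* every copy closed */
};

static int ofd_fork_demo(const char *path, struct fork_demo *out)
{
    struct flock fl = {0};
    int pfd[2];
    char c;
    assert(path && out);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_len = 10;
    if (fcntl(fd, F_OFD_SETLK, &fl) < 0)
        return -1;

    /* Case 1: the child closes the inherited fd; the lock must survive. */
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fd);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    out->after_child_close = range_is_locked(path);

    /* Case 2: the parent closes first while a child keeps its copy open. */
    if (pipe(pfd) < 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(pfd[1]);
        if (read(pfd[0], &c, 1) < 0)   /* blocks until the parent closes pfd[1] */
            _exit(1);
        _exit(0);                      /* exiting closes the last reference */
    }
    close(pfd[0]);
    close(fd);
    out->after_parent_close = range_is_locked(path);
    close(pfd[1]);                     /* let the child exit */
    waitpid(pid, NULL, 0);
    out->after_last_close = range_is_locked(path);
    return 0;
}
```

The lock should be reported as held after either side closes its copy alone, and as released only once both parent and child have let go of the open file.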
With this, someone implementing a fileserver for Linux that needs to manage byte-range locks on behalf of a client can simply keep one open file for each “lockowner”.
I have an initial implementation of this code for Linux, and have opened an RFE with the Austin Group to have them consider adopting this as part of the formal POSIX standard. So far they seem amenable to the idea.