Monthly Archives: September 2016

CephFS and the NFSv4 Change Attribute

Since I’ve rejoined Red Hat, I’ve been doing some work within cephfs to better understand the code. One of the projects I’ve been working on is to add a statx-like interface to libcephfs (the userland ceph client code). As part of that, I’m adding in support for file create times (also known as the “btime”), and an NFSv4 change attribute with the aim of adding support for it to nfs-ganesha.

The NFSv4 Change Attribute

Good NFS performance relies on being able to cache file data aggressively. Clients use inode metadata to tell whether they should invalidate their caches. Basically they look at certain metadata in the inode (usually via a GETATTR operation) and then invalidate the cache if it looks like it has changed.

When doing something like a write() syscall, however, POSIX says only that you must restamp the modification time (mtime) of the inode with the current time. Since that action changes the metadata, the inode change time (ctime) must also be stamped with the current time.

Older versions of NFS relied heavily on watching for timestamp changes in order to tell if the cache was valid. This was always problematic though. Timestamp granularity on Linux is generally on the order of 1 millisecond, and a lot can happen within that time. If several changes land within the same granule, clients see no apparent change to the times and can end up with stale data in their caches.
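To make the granularity problem concrete, here is a minimal sketch. The 1 ms granule, the `stamp` helper and the clock readings are all made up for illustration; nothing here is taken from the kernel or from any NFS client.

```python
GRANULE_NS = 1_000_000  # assumed timestamp granularity: 1 ms, in nanoseconds


def stamp(now_ns: int) -> int:
    """Round a nanosecond clock reading down to the timestamp granularity."""
    return now_ns - (now_ns % GRANULE_NS)


# Hypothetical clock readings: two writes 200 microseconds apart.
t_first_write = 5_000_300_000
t_second_write = t_first_write + 200_000

# The client caches the file after the first write and remembers its mtime.
mtime_seen_by_client = stamp(t_first_write)

# The second write restamps the mtime, but lands in the same granule.
mtime_after_second_write = stamp(t_second_write)

# A timestamp-only check concludes nothing changed, so the cache goes stale.
cache_looks_valid = mtime_seen_by_client == mtime_after_second_write
```

Here `cache_looks_valid` ends up true even though the file was written a second time, which is exactly the stale-cache case described above.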

So, when NFSv4 was drafted, it mandated a new metadata attribute: the change attribute. The general idea is that it is an unsigned 64-bit number that must increase in value whenever there has been either a data or metadata change to the file. When a client queries it and the file has changed in some fashion, it must be larger than when the client last queried for it.

Servers have a bit of latitude on how they implement the change attribute, but the end result is always semi-opaque to the client. It can only look at the change attribute and see whether it has increased since the last time it was checked. If so, then it can infer that something has changed in the file (either data or metadata).

Note that clients are not allowed to inspect the value to see how much it has changed. The only thing they can do is look and see if it’s larger than the previous value. At that point, any data or metadata cached on that inode is suspect.
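The client-side rule described here can be sketched roughly as follows. The `CachedInode` class and its fields are hypothetical names for illustration, not anything from a real NFS client. The only operation performed on the change attribute is a "did it grow?" comparison.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CachedInode:
    """Toy client-side cache entry for one file (hypothetical model)."""
    change_attr: Optional[int] = None   # last change attribute seen, if any
    data: Optional[bytes] = None        # cached file data
    attrs: dict = field(default_factory=dict)  # cached metadata

    def revalidate(self, server_change_attr: int) -> bool:
        """Return True if the cache is still usable; otherwise drop it.

        The value is treated as semi-opaque: we never look at how far it
        moved, only at whether it is larger than the one we remembered.
        """
        if self.change_attr is not None and server_change_attr <= self.change_attr:
            return True
        # Something changed (or this is the first query): everything cached
        # on this inode is suspect, so throw it away and remember the value.
        self.data = None
        self.attrs.clear()
        self.change_attr = server_change_attr
        return False
```

A first call to `revalidate(1)` records the value and reports the (empty) cache as unusable, a repeat call with `1` keeps whatever has been cached since, and a later call with any larger value drops both data and metadata.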

CephFS capabilities

First a bit of background on CephFS. CephFS is quite different from other network filesystems in that parts of the data and metadata can be delegated to different clients in a very granular way. These are called capabilities (aka caps) in CephFS parlance. For instance, the owner, group and mode are all under the AUTH capability, whereas the mtime and size are under the FILE capability.

These generally come in two flavors: shared and exclusive. Shared caps allow clients to cache the existing values, and exclusive caps allow the client to change them and buffer that change until the capability is revoked.

Furthermore, the CephFS metadata server can grant different exclusive caps to different clients on the same inode, allowing them to cache changes to the metadata and data. The MDS will recall these capabilities in order to ensure that other clients see the changes that those clients may be caching, prior to satisfying a request that would provide those attributes. At that point, the clients will return their capabilities and update the server with the new metadata.

Thus it’s possible that one client can have the AUTH exclusive cap (allowing it to cache a change to the mode, for instance), and then another client may hold the FILE exclusive cap which allows it to buffer up writes to the file and the associated changes to the mtime.
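As a rough illustration of that arrangement, here is a toy model of an MDS handing out exclusive caps and recalling them on demand. The class names, the `CAP_FOR_ATTR` table and the method names are all invented for this sketch and bear no relation to the real MDS code; only the AUTH/FILE split of attributes comes from the description above.

```python
# Which capability covers which attribute (per the description above).
CAP_FOR_ATTR = {"owner": "AUTH", "group": "AUTH", "mode": "AUTH",
                "mtime": "FILE", "size": "FILE"}


class Client:
    """Toy client that buffers attribute changes under exclusive caps."""
    def __init__(self, name):
        self.name = name
        self.dirty = {}  # cap name -> {attribute: buffered value}

    def set_attr(self, attr, value):
        self.dirty.setdefault(CAP_FOR_ATTR[attr], {})[attr] = value

    def flush(self, cap):
        """Hand back (and forget) everything buffered under one cap."""
        return self.dirty.pop(cap, {})


class Mds:
    """Toy metadata server for a single inode."""
    def __init__(self):
        self.attrs = {"owner": 0, "group": 0, "mode": 0o644,
                      "mtime": 0, "size": 0}
        self.exclusive = {}  # cap name -> client holding it exclusively

    def grant_exclusive(self, cap, client):
        holder = self.exclusive.get(cap)
        if holder is not None and holder is not client:
            self.recall(cap)
        self.exclusive[cap] = client

    def recall(self, cap):
        """Revoke a cap and fold the holder's buffered changes into the MDS copy."""
        client = self.exclusive.pop(cap, None)
        if client is not None:
            self.attrs.update(client.flush(cap))

    def getattr(self, requester, names):
        # Recall only the caps that cover the attributes being asked for.
        for name in names:
            holder = self.exclusive.get(CAP_FOR_ATTR[name])
            if holder is not None and holder is not requester:
                self.recall(CAP_FOR_ATTR[name])
        return {name: self.attrs[name] for name in names}


mds = Mds()
auth_client, file_client = Client("a"), Client("f")
mds.grant_exclusive("AUTH", auth_client)
mds.grant_exclusive("FILE", file_client)
auth_client.set_attr("mode", 0o600)   # buffered under AUTH exclusive
file_client.set_attr("mtime", 42)     # buffered under FILE exclusive
seen = mds.getattr(file_client, ["mode"])  # forces a recall of AUTH only
```

In this run the request for the mode recalls only the AUTH cap, so the buffered mode change reaches the MDS while the other client keeps buffering its mtime under FILE.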

By way of shorthand, we often refer to the capabilities with a capital letter, and then a lowercase letter denoting whether it’s shared or exclusive.

Adding an NFSv4 Change Attribute to CephFS

The implementation adds the change attribute to the FILE capability since writes are more common than explicit metadata changes, and it often changes in conjunction with the mtime.

But! Suppose we have one client that holds AUTH exclusive caps (Ax) and another with the necessary caps to buffer writes to the file and cache mtime changes (we’ll denote this Fx for now, but the FILE cap is actually a bit more complex).

Both clients start out seeing the same change attribute (call it 1), and the same mode on the file. The Ax client does a change of the mode, and bumps his change attribute to 2. The Fx client does a write, bumps his change attribute to 2, and then does a GETATTR to fetch the mode and change attribute. What happens now?

The MDS then recalls the Ax cap from the first client, sees that the change attribute is now at 2 and changes the value in the MDS cache. It then sends updated inode attributes to the other client (the one holding Fx caps). That client sees the change attr at 2.

But wait — he already has a value of 2. Is that a violation of the semantics?

The answer (I believe) is no. Recall that the client is not allowed to infer anything from the actual value of the change attribute. It can only check to see if it’s newer than what it had before. So, while the Fx client did have the same change attribute cached after its write, its cached inode attributes (including the mode) became invalid on the transition from 1 to 2.

IOW, it’s perfectly legitimate to “bundle” several changes under one change attribute value change, as long as you ensure that no one has queried for the change attribute in the meantime.
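The Ax/Fx scenario can be replayed with a few lines of toy code. This is only a sketch: the `Inode` record and the `recall_and_merge` helper are hypothetical, and taking the maximum of the two change attribute values on recall is an assumption made for this model rather than a statement about what the MDS actually does.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """One party's view of the inode (hypothetical, heavily simplified)."""
    change_attr: int
    mode: int


def recall_and_merge(mds: Inode, flushed: Inode) -> None:
    """Fold a recalled Ax client's state into the MDS copy."""
    mds.mode = flushed.mode
    # Assumption for this sketch: the MDS keeps the larger value.
    mds.change_attr = max(mds.change_attr, flushed.change_attr)


# Everyone starts out with change attribute 1 and the same mode.
mds = Inode(change_attr=1, mode=0o644)
ax_client = Inode(mds.change_attr, mds.mode)
fx_client = Inode(mds.change_attr, mds.mode)

# The Ax client changes the mode and bumps its change attribute to 2.
ax_client.mode = 0o600
ax_client.change_attr += 1

# The Fx client writes and bumps its own copy to 2. Its cached attributes
# were only known good as of the value it held before the write.
last_validated = fx_client.change_attr
fx_client.change_attr += 1

# The Fx client issues a GETATTR: the MDS recalls Ax, then replies.
recall_and_merge(mds, ax_client)
reply = Inode(mds.change_attr, mds.mode)

# The reply is newer than the last validated value, so the mode is refetched,
# even though the number happens to match what the Fx client already holds.
must_refresh = reply.change_attr > last_validated
if must_refresh:
    fx_client.mode = reply.mode
fx_client.change_attr = max(fx_client.change_attr, reply.change_attr)
```

Both changes end up bundled under the single transition from 1 to 2, and the Fx client still picks up the new mode because its cached attributes stopped being trustworthy at that transition.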

Note that there are even more complex situations that can involve NFSv4 delegations, but the key points here are that all cached metadata becomes invalid once the change attribute changes, and that the MDS will recall (and sync up) changes to the change attribute when there are exclusive caps outstanding to another client that cover the metadata being queried.

Current Progress

The patchset has been through several iterations and there is currently a pull request up to add the first pile of patches. nfs-ganesha will require adding several other ceph_ll_* interfaces, but the heavy lifting is already done here. We’re hoping this will make the next CephFS release (Kraken).