CephFS and the NFSv4 Change Attribute

Since rejoining Red Hat, I’ve been doing some work within cephfs to better understand the code. One of the projects I’ve been working on is to add a statx-like interface to libcephfs (the userland ceph client code). As part of that, I’m adding support for file create times (also known as the “btime”) and an NFSv4 change attribute, with the aim of adding support for it to nfs-ganesha.

The NFSv4 Change Attribute

Good NFS performance relies on being able to cache file data aggressively. Clients use inode metadata to tell whether they should invalidate their caches. Basically they look at certain metadata in the inode (usually via a GETATTR operation) and then invalidate the cache if it looks like it has changed.

When doing something like a write() syscall however, POSIX says only that you must restamp the modification time (mtime) of the inode with the current time. Since that action changes the metadata, the inode change time (ctime) must also be stamped with the current time.

Older versions of NFS relied heavily on watching for timestamp changes in order to tell if the cache was valid. This was always problematic though. Timestamp granularity on Linux is generally 1 millisecond, and a lot can happen within that time, which can cause the clients to end up with stale data in the file when they don’t see an apparent change to the times.
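To make the granularity problem concrete, here is a minimal Python sketch. The `File` class and `client_cache_is_valid` helper are hypothetical stand-ins, not real NFS or kernel code, and the 1 ms tick is simply the figure quoted above. Two writes land within the same tick, so a client that compares mtimes misses the second one.

```python
class File:
    """Toy server-side file whose mtime only advances in coarse ticks (assumption)."""

    TICK_NS = 1_000_000  # 1 ms granularity, the figure quoted in the text

    def __init__(self):
        self.data = b""
        self.mtime_ns = 0

    def write(self, data, now_ns):
        self.data = data
        # The stored timestamp is truncated to a whole tick.
        self.mtime_ns = (now_ns // self.TICK_NS) * self.TICK_NS


def client_cache_is_valid(cached_mtime_ns, server_mtime_ns):
    # Timestamp-based revalidation: trust the cache if mtime looks unchanged.
    return cached_mtime_ns == server_mtime_ns


f = File()
f.write(b"first", now_ns=5_000_100)             # first write
cached_data, cached_mtime = f.data, f.mtime_ns  # client reads and caches

f.write(b"second", now_ns=5_000_900)            # second write, same 1 ms tick

# The client believes its cache is fine even though the data differs.
stale = client_cache_is_valid(cached_mtime, f.mtime_ns) and cached_data != f.data
print("client is serving stale data:", stale)
```

The second write changes the file contents but not the visible mtime, which is exactly the case a timestamp-only check cannot see.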

So, when NFSv4 was drafted, it mandated a new metadata attribute: the change attribute. The general idea is that it is an unsigned 64-bit number that must increase in value whenever there has been either a data or metadata change to the file. When a client queries it and the file has changed in some fashion, the value must be larger than when the client last queried for it.

Servers have a bit of latitude on how they implement the change attribute, but the end result is always semi-opaque to the client. It can only look at the change attribute and see whether it has increased since the last time it was checked. If so, then it can infer that something has changed in the file (either data or metadata).

Note that clients are not allowed to inspect the value to see how much it has changed. The only thing they can do is look and see if it’s larger than the previous value. At that point, any data or metadata cached on that inode is suspect.
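As a rough illustration of that rule, the sketch below shows how a client might revalidate its cache. `NfsClientInode` is a made-up name for this example and not any real client's data structure. It compares with `!=` rather than `>`, on the assumption that the server only ever increases the value, so "different" and "larger" coincide here. The size of the jump is deliberately ignored.

```python
class NfsClientInode:
    """Toy client-side cache for one file (hypothetical, for illustration only)."""

    def __init__(self):
        self.change_attr = None   # last change attribute seen from the server
        self.cached_attrs = None  # cached metadata, e.g. {"mode": 0o644}
        self.cached_data = None   # cached file contents

    def revalidate(self, server_change_attr):
        """Return True if the cache survived, False if it had to be dropped."""
        if self.change_attr is not None and server_change_attr == self.change_attr:
            return True
        # Any difference means "something changed". How far the value moved
        # carries no meaning, so a jump of 1 and a jump of 6 are handled alike.
        self.change_attr = server_change_attr
        self.cached_attrs = None
        self.cached_data = None
        return False


inode = NfsClientInode()
inode.revalidate(1)                    # first GETATTR, nothing cached yet
inode.cached_attrs = {"mode": 0o644}
inode.cached_data = b"hello"

unchanged = inode.revalidate(1)        # same value: cache is kept
after_jump = inode.revalidate(7)       # different value: everything is suspect
print(unchanged, after_jump, inode.cached_attrs, inode.cached_data)
```

Note that both the data and the metadata are dropped together, since the attribute does not say which of the two changed.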

CephFS capabilities

First a bit of background on CephFS. CephFS is quite different from other network filesystems in that parts of the data and metadata can be delegated to different clients in a very granular way. These are called capabilities (aka caps) in CephFS parlance. For instance, the owner, group and mode are all under the AUTH capability, whereas the mtime and size are under the FILE capability.

These generally come in two flavors: shared and exclusive. Shared caps allow clients to cache the existing values, and exclusive caps allow the client to change those values and buffer the change until the capability is revoked.

Furthermore, the CephFS metadata server can grant different exclusive caps to different clients on the same inode, allowing them to cache changes to the metadata and data. The MDS will recall these capabilities in order to ensure that other clients see the changes that those clients may be caching, prior to satisfying a request that would provide those attributes. At that point, the clients will return their capabilities and update the server with the new metadata.

Thus it’s possible that one client can have the AUTH exclusive cap (allowing it to cache a change to the mode, for instance), and then another client may hold the FILE exclusive cap which allows it to buffer up writes to the file and the associated changes to the mtime.

By way of shorthand, we often refer to the capabilities with a capital letter, and then a lowercase letter denoting whether it’s shared or exclusive. For example, Ax is the exclusive AUTH cap and Fs is the shared FILE cap.
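The naming scheme can be sketched in a few lines of Python. This is only a toy model: the enum names and the `ATTRS_BY_GROUP` table are invented for this example, and it lists just the two capability groups and the attributes mentioned in this post. The real FILE cap carries more bits than shared and exclusive.

```python
from enum import Enum


class CapGroup(Enum):
    AUTH = "A"
    FILE = "F"


class CapMode(Enum):
    SHARED = "s"     # may cache the existing values
    EXCLUSIVE = "x"  # may change the values and buffer the change


# Which attributes each group covers, as described in the text above.
ATTRS_BY_GROUP = {
    CapGroup.AUTH: {"owner", "group", "mode"},
    CapGroup.FILE: {"mtime", "size"},
}


def shorthand(group, mode):
    """Capital letter for the group, lowercase letter for shared/exclusive."""
    return group.value + mode.value


def group_covering(attr):
    """Find the capability group that covers a given attribute."""
    for group, attrs in ATTRS_BY_GROUP.items():
        if attr in attrs:
            return group
    raise KeyError(attr)


ax = shorthand(CapGroup.AUTH, CapMode.EXCLUSIVE)
fs = shorthand(CapGroup.FILE, CapMode.SHARED)
print(ax, fs, group_covering("mode").name, group_covering("mtime").name)
```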

Adding an NFSv4 Change Attribute to CephFS

The implementation adds the change attribute to the FILE capability since writes are more common than explicit metadata changes, and it often changes in conjunction with the mtime.

But! Suppose we have one client that holds AUTH exclusive caps (Ax) and another with the necessary caps to buffer writes to the file and cache mtime changes (we’ll denote this Fx for now, but the FILE cap is actually a bit more complex).

Both clients start out seeing the same change attribute (call it 1), and the same mode on the file. The Ax client does a change of the mode, and bumps its change attribute to 2. The Fx client does a write, bumps its change attribute to 2, and then does a GETATTR to fetch the mode and change attribute. What happens now?

The MDS then recalls the Ax cap from the first client, sees that the change attribute is now at 2 and changes the value in the MDS cache. It then sends updated inode attributes to the other client (the one holding Fx caps). That client sees the change attr at 2.

But wait: it already has a value of 2. Is that a violation of the semantics?

The answer (I believe) is no. Recall that the client is not allowed to infer anything from the actual value of the change attribute. It can only check to see if it’s newer than what it had before. So, while the Fx client did have the same change attribute cached after its write, its cached inode attributes (including the mode) became invalid on the transition from 1 to 2.

IOW, it’s perfectly legitimate to “bundle” several changes under one change attribute value change, as long as you ensure that no one has queried for the change attribute in the meantime.
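The scenario above can be walked through with a small simulation. Everything here is a simplified model I made up for illustration (`CephClient`, `Mds`, the `attrs_trusted` flag, and merging with `max()` are all assumptions, not the actual libcephfs or MDS code). It only covers the happy path described in this section, where the Fx client refetches the mode after bumping its own change attribute.

```python
class CephClient:
    """Toy CephFS client holding one exclusive cap (hypothetical model)."""

    def __init__(self, change_attr, mode):
        self.change_attr = change_attr
        self.mode = mode
        self.attrs_trusted = True  # may cached attrs still be served?

    def chmod(self, mode):
        # Buffered under Ax: the MDS is not told yet.
        self.mode = mode
        self.change_attr += 1

    def write(self):
        # Buffered under Fx. The bump makes all cached attrs suspect.
        self.change_attr += 1
        self.attrs_trusted = False


class Mds:
    def __init__(self, change_attr, mode):
        self.change_attr = change_attr
        self.mode = mode

    def getattr(self, requester, ax_holder):
        # Recall Ax first: the holder flushes its mode and change attribute.
        self.mode = ax_holder.mode
        self.change_attr = max(self.change_attr, ax_holder.change_attr)
        # Then send the updated attributes to the requester.
        requester.mode = self.mode
        requester.change_attr = max(requester.change_attr, self.change_attr)
        requester.attrs_trusted = True
        return requester.change_attr, requester.mode


mds = Mds(change_attr=1, mode=0o644)
ax = CephClient(change_attr=1, mode=0o644)
fx = CephClient(change_attr=1, mode=0o644)

observed = [(fx.change_attr, fx.mode)]  # what the Fx side saw at the start

ax.chmod(0o600)   # Ax client: change attribute 1 -> 2
fx.write()        # Fx client: change attribute 1 -> 2
trusted_after_write = fx.attrs_trusted

observed.append(mds.getattr(fx, ax))    # GETATTR forces the Ax recall

for change_attr, mode in observed:
    print(change_attr, oct(mode))
```

In this model the Fx side observes change attribute 1 with the old mode and change attribute 2 with the new mode. Both the write and the chmod end up bundled under the single transition from 1 to 2, and the old mode is never served alongside the value 2 because the cached attributes were already distrusted after the write.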

Note that there are even more complex situations that can involve NFSv4 delegations, but the key points here are that all cached metadata becomes invalid once the change attribute changes, and that the MDS will recall (and sync up) changes to the change attribute when there are exclusive caps outstanding to another client that cover the metadata being queried.

Current Progress

The patchset has been through several iterations and there is currently a pull request up to add the first pile of patches. nfs-ganesha will require adding several other ceph_ll_* interfaces, but the heavy lifting is already done here. We’re hoping this will make the next CephFS release (Kraken).


10 thoughts on “CephFS and the NFSv4 Change Attribute”

  1. Frank Filz

    How do you guarantee, though, that the client involved in the FILE change doesn’t query JUST the attributes covered by its change, and thus not necessarily trigger the mds realizing the other attributes need to be updated also?

    Or is it covered because the ceph client with Ax actually needs to get Fx to update the change attr?

    As long as we can guarantee that when the NFS clients see change 2, they have also got all attribute changes that ceph is associating with change 2, then I guess we’re ok. If clients can see change 2 based JUST on what one of the two ceph clients (the one with Ax or the one with Fx) have cached but not hardened then I see a problem.

    I also note that NFS clients tend to ask for attributes after any operation that affects the change attribute.

  2. Jeff Layton Post author

    Good questions.

    Sure, Fx (and Fs) clients can always fetch their change_attr out of the cache without consulting the MDS. But…once the change_attr is bumped once from where it was when the client last queried for it, the NFS client can no longer trust any attributes in its cache. This is true even when the change is due to that client’s own activity.

    So, once the change_attr is incremented, the client has to refetch the mode (for instance) from the MDS anyway, at which point Ax caps get recalled and the change_attr is also synched up.

    The Ax client doesn’t actually need the Fx to update the change_attr. But, once anyone without any FILE caps tries to query for it, the MDS must recall any outstanding Fx caps.

    Now all of that said, I think I’ve worked through the implications of this, and that the design is sound, but I’m far from sure. If you can find any scenarios where this breaks down, I’d be very interested to hear them!

  3. Frank Filz

    Yea, ok, so once a client actually acts on seeing change 2, it forces all the ceph clients to sync up BEFORE ceph responds to the NFS client’s GETATTR prompted by change 2, which means a whole set of changes can safely be part of change 2.

    Make sure we document what the guarantees are in case there is anything else that builds on this (clearly ceph itself does not depend on the change attribute at all).

  4. Greg

    I definitely buy your argument that these semantics support NFS. If every time the change attribute increments, the (ceph-external) client does a full getattr, then CephFS will do a full sync and see all changes.

    I’m just not convinced that’s the semantic we want to support. It seems like once statx exists, NFS et al will start using it and might incur fewer CephFS-internal syncs. Non-NFS clients might want access to the change attr but be aware of how caps work (eg, the samba vfs).

    It doesn’t seem like it’s *hard* to detect that multiple clients have bumped the attribute and just increment it for each of them, right?
    I’m also a little concerned about having this inode field that isn’t covered in the traditional way by our caps (in that it can be bumped by a client holding any exclusive cap, not just one of them). I think it’s the right way to go, but it makes me very cautious about the transparent nature of simultaneous increments.

  5. Jeff Layton Post author

    None of this is dependent on any sort of NFS client or server behavior. AFAICT, this is entirely consistent no matter how clients access the attributes. I don’t believe it’s possible that userland can end up not seeing a newer change attr once there have been changes to the file, but if you can see a way to break it, please do let me know.

    FWIW, I’m now contemplating whether we can do something similar in the kernel for the i_version counter. Its only consumer (currently) is knfsd, and we have some reports that enabling the i_version counter can really harm write-heavy workloads as every write entails updating the inode, even when the times haven’t changed recognizably. Allowing a single i_version bump to represent several changes to the file could really help there, particularly when the fs is not even exported.

  6. Jeff Layton Post author

    Actually, that last comment is not 100% correct. The change attr is not reliable if you only check it by itself and find that it has not changed. It’s possible in that situation that we may not see a change that is still being cached on the other clients.

    That should be fixable though — what we’d probably want to do is simply ensure that when any client queries the change attr that we revoke all exclusive caps, so that all of the change_attrs are synced up. Given that we generally are going to do that anyway from ganesha due to the fact that it’s going to query for all attrs, then I don’t think we’d lose anything there. I’ll take a look and see how we might clean that up.

    Thanks for the discussion so far — this has been helpful.

  8. Bruce Fields

    You’re correct that the change attribute is usable only as a sign that something has changed, not as some kind of count of the number of changes. In fact, it doesn’t even have to increase, it just has to change.

    Also, it’s not allowed by the spec, but knfsd just uses ctime for change attribute on filesystems (most of them!) that don’t support change attribute. There’s not much else it could do.

    NFSv4.2 provides a new attribute that lets a server tell you if its change attributes are stronger or weaker than as originally specified in 4.0:


    That’s unimplemented. And note there’s been skepticism from filesystem developers about whether it’s possible to define reasonable semantics in the NFS4_CHANGE_TYPE_IS_VERSION_COUNTER case.

    1. Jeff Layton Post author

      Thanks for that pointer, I had forgotten about change_attr_type. The MONOTONIC_INCR vs VERSION_COUNTER distinction is a little subtle, but I think we can probably use MONOTONIC_INCR for ceph. I gather that with that one, you can only look to see if one is larger than another and so you can’t infer anything else. VERSION_COUNTER would be pretty hard to implement in ceph, I think.
