Active/Active NFS over CephFS

Since returning to Red Hat, I’ve done a fair bit of work on enabling nfs-ganesha to serve out cephfs. The operating idea has been to push as much infrastructure as we can into CephFS itself, so that we can properly implement NFS semantics on top of it.

nfs-ganesha then is just another cephfs client. It hands out NFS state to the clients and holds the corresponding CephFS state on their behalf.

With a single NFS server, this works reasonably well. We’ll even have delegation support (coming soon!). The question now is — can we stand up an arbitrary set of servers that only interact with one another via ceph?

The advent of technologies like kubernetes makes it attractive to be able to stand up cookie-cutter NFS servers dynamically as the load requires. How loosely can we couple them without breaking NFS recovery semantics?

Note that in the discussion below, I’m deliberately ignoring other applications besides NFS that are interacting with the cluster during the grace period. That’s potentially a real problem, but for now I’m ignoring it for the sake of simplicity.

The basic problem:

The main issue here is that NFS server state is not persistently stored in a fine-grained way. Most of the state (opens, locks, delegations, layouts) is held only in memory. That’s not usually a problem on a standalone NFS server.

When a NFS server crashes and restarts, it comes back up holding no state at all on behalf of its clients. We inform the clients of the crash, and they then start reclaiming their state within a certain window (usually 2 lease periods, around 90-120s depending on the implementation). During that grace period, no new state can be established; clients may only reclaim state they held before the crash. Once the grace period has ended, the reverse is true: reclaims are refused and only new state may be established.

There are some corner cases here involving network partitions and multiple reboots that require us to maintain, on stable storage, a list of clients that hold at least one open. That’s also fairly simple to handle with a single server.
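
To make those rules a bit more concrete, here’s a minimal sketch in Python of the bookkeeping a single server needs: a stable-storage list of clients that are allowed to reclaim, plus a grace-period check that only honors reclaims from those clients. The class and method names are made up for illustration; this is just a model of the logic, not real server code.

import time

LEASE_PERIOD = 60          # seconds; grace is usually ~2 lease periods


class RecoveryDB:
    """Stable-storage record of clients holding at least one open.

    A real server keeps this on disk (or in RADOS, as described below);
    a plain set stands in for it here.
    """
    def __init__(self):
        self.clients = set()

    def add(self, client_id):
        self.clients.add(client_id)

    def knows(self, client_id):
        return client_id in self.clients


class NFSServer:
    def __init__(self, recovery_db):
        self.recovery_db = recovery_db
        # On startup, enforce a grace period of two lease periods.
        self.grace_until = time.monotonic() + 2 * LEASE_PERIOD

    def in_grace(self):
        return time.monotonic() < self.grace_until

    def open_file(self, client_id, reclaim=False):
        if self.in_grace():
            # During grace: reclaims only, and only from clients that
            # were recorded before the restart.
            if not reclaim or not self.recovery_db.knows(client_id):
                raise PermissionError("NFS4ERR_GRACE")
        elif reclaim:
            # After grace: the reverse -- no reclaims are allowed.
            raise PermissionError("NFS4ERR_NO_GRACE")
        # Record that this client now holds state worth reclaiming.
        self.recovery_db.add(client_id)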

Scaling out, the simple way:

There are ways to scale out without needing to do a lot of complex coordination. If you know that an inode can only ever be served out by a single NFS server, then you can really treat it as wholly independent of other servers. There can be no conflicting state acquired via other NFS servers that way.

If you imagine a set of servers that only serve out a single export each via an individual IP address, with no hardlinks that cross export boundaries, then that should work just fine today without any external coordination.

With something like kubernetes, where you could potentially restart a cookie-cutter NFS server very quickly, this might be a workable solution. It effectively allows you to scale out as the number of exports grows.

Serving the same content in parallel:

Those configurations are potentially limiting, though. We really want to be able to scale out according to the number of clients we have, not the number of exports.

That means that we need to allow multiple servers to export the same data in parallel.

Suppose we have a setup that looks like this:

+----------------+      +----------------+
|                |      |                |
|  NFS Client 1  +------+  NFS Server 1  +----------+
|                |      |                |          |
+----------------+      +----------------+    +-----+----+
                                              |          |
                                              |  CephFS  |
                                              |          |
+----------------+      +----------------+    +-----+----+
|                |      |                |          |
|  NFS Client 2  +------+  NFS Server 2  +----------+
|                |      |                |
+----------------+      +----------------+

In this configuration, we don’t float IP addresses between NFS Server 1 and 2 like we would with a traditional clustered configuration. Instead, we assign an address to each NFS server, and reconstitute the server from scratch if it crashes.

All of the stable storage info for the NFS server is stored in RADOS, though each NFS server has its own database for recovery.
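
As a rough illustration, storing that per-server recovery database in RADOS could look something like the snippet below, using the python-rados bindings and omap keys on a per-server object. The pool name, object name, and key format are assumptions made for the example; the layout ganesha actually uses for this may differ.

import rados

# Hypothetical names, just for this sketch.
POOL = "nfs-ganesha"
RECOVERY_OID = "recovery.nfs-server-1"   # one object per NFS server

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

# Record that a client holds reclaimable state: one omap key per client.
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ("client-1.example.com",), (b"",))
    ioctx.operate_write_op(op, RECOVERY_OID)

# On restart, read back the set of clients allowed to reclaim.
with rados.ReadOpCtx() as op:
    it, _ = ioctx.get_omap_vals(op, "", "", 1000)
    ioctx.operate_read_op(op, RECOVERY_OID)
    allowed = [key for key, _val in it]

print("clients allowed to reclaim:", allowed)

ioctx.close()
cluster.shutdown()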

Now, suppose NFS Server 1 crashes. At that point, NFS Client 1 knows what NFS state it holds. The CephFS server knows what capabilities and state NFS Server 1 held before it crashed.

Kubernetes (or whatever is orchestrating things) detects that it’s down and creates a new instance of NFS Server 1. That new instance comes up knowing only the list of clients that are allowed to reclaim their state. What we need to happen at this point is for CephFS to release the state that was held by the previous NFS Server 1 incarnation, and allow NFS Client 1 to reclaim it.

In the window between when the CephFS caps held by the earlier NFS Server 1 incarnation are trashed and when NFS Client 1 reclaims its state, NFS Client 2 (talking to NFS Server 2) could race in and acquire state that conflicts with what NFS Client 1 holds.

We must prevent this from occurring. The simplest way to do that is to force all of the other NFS servers into the grace period prior to cephfs revoking the caps that the old client held.

In practical terms, this means that we need to ensure that all of the remaining NFS servers are enforcing the grace period prior to restarting a crashed NFS server in the cluster. They are all only allowed to exit the grace period once all of the servers have stopped allowing reclaim.

That also neatly handles the case where we have a clean start from zero state in the cluster. All servers will go into their grace period until clients are finished reclaiming (or all of the individual grace periods time out).
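
Here’s a simplified model of that coordination: a shared record (which would live in RADOS in practice) tracking which servers still need the grace period, where the cluster-wide grace period only lifts once every server has stopped allowing reclaim. The names are invented, and the atomic-update machinery that RADOS would provide is left out.

class ClusterGrace:
    """Toy model of a shared, cluster-wide grace-period record."""

    def __init__(self, servers):
        self.servers = set(servers)
        self.in_grace = False
        self.need_grace = set()   # servers whose clients may still reclaim

    def start_grace(self):
        """Called before restarting a crashed server, or on a clean
        start from zero state: every server must enforce grace."""
        self.in_grace = True
        self.need_grace = set(self.servers)

    def reclaim_complete(self, server):
        """A server reports that its clients finished reclaiming (or
        its local grace timer expired)."""
        self.need_grace.discard(server)
        if not self.need_grace:
            # Only when *all* servers have stopped allowing reclaim
            # may the cluster-wide grace period be lifted.
            self.in_grace = False


grace = ClusterGrace(["nfs-server-1", "nfs-server-2"])
grace.start_grace()                       # nfs-server-1 crashed
grace.reclaim_complete("nfs-server-1")
assert grace.in_grace                     # server 2 is still enforcing
grace.reclaim_complete("nfs-server-2")
assert not grace.in_grace                 # safe to hand out new state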

Handling an excessively tardy server restart:

It’s also possible for a NFS server to go down while clients still hold state on it, but then take so long to come back that the MDS decides to give up on it and revokes its caps anyway.

Clients of such a server must not be allowed to reclaim any state, as we cannot guarantee consistency in that situation. The simplest way to do that is to trash the server’s persistent client database whenever the caps it previously held were revoked outside of a grace period. That will prevent any reclaim. The surviving NFS clients will likely need to be restarted, but that’s really the best we can do in that situation.
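
A small sketch of that startup decision, where the caps_revoked_outside_grace argument stands in for however we actually learn that the MDS revoked the previous incarnation’s caps:

def decide_reclaim_policy(recovery_clients, caps_revoked_outside_grace):
    """Return True if reclaim may be honored from recovery_clients,
    a set of client IDs loaded from the stable-storage database."""
    if caps_revoked_outside_grace:
        # Conflicting state may have been handed out while we were
        # down, so no reclaim can be trusted: trash the client list
        # so that every reclaim attempt is refused.
        recovery_clients.clear()
        return False
    return True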

Decommissioning a server instance with active clients:

If we know that we want to decommission one of the NFS servers (scaling down the cluster at times of low load, for instance), we should be able to use NFS migration to move its clients to other servers (support for that would need to be added to ganesha).

Alternately we could merge its IP addresses and client databases into another existing server. There are several options here. The main point is that we do need to take proactive steps to “drain” a server before decommissioning it.

Implementation:

So, that more or less lays out the rules for how we need various NFS servers to behave. The question now is what will ensure that we follow them.

Traditional clustering does this with stuff like corosync and pacemaker, but having to use that when we already have clustering infrastructure in ceph itself seems wrong.

The current thinking (credit to Patrick Donnelly) is to add a ceph monitor service and map for the NFS gateways. That map would be consumed by a new daemon that would manage the ganesha daemon, possibly using dbus to handle the grace period of a running instance.

Alternately, we implement this as a new shared library in ceph that ganesha would link in. This part is still being designed/determined.
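
As a very rough sketch of the first option, the managing daemon might look like the skeleton below. The gateway-map format and both hooks are pure assumptions for illustration; in practice the map would come from the monitors and the grace request might go through ganesha’s dbus interface or a linked-in library, as described above.

import time


def fetch_gateway_map():
    """Placeholder: consume the proposed NFS gateway map from the ceph
    monitors.  The format assumed here is an epoch plus a cluster-wide
    grace flag."""
    raise NotImplementedError


def request_local_grace():
    """Placeholder: ask the local nfs-ganesha instance to start
    enforcing its grace period (e.g. over dbus)."""
    raise NotImplementedError


def manage_ganesha(poll_interval=5):
    last_epoch = None
    while True:
        gwmap = fetch_gateway_map()
        if gwmap["epoch"] != last_epoch:
            last_epoch = gwmap["epoch"]
            if gwmap["cluster_in_grace"]:
                request_local_grace()
        time.sleep(poll_interval)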

Overall, I’m pretty excited by the potential of this design. I think it should offer us the ability to scale nearly linearly, at least with some common usage patterns.

Future directions:

Beyond this design, we may want to look at how to more effectively move clients around the cluster based on their access patterns. For instance, we may want to explore assigning a NFS server to each CephFS MDS in a multimds cluster, and leverage NFS migration to have clients access objects on the NFS server closest to the authoritative MDS for an inode.

That sort of design might be very interesting coupled with pNFS. A flexfiles layout is something of a natural fit over CephFS.

2 thoughts on “Active/Active NFS over CephFS”

  1. Bruce Fields

    “It’s also possible for a NFS server to go down while clients still hold state on it, but then take so long to come back that the MDS decides to give up on it and revokes its caps anyway.

    Clients of these servers must not be allowed to reclaim any state, as we cannot guarantee consistency in that situation.”

    I’m not sure I follow. Couldn’t you forcibly reset the slow server? Or is your assumption that it’ll have the same problem coming back up from that forced reset?

    There are a lot of potential corner cases when you consider possibly overlapping crashes, lost contact with clients, and other events. I’d like to see a really careful description of how the stable storage information and the grace periods are managed and an argument that it will always guarantee locking consistency. I think that’s doable.

    “Alternately we could merge its IP addresses and client databases into another existing server.”

    I’m more skeptical that this can be made correct in every case. I could be wrong.

    Can we instead run multiple ganesha instances on the same server in this case? (Either in completely separate containers or just listening on different IP addresses?)

  2. Jeff Layton (post author)

    > I’m not sure I follow. Couldn’t you forcibly reset the slow server? Or is your assumption that
    > it’ll have the same problem coming back up from that forced reset?

    This could happen for all sorts of reasons, not just the server being under too much load. Software bugs or an administrative mishap are also possible. Maybe part of the datacenter goes offline and takes the NFS server with it, but not the Ceph MDS?

    The point though is that Ceph is somewhat like NFSv4 in that your state can expire if you don’t constantly renew it. As long as you do constantly renew it, then you know that no other client can acquire state that would conflict with the state that you hold. CephFS prevents that as part of its normal duties.

    Suppose however that a NFS server goes down for a long time, while one of its NFS clients is holding state on that server (e.g. a file lock). Eventually, the Ceph MDS just sees a Ceph client that has gone missing. It won’t hold that state forever however — eventually it’ll blacklist the client (aka fence it), and then forcibly revoke the state that it held.

    Now, suppose a little while later, that blacklist entry is cleared (aka host is unfenced) and then we start the NFS server back up. If we’re not careful then a lingering NFS client from the previous incarnation of the server could try to reclaim that file lock. Maybe it gets it and then it _thinks_ that it had it the whole time. Now we potentially have silent data corruption.

    That’s the situation that we need to prevent. Once any state is forcibly revoked by the MDS in this situation, we need to prevent the NFS server from allowing any reclaim after it comes back up. We cannot guarantee consistency there.

    That mechanism should be very simple though — a flag stored on stable storage someplace that says whether you had your state revoked…or maybe we just clean out the stable-storage client db for that server instance in that case? There are several ways to do it.

    The rules I’m laying out here should cover any situation with network partitions and reboots, as I’m attempting to phrase all of this in terms of whether or not conflicting state could have been set while the NFS server was down. That is really the fundamental principle behind all of this — we have to know whether it’s safe to reclaim or not, and that is the only way to know. That said, if you see problematic situations here, then please do let me know.

    > I’m more skeptical that this can be made correct in every case. I could be wrong.
    > Can we instead run multiple ganesha instances on the same server in this case? (Either in
    > completely separate containers or just listening on different IP addresses?)

    Agreed that having to split and merge recovery databases is pretty nasty. I think it’s possible to make it work, but it’s definitely non-trivial.

    I’m using the term “server” here loosely too. A server in this case may very well be a container with just nfs-ganesha running inside it, with a single IP address and export. It’s quite possible we’ll have several running on a physical host at any given time. Being able to spin up a new instance of this on demand for load balancing or just isolation from other “tenants” in the cloud is one of the goals.

    Still, I think we will eventually have to allow for you to split and merge server instances, in the event that you want to dynamically reduce the number of running daemons after scaling out to handle a large load. I’m not looking to make that work in the initial phase of this, though we do want to be careful to design things in such a way that it’s possible to add that ability later.
