Over the last few months, I’ve been talking to a lot of people and doing conference sessions about cloud computing, and I’ve learned a lot about the different architectures in this space. I’m still very happy with our architecture (perhaps even happier) after all of these discussions. In past blog posts and in our whitepaper, we’ve explained how the new Resin 4 cloud architecture works. Now I’ll talk a little about why it’s a nice alternative to other approaches.
Forming the cluster
Our view of cloud support for an application server is centered around clustering. Clustering offers a number of advantages, notably:
- Shared caching of pages and objects
- Shared sessions
- Shared application view
- Intelligent load balancing
In other words, if you’re not clustering in the cloud, you’re not taking full advantage of horizontal scaling. Of course, providing all of these features is no easy feat. We need to consider a whole set of issues in order to produce this functionality:
- Network topology/transport
- Storage (location, redundancy, replication speed)
- Adding and removing cluster nodes (dynamic nodes)
The network and storage questions have already been addressed by past generations of clustering technologies, but the addition of dynamic cluster nodes means we need to reconsider everything; these concerns are not orthogonal. In fact, dynamic nodes are a cross-cutting concern, so how we do network and storage will determine how we can and should deal with them. Let’s start with the network.
IP multicast lets network nodes register to listen on a group address; every datagram sent to that address is then delivered to all of the listeners. Traditionally, multicast has been used for streaming media and other one-way communication. Clustering over IP multicast is a rather clever use of it, in that all the nodes in the cluster are both senders and listeners on the same channel. The cool thing about IP multicast is that you can push a lot of logic down to the network layer; in particular, maintaining the membership list and communicating with the cluster members becomes easy. One issue is that you now need to worry about reliability at whatever level you’re implementing the cluster. This isn’t a huge problem, in that you can do regular heartbeats or just resend packets, but then you either incur high message latency or potentially waste a lot of bandwidth on a small set of faulty nodes. The real problem with IP multicast is that many commercial cloud platforms don’t support it, for security and management reasons. JGroups (the group communication library under Infinispan) and Hazelcast use IP multicast to show off features like automatic node discovery, at least in demos.
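To make that concrete, here’s a minimal sketch of multicast-based discovery in plain Java. The group address, port, and heartbeat format are all made up for illustration; none of the products above use this exact protocol.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

// A toy sketch of multicast node discovery. The group address, port,
// and message format are hypothetical, not any product's protocol.
public class MulticastDiscovery {
    private static final String GROUP = "230.0.0.1"; // hypothetical group address
    private static final int PORT = 4446;

    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(group); // every node is both a sender and a listener

            // Announce ourselves to whoever is listening on the group.
            byte[] hello = "HEARTBEAT node-1".getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(hello, hello.length, group, PORT));

            // Receive announcements from other nodes (including our own).
            byte[] buf = new byte[256];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            socket.receive(packet); // blocks; real code would loop with timeouts
            System.out.println("Discovered: " + new String(
                packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8));

            socket.leaveGroup(group);
        }
    }
}
```

Note that nothing here tracks who is still alive; that reliability layer, with its heartbeats or resends, is exactly what you end up building yourself on top.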
To get around the problems of IP multicast, a lot of implementations also allow TCP or UDP connections. The most obvious topology for these connections is a full mesh: each node connects to all of the other nodes. This approach lets you talk to anyone in the group at any time, which makes implementation easy. The drawbacks are the network overhead of O(N²) connections, with everyone talking to everyone else, and that dynamic nodes in the cluster are more difficult to manage than with IP multicast. TCP versus UDP is the usual trade-off of reliability versus overhead.
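Here’s a sketch of what the full-mesh approach looks like from one node’s perspective. The class and the static peer list are hypothetical, not any product’s code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of the full-mesh topology: every node dials every peer in a
// static list, so an N-node cluster maintains O(N^2) connections overall.
public class FullMeshCluster {
    private final Map<String, Socket> peers = new HashMap<>();

    public void connectAll(List<InetSocketAddress> peerAddresses) {
        for (InetSocketAddress addr : peerAddresses) {
            try {
                Socket s = new Socket();
                s.connect(addr, 5_000); // this node -> every other node
                peers.put(addr.toString(), s);
            } catch (IOException e) {
                // With a static list, we can't tell "temporarily down"
                // from "removed from the cluster" -- the management
                // problem that dynamic nodes introduce.
                System.err.println("Peer unreachable: " + addr);
            }
        }
    }
}
```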
Resin uses TCP, but in a less heavyweight way. To show why, let’s discuss the node management issue in more depth, because this is one area where Resin really differentiates itself. The other platforms I’ve seen use static configuration files to maintain the list of nodes in a TCP- or UDP-based cluster. Resin does have a base configuration file shared by all nodes, but it lists only the first three static nodes in the cluster, called the triad. Nodes that are not in the triad use this configuration file to contact the triad and authenticate themselves when joining the cluster. This technique lets nodes both join and leave the cluster dynamically while getting around the IP multicast issue on networks that don’t allow it. The triad also means we have only O(N) TCP connections up at any given time: only the triad nodes hold connections to all the other nodes, while each non-triad node maintains just three.
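Here’s a sketch of the joining side of that idea. The class name and flow are mine for illustration, not Resin’s actual wire protocol.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// A sketch of triad-style joining: a new node knows only the three
// triad addresses from the shared base configuration and contacts one
// of them to authenticate and join. (Hypothetical, not Resin's code.)
public class TriadJoin {
    public Socket join(List<InetSocketAddress> triad) throws IOException {
        for (InetSocketAddress hub : triad) {
            try {
                Socket s = new Socket();
                s.connect(hub, 5_000);
                // ... authenticate and announce ourselves to the triad ...
                return s; // a constant number of connections, not N-1
            } catch (IOException e) {
                // Try the next triad member; joining works as long as
                // any one of the three hubs is reachable.
            }
        }
        throw new IOException("No triad member reachable");
    }
}
```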
Storage of data in the cluster has two major aspects: how many copies to keep and where to put them. Again, these are not orthogonal questions…
The conceptually easiest way to distribute data throughout the cluster is to replicate everything everywhere, giving you maximum locality and redundancy. Of course, the maintenance is painful, especially when you add and remove nodes at runtime, and I don’t think any of the implementations I’ve seen actually replicate to every node. The next approach is to pick a constant number of copies (C) for each piece of data and distribute it to that many nodes. The advantage is that when you add a node, you may only have to replicate a small amount of data to it on startup. When you remove or lose a node, you don’t lose any data either. But if you lose more than C nodes at a time, you may start to lose data. Everyone does this; how they distribute the copies is where the different products distinguish themselves, as the sketches below illustrate.
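Abstractly, every fixed-C scheme answers the same question: given a piece of data and the current membership, which C nodes hold the copies? A hypothetical interface (my naming, not from any of these products) pins down what the next two approaches each have to decide:

```java
import java.util.List;

// A hypothetical abstraction over C-copy placement; the names are mine,
// not from Resin, Hazelcast, JBoss Cache, or JGroups. Buddy replication
// and hash-based placement are two different answers to this question.
public interface ReplicaPlacement {
    /** Returns exactly {@code copies} distinct nodes that should store the key. */
    List<String> ownersFor(String key, List<String> members, int copies);
}
```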
Let’s say that we’ve decided to replicate each piece of data 3 times. JBoss Cache (at least in older versions) used a policy called buddy replication, where each node has a group of “buddies” to which it distributes all of its data. Each node is thus the primary store for a certain set of data, and a piece of data is only lost if the primary and the two buddies replicating it go down. The drawback of this scheme is that because the replica set is tied to the node, losing the primary and its two buddies loses all of the primary’s data, not just one piece. Buddy replication uses the node as the primary key in determining the replication scheme for a piece of data. The next approach is to use the data itself as the key.
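A sketch of buddy-style placement, with made-up names; this is not JBoss Cache’s actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A sketch of buddy replication: the replica set is keyed off the node.
// Each node has a fixed buddy group that receives copies of everything
// the node holds as primary. (Illustrative names, not JBoss Cache's.)
public class BuddyPlacement {
    private final Map<String, List<String>> buddiesOf; // node -> its fixed buddies

    public BuddyPlacement(Map<String, List<String>> buddiesOf) {
        this.buddiesOf = buddiesOf;
    }

    /** Every piece of data primaried on this node lives on exactly these servers. */
    public List<String> replicaSet(String primaryNode) {
        List<String> owners = new ArrayList<>();
        owners.add(primaryNode);
        owners.addAll(buddiesOf.get(primaryNode)); // e.g., two buddies for three copies
        return owners; // lose all of these at once and everything on the primary is gone
    }
}
```

Notice that the answer depends only on the node, never on the data, which is exactly the property the next approach flips.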
Resin has done a form of this “data-distributed replication” for quite a while, along with Hazelcast and newer versions of JGroups. Essentially, you use some piece of metadata to decide which nodes replicate which data. Resin 3.1 and earlier had distributed sessions that picked a primary, secondary, and tertiary store for each session, which avoids the buddy replication problem above to some extent. Hazelcast and JGroups also do this now, but they add the ability to manage the data across a changing set of nodes. They essentially take a secure hash of the data’s metadata and compute it mod N (where N is the current number of nodes) to determine the primary store. The secondary and tertiary stores can be found by stepping to the next node in the ordering or some similar operation. The issue is that N, or the membership of the cluster, can change at any time, and each time it does, you must recompute and reshuffle the data. It’s not quite as bad as it might seem at first glance, because the mod N mapping produces a lot of collisions, but the shifts can still be potentially large.
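A sketch of hash mod N placement as described above; the hash folding and the “step to the next node” rule are illustrative assumptions, and the real products differ in these details.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// A sketch of data-keyed placement: hash the key, take it mod N for the
// primary, then take the next slots for secondary and tertiary stores.
public class HashPlacement {
    public List<String> ownersFor(String key, List<String> members, int copies) {
        int n = members.size();
        int primary = Math.floorMod(secureHash(key), n); // primary = hash mod N
        List<String> owners = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            owners.add(members.get((primary + i) % n)); // secondary, tertiary = next slots
        }
        return owners;
    }

    // Fold the first four bytes of a SHA-256 digest into an int.
    private int secureHash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(key.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                 | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-256 is always available
        }
    }
}
```

The reshuffling problem is visible right in the first line of ownersFor: change the members list and the mod N result changes for many keys at once.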
Resin 4 takes a different approach: it stores the data in triplicate on the three triad nodes mentioned earlier. This may seem unsafe at first, because now we’re pinning all of our data to three servers, but it does have some advantages. First, as an administrator, if you keep those three nodes up and running, you won’t lose any data. With the more distributed data schemes, you may lose data if any three nodes go down; you won’t lose it all, of course, but now you have to pay attention to every node in the cluster instead of just three. The other advantage is that there’s no data shuffling when nodes are added to or removed from the cluster.
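Here’s a toy, in-process model of the triad idea; this is not Resin’s replication code, just the shape of the invariant.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A toy model of triad storage: every write goes, in triplicate, to the
// same three well-known hubs, so membership changes never move data.
public class TriadStore {
    private final List<Map<String, byte[]>> triad = List.of(
        new ConcurrentHashMap<>(), new ConcurrentHashMap<>(), new ConcurrentHashMap<>());

    public void put(String key, byte[] value) {
        for (Map<String, byte[]> hub : triad) {
            hub.put(key, value); // one copy per triad member, always the same three
        }
    }

    public byte[] get(String key) {
        // Any triad member can serve the read; try them in order.
        for (Map<String, byte[]> hub : triad) {
            byte[] v = hub.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

Because reads and writes always touch the same three hubs, adding a tenth or a hundredth node to the cluster costs nothing in data movement.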
All of the approaches and techniques I’ve mentioned have merit and appropriate applications. Which one you choose should be based on your situation’s management needs, performance requirements, and the reliability assumptions of your environment. Resin 4 offers a unique approach that assumes IP multicast is not always available and that administrators can maintain three specific nodes more effectively than any three nodes in a cluster. With these assumptions, we gain in manageability, in network and storage overhead, and in performance during cluster reconfiguration.
Have you tried Resin 4 or any other similar technologies? What are your thoughts? Are there any cool approaches that I left out?