Distributed Caching

Because the main purpose of Resin 4.0 was support for dynamic servers, we needed to upgrade our distributed session management to handle the case where servers can appear and disappear frequently. Resin 3.1 session replication relies on a static set of servers to choose backup and triplicate servers. Dynamic servers change that model because a 3.1 session backup might be shut down indefinitely. So we needed a new architecture.
At the same time as we redesigned sessions, I wanted to generalize the distributed store to support standard caching and storage using the javax.cache API, while retaining the scalability and reliability of our original design.
- A hub of 3 fully-redundant servers for reliability (the triad)
- Other servers dynamically appear and disappear for deployment flexibility
- Updates using lightweight messaging using BAM/HMTP
- Cache entry ownership and leasing for performance
- Support storage (infinite expire), caching (timed expire), and session (timed idle invalidation)
- Support serialized objects (using Hessian), and binary data
The cache architecture is heavily influenced by the messaging and extra complications of distribution.
Splitting Data from Map.Entry
The most fundamental design decision was splitting the data (payload) of a cache entry from the Map.Entry tuple. The split has a few benefits:
- small, fixed size Map.Entry updates for get requests, put updates, and startup synchronization
- synchronization and transaction uses small messages, so rollbacks and conflicts are cheap to resolve
- data is constant, named by its hash like the .git repository, so never needs rollbacks or synchronization
- startup and crash recovery can query with small, efficient packets and only needs to transfer missing data
- put/remove are unified with a single message (simplifying testing considerably)
The Map.Entry is a small, fixed-length item with three key pieces of information: the key hash (sha-256), the value hash (sha-256), and an update version:
<key, value, version>
The size of the tuple is fixed because the key and value are fixed-length hashes, and the version is a 64-bit integer: 32 bytes for the key, 32 bytes for the value, and 8 bytes for the version.
The 32 byte hash for the key and value is used to lookup the actual payload. If a server doesn’t have the data, it can query a triad server for the missing information. On startup, a server can use any data is already has locally, and only needs to exchange messages for the Map.Entry to validate changes. The triad serves as the backup for all the servers in a cluster pod containing up to 64 servers. Clusters with more than 64 servers split into multiple pods. Each cache Map.Entry and data is assigned a triad server as its owner, based on the 32-byte hash. The data hash determines the owning server for the data, and the key hash determines the owner server for the Map.Entry. So it’s entirely possible for a cache.put() to have different owners for the Map.Entry and the actual data. Assigning ownership helps the triad manage load, because each triad server only owns a third of the keys and a third of the data. So the traffic and cpu requirements are split into three. The ownership also helps with synchronization because the owning server for a key can manage the cache versions in a single JVM, avoiding the complications of a fully-distributed synchronization. Each dynamic server stores a local copy of a cache entry, synchronizing with the triad as necessary. Because the dynamic server only has a cached copy, it can be safely taken down or started at any time. For efficiency, a server can request a lease on a cache entry, minimizing traffic with the triad as long as the lease is active. A HTTP session will use a 5 minute lease together with sticky sessions to keep all traffic local to a server for the duration of the session, making distributed sessions almost as fast as a locally managed session. To make the cache discussion concrete, it’s probably best to show the actual messages (somewhat in progress.) Hopefully by tomorrow I’ll have some time to add another post with those details.
<348f...c0a1, 8efa...365f, 45>
Replication and Scaling the Triad
Dynamic Servers
Onward to messages

February 23rd, 2009 at 3:48 pm
Would it be possible to use Terracotta for distributing the cache? have you considered an official Terracotta support (TIM)?
Thanks,
Ophir
February 23rd, 2009 at 8:23 pm
That doesn’t make sense in this case, because Resin users can always use Terracotta. Having Terracotta distribute the cache really would mean using Terracotta, so we wouldn’t be adding any value with that kind of integration.
This cache is primarily intended to replace something like ehcache in situations where adding and removing dynamic servers is important.