Resin Pro Health System now and in the future
Resin Pro has made significant improvements to its already capable Health System, including reporting and post mortem triggering. Resin Pro provides a level of reliability and system transparency that is unparalleled in the Java EE space. You can see what is going on in every server node in the cloud (or on just a single server for smaller shops). If a problem occurs with your code, or with library code that you use, Resin can give you a snapshot of the server state. Imagine having just-in-time profile data and the information needed to diagnose a problem. Agile devops!
Genesis of the Resin Health System:
The Health System grew organically, not in a vacuum. When you buy Resin Pro, you get our world class support. When you send a question or issue, our core engineering team answers. Our core engineers use the Resin Health System to develop Resin and to provide support. For example, the anomaly detection feature, described below, came about while supporting a customer. We added the Health System because we needed it to provide support. You can use it to improve support for your own applications.
Recently a Fortune 100 customer needed help diagnosing some very tricky issues. We employed the Resin Health System. The issues were in their code and in some 3rd party library code. Tracking down these issues using conventional mechanisms would have been arduous. As a reward, they greatly expanded their deployment of Resin Pro. The Resin Health System was born and forged out of real life support needs.
Non Stop Resin uses the Resin Health System:
As you may know, Resin runs in a Non Stop Resin mode. Non Stop Resin mode differentiates Resin from the crowd of application servers. It is one of the reasons that Resin Pro is the server of choice for very large deployments and for OEM products like Network Appliances where reliability is paramount. Non Stop reliability is the difference between sleeping at night and getting support calls at 3 AM. You don’t have to be Salesforce.com or Cisco to have mission critical requirements; even small department level deployments can be mission critical.
To achieve Resin Non Stop mode, Resin employs a lightweight watchdog process that monitors the responsiveness of the Resin process and restarts the Resin application server if it becomes unresponsive. For a while now the watchdog process has worked in concert with the Resin Health System to improve the ability to detect issues. As the Health System improves, so does Resin’s Non Stop mode. The issue that causes the restart can be a bug in your code, a bug in library code that you use, a denial of service attack, an unexpected spike of use, queries that are suddenly taking a lot longer to run, etc.
Resin Health System:
The Resin Health System can track runtime Resin server, operating system and Java virtual machine metrics like request count per minute, heap space, tenured memory, GC time, thread count, blocked thread count, SQL query time, CPU utilization, file descriptor usage, and much more. The Resin Health System has a web interface where you can visualize what is going on with every node in your deployment, along with baseline data to see if anything has changed. While this web interface is useful, it is mere eye candy compared to what you can do with the Resin Health System.
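To give a feel for a few of the JVM metrics listed above, here is a minimal sketch that reads heap usage, thread count and blocked thread count through the standard java.lang.management API. This is not how Resin collects its data; the class name JvmMetricsSnapshot and its fields are hypothetical and exist only for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration only; this class is not part of Resin.
class JvmMetricsSnapshot {
    final long heapUsedBytes;
    final int threadCount;
    final int blockedThreadCount;

    JvmMetricsSnapshot(long heapUsedBytes, int threadCount, int blockedThreadCount) {
        this.heapUsedBytes = heapUsedBytes;
        this.threadCount = threadCount;
        this.blockedThreadCount = blockedThreadCount;
    }

    // Reads current values from the JVM's built-in management beans.
    static JvmMetricsSnapshot capture() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        int blocked = 0;
        // Entries can be null for threads that exited between the two calls.
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return new JvmMetricsSnapshot(
            memory.getHeapMemoryUsage().getUsed(),
            threads.getThreadCount(),
            blocked);
    }

    public static void main(String[] args) {
        JvmMetricsSnapshot s = capture();
        System.out.println("heap used (bytes): " + s.heapUsedBytes);
        System.out.println("threads:           " + s.threadCount);
        System.out.println("blocked threads:   " + s.blockedThreadCount);
    }
}
```

Sampling values like these on a schedule and storing them is what makes baselines and graphs possible; the sketch only shows a single reading.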
The key to the Resin Health System is the ability to set limits and rules that trigger actions like performing a thread dump, running a CPU profile, running a heap dump, generating a report, sending an email and restarting the server. Resin Pro is preconfigured so you can take advantage of this system right away.
Resin Health System Reporting:
Resin uses this monitoring data to create summary reports, post mortem reports and other custom reports (PDF reports). These reports contain key configuration data, identity data, and graphs of server metrics. You can configure Resin to generate a weekly server status report. You can keep these reports around as a historical record of the state of your servers. These reports are essential for diagnosing issues that might happen in the future. Baseline historical data can give you a lot of perspective in interpreting current server state when problems occur. To know where you are, it is important to know where you have been.
Health Monitor Triggers:
A good way to describe this is as follows: you can set up action triggers based on maximum limits. For example, if the server CPU is at 95% for three minutes straight, then go ahead and generate a thread dump, a heap dump and a lightweight CPU profile, then restart Resin. It is a bit more complicated than this, as Resin does this in an efficient way with a primary monitor and then a recheck monitor once some limit is first met. This capability has been in Resin 4 for a while.
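The "limit held for several checks in a row" idea can be sketched in a few lines of Java. This is an assumption-laden illustration, not Resin's actual monitor or its configuration syntax: the class SustainedLimitTrigger and its method names are hypothetical, and it assumes one sample per minute so that three consecutive samples stand in for "three minutes straight".

```java
// Hypothetical illustration only; this class is not part of Resin.
class SustainedLimitTrigger {
    private final double limit;
    private final int requiredSamples;
    private int consecutive;

    SustainedLimitTrigger(double limit, int requiredSamples) {
        if (requiredSamples < 1) {
            throw new IllegalArgumentException("requiredSamples must be at least 1");
        }
        this.limit = limit;
        this.requiredSamples = requiredSamples;
    }

    // Returns true once the value has met the limit on requiredSamples
    // consecutive calls. The counter resets after firing or after any
    // sample below the limit.
    boolean sample(double value) {
        if (value < limit) {
            consecutive = 0;
            return false;
        }
        consecutive++;
        if (consecutive < requiredSamples) {
            return false; // limit met, but keep rechecking before acting
        }
        consecutive = 0;
        return true;
    }

    public static void main(String[] args) {
        // Assumed: one CPU load sample per minute, limit of 95%, three in a row.
        SustainedLimitTrigger trigger = new SustainedLimitTrigger(0.95, 3);
        double[] cpuLoadPerMinute = {0.40, 0.97, 0.96, 0.50, 0.95, 0.98, 0.99};
        for (int i = 0; i < cpuLoadPerMinute.length; i++) {
            if (trigger.sample(cpuLoadPerMinute[i])) {
                System.out.println("minute " + i + ": limit sustained, run actions");
            }
        }
    }
}
```

Requiring consecutive rechecks keeps a single brief spike from causing a restart, at the cost of reacting a little later to a real problem.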
Post Mortem Analysis, critical data when it is needed most:
In this release, the post mortem analysis report produces a much fuller snapshot of the server. When Resin restarts for a health problem, the Resin Health System takes the data collected about the state of the server and generates a post mortem PDF report. This post mortem report has the following:
- a summary of the system
- metering graphs on everything Resin Health System tracks (heap space, number of threads, etc.)
- a complete sorted and filtered thread dump
- a CPU profile
- warnings and errors from the log file
- a JMX parameter dump
- and much more
The post mortem report can tell you exactly what was going on in the system just before it died. For example, not only does the post mortem report tell you that the CPU was busy, it tells you what the CPU was busy doing.
You can also trigger snapshot summary reports from the command line or from the admin page (for example if you are running a load test and want to take a snapshot at different points during the load test).
Think of the alternative of not having such a system. You have a system that is about to die: say a memory leak, a denial of service attack, or a long running query that causes a backup in threads, which causes objects to stay around longer, which causes memory to become tenured instead of collected in the eden space, which causes high GC times and an eventual out of memory exception. Whew!
Now the system goes down. What happened? Will it happen again? With Resin Pro, not only will the server restart and continue to operate, but a report is generated so you can diagnose what happened.
Now for a small deployment of one to three servers, this can be convenient and helpful. For a large cloud deployment, this is a lifeline. Cloud enabled Java EE application servers need a Non Stop mode with health snapshots. The more server nodes, the more you need JIT profiling and snapshots.
Near future developments: Anomaly detection, detecting problems before they become problems
Another big improvement coming up in the next few releases of Resin Pro (Resin Pro releases every two to three weeks) is anomaly detection. Not only can we detect limits, but we can detect anomalies. Problems can be detected before they become big problems.
This is best understood with a scenario. Let’s say that you have a page that has a thread blocking issue. This page does not get hit very often: sometimes not for a whole day, but sometimes ten times in one hour. The end of the quarter is coming up soon, the accounting department is going to start crunching numbers, and this page is going to get hit 1000 times a day. Do you want to detect issues and diagnose problems now, or in a month when the big end of the quarter deadline arrives?
To continue this scenario: normally, there are not many threads in the blocked state. Resin Pro’s anomaly detection can detect if the number of blocked threads increases suddenly. Anomaly detection can fire off an action like creating a summary snapshot PDF, sending an email or just logging the anomaly. This anomaly detection is based on statistical analysis of the baseline and can be set up for any server metric. The anomaly detection is tuned to avoid false positives and to avoid heavyweight processing.
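One simple way to picture baseline-based anomaly detection is a rolling mean and standard deviation with a deviation threshold. The sketch below uses that technique purely as an illustration; the article does not say which statistics Resin uses, and the class BaselineAnomalyDetector, its parameters and the sample numbers are all hypothetical.

```java
// Hypothetical illustration only; this class is not part of Resin, and the
// z-score approach here is an assumed stand-in for Resin's own analysis.
class BaselineAnomalyDetector {
    private final double[] window;   // most recent samples (the baseline)
    private final double zThreshold; // how many deviations count as anomalous
    private final double minSpread;  // floor so a flat baseline is not oversensitive
    private int count;
    private int next;

    BaselineAnomalyDetector(int windowSize, double zThreshold, double minSpread) {
        if (windowSize < 2 || minSpread <= 0) {
            throw new IllegalArgumentException("need windowSize >= 2 and minSpread > 0");
        }
        this.window = new double[windowSize];
        this.zThreshold = zThreshold;
        this.minSpread = minSpread;
    }

    // Compares the value against the current baseline, then records it.
    // Nothing is reported until the baseline window is full.
    boolean observe(double value) {
        boolean anomaly = false;
        if (count == window.length) {
            double mean = 0;
            for (double v : window) mean += v;
            mean /= window.length;

            double variance = 0;
            for (double v : window) variance += (v - mean) * (v - mean);
            variance /= window.length;

            double spread = Math.max(Math.sqrt(variance), minSpread);
            anomaly = Math.abs(value - mean) / spread > zThreshold;
        }
        window[next] = value;
        next = (next + 1) % window.length;
        if (count < window.length) count++;
        return anomaly;
    }

    public static void main(String[] args) {
        BaselineAnomalyDetector detector = new BaselineAnomalyDetector(8, 3.0, 0.5);
        // Made-up blocked thread counts: a quiet baseline, then a sudden jump.
        double[] blockedThreads = {1, 0, 2, 1, 1, 0, 2, 1, 2, 25};
        for (double sample : blockedThreads) {
            if (detector.observe(sample)) {
                System.out.println("anomaly: blocked threads = " + sample);
            }
        }
    }
}
```

The work per sample is a couple of passes over a small array, which fits the goal of avoiding heavyweight processing; raising the threshold or the spread floor trades sensitivity for fewer false positives.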
With Resin Pro you don’t wait until a problem causes the server to run out of a resource, you detect and help diagnose the problem before it becomes a big problem.
Future of our Health System:
After years of providing support by staring at thread dumps, heap dumps and meter graphs, then diagnosing issues ranging from locking issues, slow SQL queries, floating garbage and memory leaks to GC deadlocks, we have learned a thing or two about what to look for and about the key indicators of problems. We plan on putting some of that information in the post mortem snapshot PDF to make your job even easier when you run into problems. In our world view, it is not enough to have all of this information and data at your fingertips. Why not go ahead and do some analysis of the data and get you moving in the right direction? Stay tuned: when we add this new feature, we will talk about it here.
Conclusion:
The Health System grew organically, not in a vacuum. When you buy Resin Pro, you get our world class support. When you send a question or issue, our core engineering team answers. The Resin Health System is used by Caucho to provide world class support. It makes our life easier. Our core engineers use it to develop Resin and to provide support.
Caucho focuses on product development, not on consulting and services. Other companies make their money from consulting and services. Creating a product that is easier to support does not motivate them, and could hurt their bottom line. Conversely, Resin Pro is easy to set up, monitor and use because our revenue comes from selling Resin Pro licenses. The Health System helps us to provide world class support more effectively.
When considering an application server, remember true ROI. Free is not free if your site goes down and you don’t know why. Resin Pro comes with web caching, true cloud support, web proxies, URL rewriting, distributed caching, messaging, and everything you need to write scalable web applications. The Health System fits well with the Resin Pro philosophy of a batteries included application server.
