In a previous post, Cloud computing’s impact on operating models, I wrote about how the elastic cloud model requires more engineering than the package-and-ship software model. The elastic cloud model also requires a more advanced monitoring strategy, since the cloud service provider is delivering a real-time, 24×7 autoscaling solution to its customers.
With shipped software, the customer is responsible for monitoring the infrastructure and the application. The customer is also responsible for capacity planning, to ensure that additional infrastructure is procured and ready in time when usage reaches certain thresholds. With the cloud model, the cloud vendor must perform these tasks in real time, automatically scaling the system the instant certain thresholds are hit. The best monitoring strategy for the cloud is a proactive one that detects problems before they have a broad impact on the overall system and the user experience.
There are a number of categories that should be monitored:
- Service Level Agreements (SLAs)
- User metrics
- Log file analysis
- Key Performance Indicators (KPIs)
Monitoring also occurs within the different layers of the cloud stack:
- User Layer
- Application Layer
- Application Stack Layer
- Infrastructure Layer
In addition, there are three distinct domains that need to be monitored:
- Cloud vendor environment
- Cloud application environment
- User experience
In my book, Architecting the Cloud: Design Decisions for Cloud Computing Service Models (SaaS, PaaS, and IaaS), I go into detail on the strategies for each category, each layer of the cloud stack, and each domain, as well as the differences within each cloud service model. In this post I’ll briefly cover a few of these.
Aligning Monitoring Strategy with Development
A best practice for building cloud-based systems is to standardize as much as possible so that a high level of automation can be put in place. Two areas that are important to standardize for monitoring are error trapping and logging. I briefly discussed this topic in the post Logging Strategies in the Cloud, where I stated:
…build a utility service for writing application messages with a common log message format, with standard http error codes, RFC 5424 severity levels, and use a common vocabulary in error descriptions in order to optimize searches and get consistent results.
This advice is critical. In a highly distributed environment with hundreds or thousands of servers, if there are no standard error messages, the contents of the log files cannot be used to determine patterns and may have very little value. The image below is an example of a cloud stack where each layer except the database layer is multi-tenant.
The database layer is sharded by customer. This is a common pattern used to minimize the number of virtual machines required in all of the non-database layers, while providing a layer of segregation and additional security per customer at the data layer. But what about the logs? All of the logs for all of these servers should be piped to a central logging server (with a primary and a backup), where they are likely stored in a NoSQL database for retrieval. The logging servers are a shared resource, so all logs for all customers are in the same NoSQL database. With a standard error message naming convention and standard logging messages, the logging application can serve up client-specific logs to the end user from the shared logging environment.
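A common-format logging utility like the one described above could be sketched as follows. This is a minimal illustration, not the implementation from the post or the book: the field names (`tenant`, `event`, `detail`) are my own assumptions, while the severity codes follow RFC 5424 and the status codes are standard HTTP.

```python
import json
import time

# RFC 5424 numeric severity levels
SEVERITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}

def log_message(tenant_id, severity, http_status, event, detail):
    """Build one log entry in a common, searchable JSON format.

    The tenant field lets the shared logging store serve per-client
    views even though all customers' logs live in one database.
    Field names here are illustrative assumptions, not a published schema.
    """
    entry = {
        "ts": time.time(),
        "tenant": tenant_id,
        "severity": SEVERITY[severity],   # RFC 5424 numeric level
        "http_status": http_status,       # standard HTTP status code
        "event": event,                   # controlled vocabulary, e.g. "DB_CONN_FAIL"
        "detail": detail,
    }
    return json.dumps(entry, sort_keys=True)
```

Because every server emits the same fields with the same vocabulary, searches across thousands of machines return consistent, comparable results.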
In order to accomplish a strategy like this, the application team must implement the standards and the design specifications in the strategy. It is important that this is put in place early in the development of the product(s) being built because retrofitting it in when there are hundreds or thousands of servers is very high risk.
Proactive vs. Reactive
In the cloud, a proactive approach to monitoring is a must. Often when people think of monitoring in the cloud, they think about implementing Nagios to monitor the disk, CPU, and memory of virtual machines. That is definitely needed, but it does not go far enough. Everything that runs should be monitored and baselined. Every API should be tracked, every piece of infrastructure should be monitored, user metrics should be closely watched, and suspicious activity should be identified as it happens.
For APIs, proactive monitoring tracks the number of calls per minute, hour, day, week, etc., and the average performance for each time period. Any time the system detects an outlier, whether in the number of calls or their performance, someone should be alerted. For example, if an API is usually called 1,000 times a minute but was called only 5 times in the last minute, there is a good chance that either something is wrong with the API or something upstream is preventing code from reaching it. On the flip side, if all of a sudden there are 10,000 calls to that API, someone should also be alerted. There may be a burst in traffic or a surge in new users, which is great, but somebody should check that the rest of the system is scaling gracefully. On the other hand, maybe some kind of malicious attack is underway, and somebody should check the logs for suspicious behavior. In my book I talk at length about proactive monitoring.
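The call-rate outlier detection described above could be sketched like this. The window size and the low/high thresholds are illustrative assumptions; a production system would use seasonality-aware baselines rather than a simple rolling average.

```python
from collections import deque

class CallRateMonitor:
    """Flag per-minute API call counts that fall outside a baseline band.

    Thresholds are assumptions for illustration: alert when traffic
    drops below 10% of the rolling average or exceeds 5x the average.
    """
    def __init__(self, window=60, low_factor=0.1, high_factor=5.0):
        self.counts = deque(maxlen=window)   # recent per-minute call counts
        self.low_factor = low_factor
        self.high_factor = high_factor

    def observe(self, calls_this_minute):
        alert = None
        if len(self.counts) == self.counts.maxlen:  # baseline established
            baseline = sum(self.counts) / len(self.counts)
            if calls_this_minute < baseline * self.low_factor:
                alert = "LOW"    # possible API failure or upstream outage
            elif calls_this_minute > baseline * self.high_factor:
                alert = "HIGH"   # traffic burst, new users -- or an attack
        self.counts.append(calls_this_minute)
        return alert
```

In the scenario from the text, a baseline of roughly 1,000 calls per minute would trigger a "LOW" alert at 5 calls and a "HIGH" alert at 10,000.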
Another example of proactive monitoring is the tracking of KPIs. Each business has a set of KPIs that are relevant to their business model. Here are some examples:
- Cost per customer – total cloud costs divided by the # of customers on the platform
- Revenue per customer
- Bounce Rate
- Avg transactions processed per day
KPIs are baselined and tracked over time. KPIs trending in the wrong direction may reveal a product strategy issue or a system issue. KPIs related to user activity should be monitored, logged, and associated with each deployment. A drop in user activity may be tied to buggy code, performance issues caused by the latest deployment, or a negative user reaction to the product change. By proactively monitoring these KPIs, a company can identify issues before they become catastrophic and before users start leaving in droves.
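KPI baselining and drift detection could be sketched as below. The 20% tolerance is an illustrative assumption; each business would tune it per KPI.

```python
def cost_per_customer(total_cloud_cost, num_customers):
    """Cost per customer KPI: total cloud spend divided by customer count."""
    return total_cloud_cost / num_customers

def kpi_drift_alert(history, tolerance=0.2):
    """Return True when the latest KPI value drifts more than `tolerance`
    (20% here, an assumed value) from the average of all prior periods.

    `history` is a list of KPI readings, oldest first; tying each reading
    to a deployment makes it easy to correlate drops with releases.
    """
    baseline = sum(history[:-1]) / len(history[:-1])
    drift = (history[-1] - baseline) / baseline
    return abs(drift) > tolerance
```

For example, a cloud bill of $50,000 across 1,000 customers gives a cost per customer of $50, and a user-activity series of 100, 102, 98, 100 followed by 150 would trip the drift alert.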
Monitoring and cloud service models
For companies building solutions on top of IaaS, a great deal of development and/or integration with various monitoring tools is required. It is a core responsibility of the DevOps team to put a consolidated monitoring view in place to make it easy for operations to monitor the system. If companies are building on top of PaaS solutions, the DevOps team only needs to automate monitoring for the application layer; the operations team watches the application monitors and checks on the system health of the PaaS. If a SaaS solution is used, operations should at least be pinging the SaaS URLs, and if the SaaS solution is mission critical, they should also monitor the system health information provided by the SaaS vendor (if that functionality exists).
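The minimal SaaS check described above, pinging the vendor's URLs, could look like this sketch. The function and parameter names are my own, and the `fetch` hook is included only so the check can be exercised without a live endpoint; a real check would also track latency and page someone on repeated failures.

```python
import urllib.request

def check_saas_endpoints(urls, fetch=None, timeout=5):
    """Record the HTTP status (or failure reason) for each SaaS URL.

    `fetch` may be injected for testing; by default it issues a real
    HTTP GET. All names here are illustrative assumptions.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)          # e.g. 200
        except Exception as exc:
            results[url] = f"DOWN: {exc}"      # unreachable or timed out
    return results
```

Run on a schedule, even this crude ping gives operations an early signal that a mission-critical SaaS dependency is unavailable.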
Monitoring is a critical component of any cloud-based system. A monitoring strategy should be put in place early on and continuously improved over time. No single monitoring tool will ever meet all the needs of a cloud solution. Expect to leverage a combination of SaaS and open-source tools, and possibly even some home-grown solutions, to cover the full needs of the platform. Managing a cloud solution without a monitoring strategy is like driving down the highway at night with the lights off. You might make it home safely, but you might not!