The SLA for an application or service built in the cloud is a combination of:
- SLAs from the cloud vendors (e.g., AWS, Azure, Heroku)
- SLAs from the apps built on top of the cloud vendors
Cloud Vendor SLAs and compliance
I analyzed the SLAs of some of the top cloud service providers and was surprised by what I found. For most well-established IaaS and PaaS providers, SLAs ranged from 99.9% to 100%. One major PaaS solution, Heroku, does not provide an SLA, which is astounding to me. When I tried to find SLAs from some of the major SaaS solutions like Salesforce.com, Concur, and others, I could not find a published SLA. I did see in a forum that Salesforce.com has provided an SLA to certain customers that requested it.
The only real value these SLAs have for the customer is that the customer may get a refund of some sort if a significant outage occurs. Regardless of the amount of that refund, it does nothing to repair the collateral damage that the outage may have caused to the business and its customers. This is precisely why there are so many articles declaring that cloud SLAs are useless. What I find interesting, though, is that many pundits settle on private clouds because they don’t want to rely on public cloud SLAs. Essentially, these people are saying they believe they can manage uptime better than Amazon, Rackspace, Terremark, and others whose core competency is running datacenters. In exchange for that belief and for the desire for more control, they give up many of the benefits of cloud computing, such as rapid elasticity, speed to market, and reduced capital expenses. Certain companies have a business case or a set of priorities that may justify it, but I believe many others simply default to private clouds out of personal preference rather than what’s best for the business.
What is important to understand about SLAs is that these service levels only represent the uptime of the infrastructure (for IaaS) or the platform (for PaaS); it is up to the engineering teams to build highly available applications on top of it. In the case of AWS, if a company only leverages one zone, the best it can hope for is a 99.95% SLA because that is all AWS offers. However, companies can commit to much higher SLAs if they architect for cross-zone or cross-region redundancy.
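To make the arithmetic behind that claim concrete, here is a rough sketch of how availability combines. It assumes failures are independent, which real zones and regions only approximate, and the function names and the 30-day month are my own choices for illustration. Treat the output as a theoretical ceiling, not as anything a vendor will put in a contract.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month (43,200 minutes)


def serial_availability(*availabilities):
    """Availability when every component must be up (e.g. app on top of IaaS)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availability, copies):
    """Availability when at least one of `copies` independent replicas must be up."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_minutes_per_month(availability):
    """Allowed downtime per month for a given availability (as a fraction)."""
    return (1.0 - availability) * MINUTES_PER_MONTH


single_zone = 0.9995  # the 99.95% figure mentioned above
two_zones = redundant_availability(single_zone, 2)

print(f"one zone:  {downtime_minutes_per_month(single_zone):.1f} min/month")
print(f"two zones: {downtime_minutes_per_month(two_zones):.3f} min/month")
```

A single zone at 99.95% allows roughly 21.6 minutes of downtime in a 30-day month. Stacking your own application on top of it (the serial case) can only lower that number, which is why redundancy is the lever that matters.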
What I find interesting is that many of the major SaaS players get away without publishing SLAs. My theory is that a lot of them were early pioneers in this space and built up a large install base before many customers started to demand high SLAs in the RFP process. I can tell you that at the two startups I have worked at, we could not get a single contract signed if we did not commit to an SLA. Companies building SaaS solutions today should expect customers to require an SLA of at least 99.9%. The more mission critical the service is, the higher the customer's expectation of that number will be.
All of the major cloud vendors, whether they are IaaS, PaaS, or SaaS, have made it a priority to become compliant with most of the major regulations such as SSAE16, SAS70, HIPAA, ISO, PCI (when applicable) and others. Most vendors would never make it through a vendor evaluation process if they did not have these certifications posted on their websites.
Application SLAs and compliance
The following list shows the types of SLAs I have seen in contracts with both startups I have worked for and the established companies I have consulted with:
- Overall Uptime of Application/Service
- Page load times
- Transaction processing times
- API response times
- Reporting response times
- Incident resolution times
- Incident notification times
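Response-time SLAs like the ones above are usually written against a percentile rather than an average (for example, "90% of API calls complete within 200 ms"). The following is a minimal sketch of that check using the nearest-rank method. The function names, the sample data, and the 200 ms target are hypothetical and not taken from any particular contract.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_response_time_sla(samples_ms, pct, target_ms):
    """True if the given percentile of response times is within the target."""
    return percentile(samples_ms, pct) <= target_ms


# Hypothetical API response times in milliseconds, with one slow outlier.
api_times_ms = [120, 95, 110, 400, 130, 105, 98, 115, 125, 101]

print(meets_response_time_sla(api_times_ms, 90, 200))  # 90th percentile check
print(meets_response_time_sla(api_times_ms, 99, 200))  # 99th percentile check
```

The same sample set can pass at one percentile and fail at another, so the percentile needs to be spelled out in the contract along with the target.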
From a regulatory, security, and privacy perspective, the following list shows typical demands I have seen in contracts from customers of SaaS and PaaS solutions for B2B type services:
- Security and privacy safeguards
- Published incident response plan (incident retainer also requested on occasion)
- Web vulnerability scans and reports
- Published disaster recovery plans
- Safe harbor agreement
- Data ownership declarations
- Backup and recovery processes document
- Source code escrow
Customers expect monthly reporting of these SLAs and often request the right to perform their own annual audit. Knowing this, a company should create an SLA and compliance strategy. The implementation of this plan is a shared responsibility between business and IT. The business has to document and implement as many controls and processes as IT does. A company should not look at SLAs and compliance as an IT thing. A great place to start to understand what is required from a security and privacy perspective for both the business and IT is to download the Cloud Controls Matrix from the Cloud Security Alliance.
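As an illustration of what the uptime line of such a monthly report boils down to, here is a small sketch that turns a list of outage windows into an uptime percentage for a calendar month. The function name and the incident data are hypothetical. It assumes outages do not overlap (overlapping windows would be double counted), and it ignores contract-specific rules such as excluded maintenance windows.

```python
from datetime import datetime


def monthly_uptime_percent(year, month, outages):
    """Uptime % for a calendar month, given (start, end) outage datetimes.

    Assumption: outage windows do not overlap each other.
    """
    month_start = datetime(year, month, 1)
    if month == 12:
        next_month = datetime(year + 1, 1, 1)
    else:
        next_month = datetime(year, month + 1, 1)
    total_seconds = (next_month - month_start).total_seconds()

    down_seconds = 0.0
    for start, end in outages:
        # Count only the part of the outage that falls inside this month.
        clipped_start = max(start, month_start)
        clipped_end = min(end, next_month)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()

    return 100.0 * (total_seconds - down_seconds) / total_seconds


# Hypothetical incident log: one outage of 43 minutes 12 seconds in June.
june_outages = [(datetime(2013, 6, 10, 2, 0, 0), datetime(2013, 6, 10, 2, 43, 12))]
print(f"{monthly_uptime_percent(2013, 6, june_outages):.3f}%")
```

In a 30-day month, 43.2 minutes of downtime is exactly the budget of a 99.9% SLA, which shows how little room that number leaves.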
For the IT tasks required, a company should create a separate work stream with its own roadmap. There is simply too much work involved to build all of this up front. In the name of MVP (minimum viable product), the product team must decide which user stories are the highest priority and deliver them incrementally over time. For example, if it is January and the expectation is that the product will pass a specific audit by June, there is no need to cram all of the security and regulatory stories into the first few sprints and sacrifice core business functionality that may attract additional customers.
The IT user stories are a combination of application development and DevOps. The application developers have to build the user stories that come out of the design sessions on topics such as security, data considerations, logging, monitoring, etc. Web-based applications need to be secured against web vulnerabilities, and data must be encrypted for security and privacy. All of this overhead impacts performance, which needs to be addressed when building for scale.
The DevOps team plays a huge role in SLA and regulatory management. DevOps is typically responsible for the following tasks:
- Continuous integration and deployments
- Image & patch management
- System administration
- Scaling (usually accomplished via autoscaling)
- Quality testing
- Infrastructure cost optimization
All of these tasks are critical for fulfilling contractual obligations in SaaS and PaaS contracts. The DevOps team should work closely with the security team, the database team, and the team responsible for audits and controls. The security team plays a crucial role in helping define the security user stories and advising the team on best practices around encryption, key management, transport layer security, and more. DevOps works closely with the database team to ensure they have the monitoring and logging tools necessary to monitor the databases and proactively address issues before they become critical. For example, in one of my startups, the DBA used an open-source monitoring tool called Cacti and was able to see bottlenecks and correct them before the impact was widespread. The DevOps team is also responsible for maintaining the scripts that manage the database images, which include the controls around security and patching of the database software.
I rarely recommend vendors on my blog but from a monitoring and SLA management standpoint there is nothing better than New Relic. By leveraging this SaaS tool we are able to track SLAs automatically without writing a line of code. We can track by API, customer, product, or any custom group of assets. When a user submits from a web page we can trace the entire transaction including the time spent on the client, in transport over the network, in the API, in the database, and anywhere that the data travelled. In a highly distributed environment it is extremely hard to troubleshoot code without a tool like this.
When evaluating cloud services, pay close attention to the SLAs. All IaaS providers will publish their SLAs but not all PaaS and SaaS providers do. For the ones who don’t, research their uptime history. Make sure these cloud vendors have the proper regulatory credentials that meet the needs of your product. When building on top of these cloud services, understand their strengths and weaknesses and design around them. Create a roadmap for all of the user stories around SLAs, compliance, and security. Make sure that application development, DevOps, security, and the audit team are working hand in hand on this roadmap. And finally, make sure the business is working on the non-IT related tasks required to pass any necessary audits.