My team and I have been architecting an enterprise platform for over a year now. Our platform is an aggregation model that connects multiple channel partners to multiple retail partners and is 100% cloud-based. I wish I could share some of our architectural drawings, but since I can’t, I have created some very generalized diagrams that let me discuss architecting for the cloud without giving away our secret sauce.
When building an enterprise platform that connects many companies together, we must be able to guarantee high availability and scalability while protecting against data loss. In this post I will discuss one approach to meeting these requirements. It is important to note that for this post I am focusing on a 100% cloud solution built on Amazon’s AWS platform, an IaaS (infrastructure as a service) solution.
The following image shows a logical representation of our approach. Keep in mind that to physically implement this approach, there is a significant amount of technology that is required which is not represented in this diagram.
[Image: logical architecture diagram]
You can see at the top of the diagram that both web users and external systems can send requests to the platform through elastic IPs. What is cool about elastic IPs is that you can publish an IP that your channel partners’ systems know, while internally your IPs change regularly as you scale servers up and down or perform maintenance. This allows you to make changes to internal IPs and instances without requiring any changes on your channel partners’ systems.
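To make the idea concrete, here is a toy sketch (not the AWS API) of what an elastic IP buys you: the partner-facing address stays fixed while the internal target behind it is swapped freely. All names and addresses below are made up for illustration.

```python
# Toy illustration (not the AWS API): an elastic IP is a stable public
# address that can be remapped to different internal instances at will.
class ElasticIP:
    def __init__(self, public_ip, instance):
        self.public_ip = public_ip      # the address channel partners know
        self.instance = instance        # current internal target

    def remap(self, new_instance):
        """Point the same public IP at a replacement instance."""
        self.instance = new_instance

    def route(self, request):
        """Partners always hit the public IP; we forward internally."""
        return f"{self.public_ip} -> {self.instance}: {request}"

eip = ElasticIP("203.0.113.10", "i-old")
eip.remap("i-new")   # swap in a fresh instance during maintenance
# the partner-facing address never changed
```

The partner configures 203.0.113.10 once; every scaling event or re-imaging happens behind that address.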
CloudWatch – Auto-scaling layer
CloudWatch is a relatively new web service from Amazon that monitors your resources and, paired with auto-scaling, lets you set thresholds so the platform can automatically scale up and down as needed. This scaling applies both to your load balancers and to your EC2 instances. For each farm of EC2 instances, you can set the minimum and maximum number of instances you want under automatic control and define thresholds, based on a variety of metrics, that trigger the scaling events.
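The threshold logic described above can be sketched in a few lines. This is a simplified model, not the CloudWatch API; the CPU metric and the specific threshold values are invented for illustration.

```python
# Hedged sketch of threshold-driven auto-scaling: compare a metric
# against upper/lower thresholds and adjust the farm size, clamped to
# the configured [minimum, maximum] range. Numbers are illustrative.
def scaling_decision(avg_cpu, current, minimum, maximum,
                     scale_up_at=70.0, scale_down_at=25.0):
    """Return the new instance count for a farm of EC2 instances."""
    if avg_cpu > scale_up_at and current < maximum:
        return current + 1        # upper threshold breached: add an instance
    if avg_cpu < scale_down_at and current > minimum:
        return current - 1        # sustained low load: remove an instance
    return current                # within thresholds: hold steady

scaling_decision(85.0, 2, 1, 4)   # high load grows the farm to 3
scaling_decision(85.0, 4, 1, 4)   # but never past the configured maximum
```

The real service evaluates alarms over time windows rather than single samples, but the min/max clamp and the two-sided thresholds are the essence of what you configure.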
Elastic Load Balancing – Load balancing layer
Elastic Load Balancing is a software-based load-balancing solution, another web service Amazon provides to simplify building and managing infrastructure. Where CloudWatch determines how many instances your platform needs at any given time, ELB is responsible for distributing traffic across the resources that are up and running. ELB can detect when an EC2 instance is starting to degrade and reroute traffic to better-performing instances, either within a zone or across zones. This is extremely important because Amazon has had occasional outages within a single zone but has never had all zones down at the same time. In fact, since the zones are located in different physical locations, it is extremely unlikely, barring some major catastrophic event that hits multiple areas of the country, that all zones will be down at once. That is why it is a best practice to distribute EC2 instances across multiple zones and use ELBs to reroute traffic in the case of a zone outage. Think of each zone as a virtual data center, without the cost of real estate, hardware, utilities, assets, and a large staff to manage it.
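The cross-zone failover behavior can be sketched as follows. This is a simplified model of the routing decision, not ELB itself; the zone names and instance IDs are hypothetical.

```python
# Sketch of zone-aware failover routing: send traffic only to healthy
# instances, preferring the local zone but spilling into other zones
# when the whole zone is unhealthy. Zone/instance names are illustrative.
def pick_targets(farms, preferred_zone):
    """farms: {zone: [(instance_id, healthy), ...]} -> routable targets."""
    local = [i for i, ok in farms.get(preferred_zone, []) if ok]
    if local:
        return local
    # zone outage: fail over to healthy instances in the other zones
    return [i for zone, hosts in farms.items()
            if zone != preferred_zone
            for i, ok in hosts if ok]

farms = {"us-east-1a": [("i-1", False), ("i-2", False)],   # zone down
         "us-east-1b": [("i-3", True), ("i-4", True)]}
```

With "us-east-1a" entirely unhealthy, traffic preferring that zone flows to the healthy instances in "us-east-1b" instead, which is exactly the multi-zone insurance the paragraph above argues for.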
Server Farms – Web, Application, and Database layers
It is not mandatory to separate your web, application, and database logic onto separate EC2 instances. It really depends on budget, volume, required performance, memory requirements, and many other factors. We chose to break these out into separate layers because of the enormous volumes we anticipate, the strain we put on system memory, and some location-specific data requirements. Within each layer (if you can afford it), I recommend having redundant servers in every zone; that way, even if an entire zone is down, there is still a fail-over option in the remaining zones. Note the difference between an Amazon zone and an Amazon region: the regions are US-East, US-West, and Europe, and within each region there are multiple zones. There is no additional cost to route traffic across multiple zones within a region, but there is a cost for routing traffic across regions.
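The layered, redundant layout described above can be expressed as a simple topology check. The layer names, zones, and instance labels here are hypothetical, and the "two per zone" rule is just one reasonable redundancy policy.

```python
# Illustrative topology for the layered, multi-zone layout: every layer
# keeps at least two instances in every zone, so a single instance (or
# an entire zone) can fail without taking the layer down.
topology = {
    "web": {"us-east-1a": ["web-1", "web-2"],
            "us-east-1b": ["web-3", "web-4"]},
    "app": {"us-east-1a": ["app-1", "app-2"],
            "us-east-1b": ["app-3", "app-4"]},
    "db":  {"us-east-1a": ["db-1", "db-2"],
            "us-east-1b": ["db-3", "db-4"]},
}

def redundant(topology, per_zone=2):
    """True if every layer has per-zone redundancy in every zone."""
    return all(len(instances) >= per_zone
               for zones in topology.values()
               for instances in zones.values())
```

A check like this, run against whatever inventory you keep of your farms, catches the case where scaling down has quietly left a layer with a single instance in some zone.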
Backup/Recovery – Leveraging S3
S3 is Amazon’s Simple Storage Service, which automatically distributes data across multiple zones within a region for you. A simple S3 web service call performs real-time distribution of data across multiple zones, eliminating the need for the clunky, expensive, and highly ineffective tape backups we have been doing for years and years. S3 is a critical piece of the architecture. Not only is it good for backing up content, but you can deploy a sound backup/restore strategy on it, too. The snapshot feature allows for frequent snapshots of database slaves as well as Apache and web logs. If there is ever a need to recover from a glitch, we can easily pull the last good snapshot off of S3 and restore it.
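The "pull the last good snapshot" step boils down to a selection over snapshot metadata. This sketch assumes you record a timestamp and a verification flag per snapshot; those field names, and the snapshot keys, are hypothetical.

```python
# Sketch of "restore from the last good snapshot": given snapshot
# metadata, pick the newest snapshot that passed verification.
# Field names and keys are made up for illustration.
def last_good_snapshot(snapshots):
    """snapshots: [{'taken_at': int, 'verified': bool, 'key': str}, ...]"""
    good = [s for s in snapshots if s["verified"]]
    if not good:
        return None               # nothing safe to restore from
    return max(good, key=lambda s: s["taken_at"])["key"]

snapshots = [
    {"taken_at": 100, "verified": True,  "key": "db-backup-100"},
    {"taken_at": 200, "verified": False, "key": "db-backup-200"},  # corrupt
    {"taken_at": 150, "verified": True,  "key": "db-backup-150"},
]
```

Note that the newest snapshot is skipped because it failed verification; the restore falls back to the most recent one known to be good, which is the property that makes frequent snapshots a real recovery strategy rather than just an archive.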
I covered these points at a high level. Logically this appears very simple, but there is a lot of work required to build an architecture like this. Even so, it can be built in a fraction of the time, and for a fraction of the cost, that an on-premise version would take. The combination of S3’s distributed storage with a multi-zone approach to deploying EC2 instances enables us to build a highly reliable and scalable platform. In addition, this concept of virtual data centers allows us to give our customers and channel partners confidence that we can execute an effective business continuity and disaster recovery plan.
From time to time I will write additional posts about building cloud architectures on Amazon. Look for my next post to discuss data services and hybrid clouds. As always, questions and feedback are welcome!