Simple Technology Solutions

Improving Cloud Operations with Proactive Resource Management

Description

STS improved cloud operations through proactive resource management: automated instance monitoring and reporting kept accounts ahead of their AWS EC2 instance limits and eliminated the resulting service interruptions.

Categories

Cloud Enterprise Architecture DevSecOps Cloud Governance & Compliance Serverless      

BACKGROUND

A large U.S. Government Agency was undergoing an enterprise-wide migration to the cloud on a very tight schedule. Every day, new teams were added to a set of shared AWS accounts. Simple Technology Solutions (STS) Cloud Engineers trained Agency Application Development teams in modern cloud deployment and DevOps practices. The training enabled the teams to provision their own infrastructure on demand for the first time, which led to an explosion of EC2 instances (over 2,000 virtual machines across five accounts). AWS accounts began reaching their limits on the number of instances that could be provisioned, causing service interruptions across the business units. Once an account reached its maximum instance limit, the Cloud Engineers could not provision any new EC2 instances in it, and Product Owners had to open service tickets to request a limit increase. Inevitably, this extra step interrupted services until the limit was raised and additional instances could be created.

ANALYSIS

STS Engineers determined that the Product Owners needed a proactive solution to prevent service interruptions caused by reaching maximum limits. Because development instances were automatically turned off after hours, the engineers decided that a report should be generated daily during peak activity, when instance counts were highest. The report highlighted accounts nearing their maximum limits, alerting Product Owners to raise limits before work was interrupted.
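The check behind such a report can be sketched in a few lines of Python. This is a minimal illustration, not the STS implementation: the `AccountUsage` type, the account names, the sample numbers, and the 80% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AccountUsage:
    account: str   # hypothetical account label
    running: int   # instances running when the report is generated
    limit: int     # current EC2 instance limit for the account

    @property
    def utilization(self) -> float:
        # Treat a zero limit as fully used so it is always flagged.
        return self.running / self.limit if self.limit else 1.0


def accounts_near_limit(usages, threshold=0.8):
    """Return accounts at or above the threshold, highest utilization first."""
    flagged = [u for u in usages if u.utilization >= threshold]
    return sorted(flagged, key=lambda u: u.utilization, reverse=True)


# Illustrative numbers only.
usages = [
    AccountUsage("dev", running=410, limit=500),
    AccountUsage("test", running=120, limit=400),
    AccountUsage("prod", running=290, limit=300),
]
for u in accounts_near_limit(usages):
    print(f"{u.account}: {u.running}/{u.limit} ({u.utilization:.0%})")
```

Running the check at peak activity matters because the same accounts would look far from their limits after development instances are shut down for the night.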

SOLUTION

STS identified a serverless Lambda function as the best tool to minimize overhead and deliver the capability as quickly as possible. STS Engineers designed a Lambda function to collect EC2 usage information from all agency accounts. The number of active instances in each account was compared to the maximum allowable limit, which was stored in a DynamoDB table that could be easily updated when limit increases were completed. STS explored Amazon Simple Email Service (SES) as an option to deliver the report, but FedRAMP/GovCloud restrictions made it non-viable. STS also explored Amazon Simple Notification Service (SNS), but its notifications could not show enough detail for Product Owners to read and act on. Ultimately, Lambda's extensibility allowed STS to use Python library functions to send a well-formatted HTML email that told Product Owners which accounts needed an EC2 instance limit increase.
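The two core steps, counting running instances and formatting the HTML report, might look roughly like the sketch below. The source does not say which Python libraries STS used, so the standard-library `email` module is an assumption here, as are the function names, the subject line, and the example addresses. The `ec2_client` argument is expected to be a boto3 EC2 client created for one account, and every limit is assumed to be greater than zero.

```python
from email.message import EmailMessage
from html import escape


def count_running_instances(ec2_client):
    """Count running EC2 instances visible to one boto3 EC2 client."""
    paginator = ec2_client.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return sum(
        len(reservation["Instances"])
        for page in pages
        for reservation in page["Reservations"]
    )


def build_report_email(rows, sender, recipients):
    """Build an HTML report from (account, running, limit) tuples.

    Assumes every limit is greater than zero.
    """
    body_rows = "".join(
        f"<tr><td>{escape(str(account))}</td><td>{running}</td>"
        f"<td>{limit}</td><td>{running / limit:.0%}</td></tr>"
        for account, running, limit in rows
    )
    html = (
        "<html><body><h3>EC2 instance limit report</h3>"
        "<table border='1'><tr><th>Account</th><th>Running</th>"
        "<th>Limit</th><th>Used</th></tr>"
        f"{body_rows}</table></body></html>"
    )
    msg = EmailMessage()
    msg["Subject"] = "Daily EC2 instance limit report"  # hypothetical subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # Plain-text fallback for clients that do not render HTML.
    msg.set_content("This report requires an HTML-capable mail client.")
    msg.add_alternative(html, subtype="html")
    return msg
```

In a Lambda handler, the limits would be read from the DynamoDB table and the finished message handed to a mail relay, for example with `smtplib.SMTP(host).send_message(msg)`, where `host` is whatever SMTP endpoint the environment permits. Passing the client in as a parameter keeps the counting logic testable without AWS credentials.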

BENEFIT

The limit checker report provided immediate value to the customer. In the month before STS delivered the solution, there were three incidents in which deployments were halted for several hours while a ticket was submitted to AWS to increase limits. After the solution was implemented, these service interruptions were eliminated, resulting in faster deployments and higher system uptime.

DIAGRAM

STS leveraged AWS native services and creative thinking to overcome the common challenge of account provisioning and overage when migrating to the cloud.
