AWS EC2 Instance Recovery: how to deal with StatusCheckFailed_System and automate the creation of the alarm for your instances

If you are an experienced AWS user, you know that AWS encourages treating EC2 instances as cattle, not pets ( https://devops.stackexchange.com/questions/653/what-is-the-definition-of-cattle-not-pets ). Instances can spin up, live and terminate in a totally orchestrated way, and this is everyday reality in the DevOps world 🙂

However, I’m sure that even the most “I-CODE-ALL-MY-INFRASTRUCTURE” person has some pet instance, somewhere 🙂

You should know that AWS makes weaker guarantees for single EC2 instances (keeping one specific instance alive is not their goal): at the time of writing, the Service Level Agreement ( https://aws.amazon.com/it/compute/sla/ ) only guarantees that an individual instance is up, running and available 90% of the time in every hour. Many of you have probably never experienced that “10%” downtime…but since it’s written in the contract, what happens when you hit that kind of issue?

Well, EC2 instances run on shared hardware, and hardware can break. When it breaks, the instance can be “retired” or “terminated”, and you have to do some manual work to get it up and running again: start a new instance, reattach the EBS volumes, reassociate the Elastic IP…and you’ll lose all the metadata (tags, for example), the Instance ID, etc.

How to deal with this?

AWS says: create a CloudWatch alarm on the StatusCheckFailed_System metric and have it trigger a Recover action (and you should also send an alert e-mail when it happens). Easy: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
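To make the idea concrete, here is a minimal sketch of such an alarm, written as a plain TypeScript function that returns a CloudFormation-style resource object. The function name, the instance ID, the SNS topic ARN and the period/evaluation values are my own illustrative assumptions, not taken from the AWS guide or from my template.

```typescript
// Shape of a CloudFormation resource, reduced to what this sketch needs.
interface CfnResource {
  Type: string;
  Properties: Record<string, unknown>;
}

// Hypothetical helper: builds an AWS::CloudWatch::Alarm that triggers the
// EC2 recover action and notifies an SNS topic.
// Period and EvaluationPeriods are illustrative values, tune them to taste.
function buildRecoveryAlarm(
  instanceId: string,
  region: string,
  snsTopicArn: string
): CfnResource {
  return {
    Type: "AWS::CloudWatch::Alarm",
    Properties: {
      AlarmDescription: `Recover ${instanceId} when the system status check fails`,
      Namespace: "AWS/EC2",
      MetricName: "StatusCheckFailed_System",
      Dimensions: [{ Name: "InstanceId", Value: instanceId }],
      Statistic: "Minimum",
      Period: 60,
      EvaluationPeriods: 5,
      ComparisonOperator: "GreaterThanThreshold",
      Threshold: 0,
      // First action recovers the instance, second one sends the alert.
      AlarmActions: [`arn:aws:automate:${region}:ec2:recover`, snsTopicArn],
    },
  };
}

// Example with made-up identifiers.
const alarm = buildRecoveryAlarm(
  "i-0123456789abcdef0",
  "eu-west-1",
  "arn:aws:sns:eu-west-1:123456789012:ops-alerts"
);
console.log(JSON.stringify(alarm, null, 2));
```

The important parts are the metric name, the InstanceId dimension and the `arn:aws:automate:<region>:ec2:recover` action; everything else is tuning.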

However, let’s say that you applied the previous guide to an instance, but now you have 40 or 400 other instances and you want to do this job for *all of them*…maybe automatically.

So this is what I did: as a first step, I created a CloudFormation template: https://gist.github.com/lombax85/565d4199317627aacb64eb140d597c3a

With this template you can create a CloudFormation Stack: simply insert the EC2 ARN and the SNS Topic to alert, and it will create the alarm *automagically* 🙂

But this still needs to be done for *each* of your instances. Unfortunately, CloudFormation doesn’t have the concept of “loops”, so you can’t say “for each instance, apply the alarm”. So I decided to switch to CDK ( https://aws.amazon.com/en/cdk/ ).

CDK, explained in “simple words”, is an automation tool that can be “programmed” in your favorite language (in my case I used TypeScript) and that “compiles” your code into a CloudFormation template. So you can, for example, write TypeScript code that loops through all your EC2 instances and applies the alarm to them (for example, based on a tag). I attach an example: https://gist.github.com/lombax85/d385da90fd6efe7f27d086fd39d043b6

With this, and a simple “cdk deploy” command, you can directly create a single CloudFormation Stack that adds all the needed status check alarms.
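The “loop that compiles to a template” idea can be sketched without CDK at all. The snippet below is not my CDK code and does not use the CDK API: it is a plain TypeScript illustration, with hypothetical names (`InstanceInfo`, `buildTemplate`, the `AutoRecover` tag), that emits one alarm resource per tagged instance into a single template object.

```typescript
// Hypothetical, simplified description of an instance.
interface InstanceInfo {
  instanceId: string;
  tags: Record<string, string>;
}

interface Template {
  Resources: Record<string, { Type: string; Properties: Record<string, unknown> }>;
}

// Emits one recovery alarm per instance that carries the given tag.
function buildTemplate(
  instances: InstanceInfo[],
  tagKey: string,
  tagValue: string,
  region: string
): Template {
  const template: Template = { Resources: {} };
  for (const instance of instances) {
    if (instance.tags[tagKey] !== tagValue) continue;
    // CloudFormation logical IDs must be alphanumeric, so drop the hyphen.
    const logicalId = "RecoveryAlarm" + instance.instanceId.replace(/[^A-Za-z0-9]/g, "");
    template.Resources[logicalId] = {
      Type: "AWS::CloudWatch::Alarm",
      Properties: {
        Namespace: "AWS/EC2",
        MetricName: "StatusCheckFailed_System",
        Dimensions: [{ Name: "InstanceId", Value: instance.instanceId }],
        Statistic: "Minimum",
        Period: 60, // illustrative value
        EvaluationPeriods: 5, // illustrative value
        ComparisonOperator: "GreaterThanThreshold",
        Threshold: 0,
        AlarmActions: [`arn:aws:automate:${region}:ec2:recover`],
      },
    };
  }
  return template;
}

// Example with made-up instances: only the tagged ones get an alarm.
const template = buildTemplate(
  [
    { instanceId: "i-0aaa", tags: { AutoRecover: "true" } },
    { instanceId: "i-0bbb", tags: {} },
    { instanceId: "i-0ccc", tags: { AutoRecover: "true" } },
  ],
  "AutoRecover",
  "true",
  "eu-west-1"
);
console.log(Object.keys(template.Resources));
```

In a real CDK app the loop body would create an alarm construct inside the stack instead of filling a plain object, but the structure is the same: one loop, one stack, many alarms.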

Some warnings: CDK doesn’t provide a way to list all your instances, so I integrated the AWS SDK and used it to retrieve the instances via the describeInstances API call. This is not considered a best practice, since CDK relies on “Custom Resources” ( https://cdk-advanced.workshop.aws/sample/source-construct/custom-resource.html ) for this kind of thing (for example, a Custom Resource can create a Lambda function that is called on update of the stack and returns the list of your EC2 instances).
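For the listing part, the only non-obvious bit is walking the describeInstances response, which nests instances inside reservations. Here is a hedged sketch of that step: the interfaces below only mirror the fields I rely on (Reservations, Instances, InstanceId, Tags), the helper name and the `AutoRecover` tag are hypothetical, and the actual SDK call, pagination and credentials handling are left out.

```typescript
// Minimal subset of the describeInstances response shape used by this sketch.
interface Tag { Key?: string; Value?: string }
interface Instance { InstanceId?: string; Tags?: Tag[] }
interface Reservation { Instances?: Instance[] }
interface DescribeInstancesResponse { Reservations?: Reservation[] }

// Hypothetical helper: collects the IDs of all instances carrying a given tag.
// It accepts several response pages, assuming the caller has already fetched them.
function instanceIdsWithTag(
  pages: DescribeInstancesResponse[],
  key: string,
  value: string
): string[] {
  const ids: string[] = [];
  for (const page of pages) {
    for (const reservation of page.Reservations ?? []) {
      for (const instance of reservation.Instances ?? []) {
        const matches = (instance.Tags ?? []).some(
          (tag) => tag.Key === key && tag.Value === value
        );
        if (matches && instance.InstanceId) ids.push(instance.InstanceId);
      }
    }
  }
  return ids;
}

// Example with a made-up response page.
const ids = instanceIdsWithTag(
  [
    {
      Reservations: [
        { Instances: [{ InstanceId: "i-0aaa", Tags: [{ Key: "AutoRecover", Value: "true" }] }] },
        { Instances: [{ InstanceId: "i-0bbb" }] },
      ],
    },
  ],
  "AutoRecover",
  "true"
);
console.log(ids);
```

Keeping this as a pure function also makes it easy to move later into a Lambda behind a Custom Resource, if you want to follow the recommended approach.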

But…it works, and it was enough for me and for sharing here 🙂 Feel free to comment if you have any questions. For the moment I can’t publish the complete project repository (it contains some confidential information and needs some cleanup), so if any step is not clear (for example, where you need to inject the AWS SDK credentials), let me know. Bye!

LombaX's Web Site
