
Lessons learned from AWS Gameday

11 September, 2019

AWS User Group Toulouse

I participated in an AWS gameday, organized by the welcoming Toulouse AWS user group.

The experience was great, so much so that the workshop felt too short!

What is a gameday?

You should expect any system to fail!

And the team's response will be more efficient if people are familiar with the situation, which means being trained in advance.

A gameday is similar to a fire or emergency response drill. Grouped in teams, you are given a working AWS infrastructure... then the instructor breaks something, and you have to fix it.

Fire training

To make the situation more realistic and make the ticking clock tangible (like when angry users are flooding the call center), your team has to fix the problem... before the other teams.

Lessons learned

You will not know the details of the failing system

It always seems harder to fix a system you did not build (which is often the case in real incidents). You have no real idea of its internal complexity or potential points of failure. This is where a runbook or checklist can be very handy.

But let's not forget the basics: building a shared understanding of the problem at hand. Guillaume highlighted this important aspect in the comments of this post.

An architecture schema and a whiteboard go a long way in troubleshooting prod issues! (Guillaume Treins).

In retrospect, we would certainly have identified solutions in less time had we started by building a visual, shared representation of the architecture, the symptoms of the problem, and the things already attempted or verified.

Time pressure affects your response

The competition and the need for a quick response can make you forget key aspects of the response!

You are training to collaborate

The exercise is not only technical, but also a collaboration effort. Working with people you generally do not know, on a team assembled on the spot, with different skills and experiences, is always interesting. In retrospect, we should have spent a little more time identifying our respective skills and sharing out tasks when necessary.

Now beats perfect

What could seem like the best fix in absolute terms is not necessarily the most appropriate response from an end user's perspective. You should sometimes optimize for speed instead of immediately trying to find the root cause and write a long-term fix (i.e. it is better to re-establish a minimal or degraded service quickly, and use the extra time to investigate the details, as often practiced by defensive teams working in cyber security).

For example, in a scenario where we saw the application fail, our team's first reflex was to connect to the machine, check the logs, and correct the failing code. It worked, but it turned out that there was a quicker solution.

The best immediate response was to just kill the instances and let the auto scaling group (sitting behind the load balancer) recreate them from their AMI (i.e. the good old "reboot your computer" from support teams). This kind of quick win does not guarantee a result in all cases, but given the short time needed to try this workaround, it would have made sense to try it first.
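
As a rough sketch of that quick win (assuming the instances really are managed by an auto scaling group; the instance ID and region below are placeholders), it boils down to a single API call:

```python
# Let the auto scaling group replace a misbehaving instance with a fresh one
# launched from the configured AMI. Instance ID and region are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-3")

# Keep the desired capacity unchanged so a replacement instance is
# launched immediately after this one is terminated.
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId="i-0123456789abcdef0",
    ShouldDecrementDesiredCapacity=False,
)
```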

Bonus points for the meetup facilitator, who suggested saving the state of the instance before deleting or modifying it... so that you can investigate later for a long-term solution.
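
A minimal sketch of that advice, under the same assumptions (boto3, placeholder instance ID and names): register an AMI of the suspect instance before killing it, so its disks stay available for the post-incident investigation.

```python
# Capture the state of a suspect instance before terminating it,
# so the root cause can still be investigated later.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-3")

# Creating an AMI snapshots the attached EBS volumes; NoReboot=True keeps the
# instance running (at the cost of a possibly inconsistent filesystem snapshot).
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",    # placeholder instance ID
    Name="gameday-incident-2019-09-11",  # placeholder image name
    Description="State of the failing instance, kept for post-incident analysis",
    NoReboot=True,
)
print("Saved image:", image["ImageId"])
```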

You should also make sure your attempt to fix something is reversible and does not make the situation worse ;-)

The fix is always simple... afterwards

Even if your team could not find the solution in time, you are likely to say "I knew it!" or "why did we not test this?".

This is the whole point of the exercise: refresh your memory and keep typical scenarios in mind so you are faster the next time!

The need for a checklist or runbook

This exercise highlighted the need for a checklist or incident response runbook. It would be even more relevant in a real situation, where pressure from management or users will certainly affect the team's response negatively.

This document should be very accessible (paper can be good for this) and list the points to check in a very systematic way, relieving you from the mental load of "did we check this?". It should also contain the common error messages and behaviors (timeout, cannot connect) of network-related issues and their usual causes.

Even a very basic list is a good start.
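
For illustration, a few checks of this kind can even be scripted so they are trivial to run under pressure. The sketch below assumes boto3; the region, the target group ARN and the two checks shown are only examples:

```python
# Two very common checks: what does the load balancer think of its targets,
# and are the instances passing their EC2 status checks?
import boto3

REGION = "eu-west-3"                                   # placeholder region
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:..."  # placeholder ARN

elbv2 = boto3.client("elbv2", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

# 1. Which targets does the load balancer consider unhealthy, and why?
health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for target in health["TargetHealthDescriptions"]:
    state = target["TargetHealth"]
    print(target["Target"]["Id"], state["State"], state.get("Reason", ""))

# 2. Are the instances themselves passing their status checks?
statuses = ec2.describe_instance_status(IncludeAllInstances=True)
for status in statuses["InstanceStatuses"]:
    print(
        status["InstanceId"],
        status["InstanceState"]["Name"],
        status["InstanceStatus"]["Status"],  # OS-level instance check
        status["SystemStatus"]["Status"],    # underlying AWS host check
    )
```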

For very detailed runbooks, we can take inspiration from the ones used internally by gitlab.com, which also cover organization, communication channels, and escalation paths.

Conclusion

A gameday is a very good way to identify your weak spots, exchange with others about what works and what does not, and prepare to handle incidents better.

But as with any training, to be effective the exercise should be performed regularly, with outcomes reviewed for improvement... and your checklist kept up to date.