In software development, automated testing has long been accepted as best practice.
Test-driven development (TDD) and behavior-driven development (BDD) approaches often go as far as writing tests for functionality before it is implemented. Continuous integration and continuous delivery (CI/CD) pipelines are commonplace, automatically running tests against your codebase whenever a change is pushed to version control.
This kind of testing rigor, where your code is comprehensively and automatically exercised to ensure that it does what it's supposed to do, and that recent changes haven't broken existing functionality (often referred to as introducing a "regression"), accelerates the software development process, allowing teams to move faster, safely.
But what about your infrastructure? "Infrastructure as Code" (IaC) brings many of the benefits and tooling around modern software development to infrastructure—the servers, network components, backend storage, and so on that our application code runs on. As this approach to infrastructure becomes more widespread, naturally people want to apply the same kind of testing rigor to their infrastructure code as their application code.
In this article, I'll look at some of the challenges with automated testing for IaC, some of the tools you might use to execute both functional and conformance tests, and some examples of these tests in action.
Author's note: I'm mainly going to discuss Terraform as the code part of IaC, and AWS as the cloud provider. This is because those are very common choices for defining and running infrastructure. These ideas and principles are also applicable to other IaC technologies and providers, and nothing here is meant as a criticism of either Terraform or AWS. Infrastructure testing is hard, whether you're deploying Terraform code to AWS or Ansible code on Google Cloud. The same problems apply.
Although they're both "code", testing infrastructure code is different from testing application code, and each comes with its own challenges. Let's look at two of them: costs and time.
First, costs. Running a suite of IaC tests that actually builds and then tears down cloud infrastructure generally costs a lot more than running application tests, where you're just paying whatever it costs for the compute power to execute your test code. However, the hosting costs of briefly running some test infrastructure are almost always less than the cost of an engineer's time spent finding and fixing problems that could have been avoided with better testing.
Secondly, the larger problem: time.
TDD/BDD works best when you have fast feedback. You'll often run your tests automatically in another window whenever you save your file, so you immediately see any problems.
When your code is creating infrastructure, this kind of fast feedback is impossible. Spinning up servers, creating virtual private clouds (VPCs), and setting up load-balancers takes time—often several minutes, depending on the type of infrastructure resource and the cloud provider. For example, creating an AWS RDS instance takes approximately 20 minutes, possibly longer if you're creating a cluster or setting up read replicas. That's not a criticism of AWS—all of the major cloud providers have similar limitations. Building infrastructure just takes time, even in a modern, cloud-centric environment.
However, there are ways you can minimize these delays. For instance, you can use a managed cloud datastore instead of creating database servers, or launch Docker containers instead of virtual servers, but the delays are impossible to eliminate entirely. And the more your development environment diverges from your production infrastructure, the less reliable your IaC tests become.
One way to get faster feedback from your IaC tests is emulation. So rather than actually building the infrastructure your code defines, you use emulation to try to gain insights into its correctness.
A simple example of this would be to run terraform plan on your Terraform code and see if it looks like it's going to do what you expect.
Although emulation can add value, the problem is that you're now exposed to multiple sources of error: errors in your IaC code, and errors in the emulation layer itself, which give you an inaccurate picture of how your infrastructure provider will actually behave.
When you run terraform plan, the output is effectively saying, "These are the AWS API calls I'm going to make, and these are the results I expect those calls to have." Very often, terraform plan will be completely correct, but sometimes the API has behaviors that aren't emulated correctly.
Name length limits are one example. It's quite common for terraform plan to be completely happy with some code, only for the AWS API to reject a particular call because, for example, the name assigned to an RDS instance is too long.
This is just one common example that I've seen frequently, and I'm sure there are others. Again, this isn't a criticism of Terraform in particular; emulating the entire API of a cloud provider is a huge task, and it's not surprising that no one does it perfectly (as far as I know).
Now that we understand the importance of testing IaC code, and some of the challenges, let's look at several IaC testing tools. We'll look at both functional and conformance testing.
Let's first look at functional testing. Does your IaC code create the correct infrastructure setup for your needs?
Regardless of the IaC technology that you're using, there are usually several dedicated tools designed to help you create automated functional tests for it. Some examples include Terratest for Terraform, Litmus for testing Puppet modules, or test-kitchen for testing Chef code. (These are examples rather than endorsements.)
The landscape of IaC testing tools changes quite quickly, so it's worth doing some basic due diligence to ensure that any tool you're planning to use is still being actively supported and developed, before you sink a lot of time and effort into it.
Let's walk through a very simple example (inspired by this Terratest example) of using Terratest to test some IaC code. We're going to create some Terraform code that deploys an EC2 instance, and some Terratest code to check that our Terraform code tags our instance the way we want it to.
In an empty directory, create two directories terraform and test, and the following files:
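Here's a minimal sketch of what terraform/main.tf might contain. The region, AMI ID, and variable names here are illustrative placeholders, not the article's original listing; substitute a valid AMI for your region:

```hcl
# terraform/main.tf

variable "aws_access_key" {}
variable "aws_secret_key" {}

provider "aws" {
  region     = "us-east-2"
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
}

resource "aws_instance" "webserver" {
  # Placeholder AMI ID -- replace with a current AMI for your region
  ami           = "ami-0123456789abcdef0"
  instance_type = "t2.micro"

  tags = {
    Name = "Webserver"
  }
}

# Expose the instance ID so our test code can look the instance up
output "instance_id" {
  value = aws_instance.webserver.id
}
```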
This code is for Terraform version 0.13.3 (the latest version at time of writing).
We need AWS credentials to run this code:
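For example (the values are placeholders, and the TF_VAR_ names assume the terraform code declares aws_access_key and aws_secret_key variables):

```shell
# Placeholder credentials -- substitute your own values
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

# The same values again, under the variable names our terraform code reads
export TF_VAR_aws_access_key="$AWS_ACCESS_KEY_ID"
export TF_VAR_aws_secret_key="$AWS_SECRET_ACCESS_KEY"
```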
We need the TF_VAR_ environment variables for our terraform code, but terratest also needs the AWS credentials to query the API and get details of our EC2 instance, so we need the same values with different environment variable names.
If you run this code, it will launch an EC2 instance with a Name tag with the value Webserver.
Don't forget to terraform destroy anything you create, or you may incur charges from AWS.
Now let's add a test to check that our terraform code applies the Name tag correctly:
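A sketch of what test/webserver_test.go might look like. The test and function names are illustrative, and it assumes the Terraform code lives in ../terraform, targets us-east-2, and exposes an instance_id output:

```go
// test/webserver_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestWebserverNameTag(t *testing.T) {
	opts := &terraform.Options{
		// Path to the Terraform code under test
		TerraformDir: "../terraform",
	}

	// Tear the infrastructure down again at the end of the test,
	// even if the apply only partially succeeded
	defer terraform.Destroy(t, opts)
	terraform.InitAndApply(t, opts)

	// Look the instance up via the AWS API and check its Name tag
	instanceID := terraform.Output(t, opts, "instance_id")
	tags := aws.GetTagsForEc2Instance(t, "us-east-2", instanceID)
	assert.Equal(t, "Webserver", tags["Name"])
}
```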
To run this test, cd into your test directory and run:
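Something like this (the generous timeout gives AWS time to create and destroy the instance; adjust to taste):

```shell
cd test
go test -v -timeout 30m
```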
The first time you run this, it will download all the packages it needs, and then it will start to apply the terraform code. You can follow along in the AWS console and watch it create and then destroy an AWS instance.
At the end of the test run, you should see something like this (the time taken to run the test will be different):
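Assuming a test function named TestWebserverNameTag, a passing run ends with output along these lines:

```
--- PASS: TestWebserverNameTag (187.42s)
PASS
ok      test    187.45s
```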
Try breaking the terraform code (e.g. by changing Name to InstanceName), and the test should fail.
Some of the benefits of Terratest are:

- It tests real infrastructure: your code is actually applied against the cloud provider, so there's no emulation layer to introduce inaccuracies.
- It ships with helper libraries for common providers and tasks (querying the AWS API, retrying flaky operations, connecting over SSH, and so on).
- Tests are plain Go, so they run with the standard go test toolchain and fit easily into CI pipelines.
On the other hand, the built-in go test framework, which Terratest sits on top of, is quite basic compared with test frameworks in other languages, such as Ruby's RSpec (which underpins Serverspec), and this can result in test code that is more verbose and harder to maintain.
If you already have experience writing tests for your application code, in whatever language that's written in, there's no reason you can't write your IaC tests in the same framework. Automated testing for IaC does have specific challenges, as discussed earlier, but fundamentally you're still setting up pre-conditions, making a change, and testing to see if you got the correct result.
Writing your IaC tests using the same framework as your application tests leverages the existing skills and experience of your engineering team, and allows you to take advantage of all the features and tooling you're used to.
Here's a simple example of using RSpec to test a kubernetes cluster:
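A sketch of what this might look like. The user and group names are illustrative, and the can_i_get_pods helper shells out to kubectl auth can-i, which answers "yes" or "no":

```ruby
require "open3"

# Ask the cluster whether a user belonging to the given group may
# get pods in the kube-system namespace, by impersonating that user
# with kubectl's --as/--as-group flags.
def can_i_get_pods(group)
  output, _status = Open3.capture2(
    "kubectl", "auth", "can-i", "get", "pods",
    "--namespace=kube-system",
    "--as=test-user", "--as-group=#{group}"
  )
  output.strip == "yes"
end

describe "kube-system namespace access" do
  it "allows members of the sysadmin group to get pods" do
    expect(can_i_get_pods("sysadmin")).to be true
  end

  it "denies users outside the sysadmin group" do
    expect(can_i_get_pods("developers")).to be false
  end
end
```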
Here, we are testing namespace access rules by using the --as flag to kubectl to "impersonate" a user from two different groups, and confirm that members of the sysadmin group can perform an operation (get pods in the kube-system namespace), which non-members cannot.
The downside of writing tests in this way is that, because the test framework is not specifically designed for testing infrastructure, it's likely that you'll have to build some libraries to support your test code (such as the can_i_get_pods function in our example). Many of these will be built-in parts of dedicated IaC testing tools.
Up to now, we've mostly been discussing functional testing. Does your IaC code create the correct infrastructure setup for your needs?
Conformance testing (aka compliance testing) is a slightly different approach which tests whether the setup complies with the standards or rules we want to apply to our infrastructure.
Examples of this could include things like:

- every resource must be tagged with the team that owns it
- no security group may allow unrestricted inbound access
- all data stores must be encrypted at rest
Automation around compliance testing usually involves checking IaC code to ensure it complies with defined policies and rules, and rejecting code which fails.
This is analogous to scanning tools like Sonarqube and Rubocop for application software, where the tool scans your code for known anti-patterns and vulnerabilities, and code which fails to meet a pre-defined quality threshold is automatically rejected.
One popular tool for conformance testing, particularly in kubernetes (although it's useful in other environments too), is Open Policy Agent (OPA).
According to the project documentation, OPA is a "general-purpose policy engine." It includes its own policy language, Rego, in which you define the policies you want to enforce.
Let's look at an example, using OPA to apply some restrictions to a kubernetes cluster.
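A sketch of such a policy, along the lines of the ingress-conflict example in the OPA documentation. The package name and the data.kubernetes.ingresses path are assumptions about how the cluster's existing ingresses are replicated into OPA:

```rego
package kubernetes.admission

deny[msg] {
    input.request.kind.kind == "Ingress"
    operations := {"CREATE", "UPDATE"}
    operations[input.request.operation]
    host := input.request.object.spec.rules[_].host

    # Compare against every ingress already known to OPA
    other_ingress := data.kubernetes.ingresses[other_ns][other_name]
    host == other_ingress.spec.rules[_].host

    # Don't reject an ingress for clashing with itself
    not is_self(other_ns, other_name)

    msg := sprintf("ingress host %q conflicts with %v/%v", [host, other_ns, other_name])
}

is_self(ns, name) {
    ns == input.request.object.metadata.namespace
    name == input.request.object.metadata.name
}
```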
This policy ensures that no two ingresses in a kubernetes cluster are trying to handle traffic for the same hostname (this could be a problem in a cluster running multiple services, because it would be possible to accidentally "steal" traffic from a production service by defining the same hostname on a development ingress).
The policy's hostname comparison evaluates to true, triggering the deny, if the other_ingress hostname matches our hostname, while the self-comparison check ensures that the policy doesn't fail every time by comparing an ingress to itself.
Rego has its own test framework, enabling you to write tests for your policies to ensure they have the effects you intend. Here is an example of some tests for the policy we created above (in a real-world scenario, you would need to duplicate these tests for UPDATE as well as CREATE operations).
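Here's a sketch of what such tests might look like, using mock input and data (the test_ prefix is how Rego's framework discovers tests; the resource shapes are abbreviated, and the namespace and hostname values are illustrative):

```rego
package kubernetes.admission

# A CREATE request for an ingress whose host clashes with an
# existing ingress in another namespace should be denied
test_deny_duplicate_host {
    deny[_]
        with input as {
            "request": {
                "kind": {"kind": "Ingress"},
                "operation": "CREATE",
                "object": {
                    "metadata": {"namespace": "dev", "name": "my-ingress"},
                    "spec": {"rules": [{"host": "shop.example.com"}]}
                }
            }
        }
        with data.kubernetes.ingresses as {
            "prod": {
                "shop": {"spec": {"rules": [{"host": "shop.example.com"}]}}
            }
        }
}

# A unique hostname should be allowed through
test_allow_unique_host {
    count(deny) == 0
        with input as {
            "request": {
                "kind": {"kind": "Ingress"},
                "operation": "CREATE",
                "object": {
                    "metadata": {"namespace": "dev", "name": "my-ingress"},
                    "spec": {"rules": [{"host": "dev.example.com"}]}
                }
            }
        }
        with data.kubernetes.ingresses as {
            "prod": {
                "shop": {"spec": {"rules": [{"host": "shop.example.com"}]}}
            }
        }
}
```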
The last part of OPA I want to talk about is Conftest.
Conftest extends the idea of compliance testing to a wide variety of structured data formats including kubernetes configuration files, Dockerfiles, and terraform.
Here's a simple example of using conftest on some terraform code.
In this case, we are enforcing a policy that any S3 buckets must have encryption enabled.
To start with, we need terraform code to create S3 buckets. We're going to create one bucket with server-side encryption, and one without:
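A sketch of the Terraform code (bucket names must be globally unique, so treat these as placeholders; the inline server_side_encryption_configuration block reflects the AWS provider syntax of the Terraform 0.13 era):

```hcl
provider "aws" {
  region = "us-east-2"
}

resource "aws_s3_bucket" "encrypted" {
  # Placeholder name -- S3 bucket names must be globally unique
  bucket = "conftest-example-encrypted-bucket"

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}

resource "aws_s3_bucket" "cleartext" {
  bucket = "conftest-example-cleartext-bucket"
}
```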
As usual, we need to supply our AWS credentials as environment variables:
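Again with placeholder values:

```shell
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
```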
You will need valid AWS credentials if you want to run this code, even though we only need to run terraform plan.
Put all of these files in a directory, supplying valid AWS credentials, and run it like this:
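For example:

```shell
terraform init
terraform plan
```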
You should see output that includes this:
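With two bucket resources defined, the plan summary should include a line like:

```
Plan: 2 to add, 0 to change, 0 to destroy.
```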
Now that we have our terraform code, let's see how we can use conftest to check it against our bucket encryption policy.
Create a policy directory, and add this file:
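Something along these lines. The field names follow Terraform's JSON plan format (resource_changes, change.after), and the empty-list check reflects how the AWS provider of this era represented an absent encryption block; the message text is my own:

```rego
package main

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"

    # An absent encryption block shows up as an empty list in the plan JSON
    resource.change.after.server_side_encryption_configuration == []

    msg := sprintf("S3 bucket %v has no server-side encryption", [resource.address])
}
```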
This is a trivial policy which looks at the changes terraform is going to make, and alerts us if there are any resources where server_side_encryption_configuration is empty.
You can find more information about writing policies in the OPA documentation.
Conftest works by scanning the JSON output from terraform plan, which we can create like this:
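```shell
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
```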
Now that we have our plan.json file, we can run conftest like this:
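```shell
conftest test plan.json
```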
Conftest looks for policies in the policy directory by default. You can specify a different directory with the --policy/-p command-line option.
You should see output like this:
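The exact message depends on your policy text and resource names, but a failing run looks something like:

```
FAIL - plan.json - main - S3 bucket aws_s3_bucket.cleartext has no server-side encryption

2 tests, 1 passed, 0 warnings, 1 failure, 0 exceptions
```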
If you copy the server_side_encryption_configuration stanza into the cleartext bucket definition, and regenerate the plan.json file, the conftest output should change to:
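With both buckets encrypted, you should see a summary along these lines:

```
2 tests, 2 passed, 0 warnings, 0 failures, 0 exceptions
```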
Although we've used terraform in this example, to keep it simple, Kubernetes lends itself particularly well to this kind of approach. As well as finding out what your IaC code says about how your infrastructure should be set up, the kubernetes API enables extensive introspection. So, you can scan a kubernetes cluster and find out exactly what is actually running, and how it's configured.
Tools like Sonobuoy make this easier, allowing you to automatically run reports on the setup of your cluster and the code running on it.
This is a huge topic, and I've barely scratched the surface with this article. But I hope I've shown you some of the tools and techniques available to allow you to apply some of the same testing rigor to your IaC code and configuration that you already apply to developing your application code.
Testing infrastructure code has some challenges, but by taking a layered approach, using a combination of techniques at different points in your infrastructure development lifecycle, you can gain a lot of confidence in your setup, and minimize the risk of later changes introducing errors or vulnerabilities.