Sometimes folks think about Kubernetes misconfigurations as a security problem.
While it's true misconfigurations sometimes do result in security breaches or sensitive data leaks, it is just as common that they cause production stability issues, like outages and performance degradations.
For DevOps teams whose primary stakeholders are developers, such production issues are actually ones they encounter more often.
Kubernetes has so many different pieces to it and is highly extensible. Configuration options are aplenty. Unfortunately, the default configs aren't always the best configs.
Just because Kubernetes allows you to deploy a pod with access to the host network namespace, for example, it doesn't mean it's a good idea.
It's important to know the best practices and see how they'd be applicable to your use case. Often, that knowledge comes from lessons learned by others before you.
In this article we share 5 examples of Kubernetes misconfiguration mistakes. They were all fairly simple as far as mistakes go but did impact production.
These examples are just the tip of the iceberg, but we hope they inspire you to accumulate Kubernetes configuration best practices knowledge.
Target went from using Chef to using Spinnaker for immutable infrastructure and Kubernetes in its data centers, taking deployment from something that consumed 80% of a developer's time to almost none, apart from a couple of hours of upfront setup. In terms of scale, they run thousands of pods hosting hundreds of applications, and as of 2018 their Kubernetes config files were more than 20,000 lines long.
One container was taking all the ingress traffic for an entire cluster, which took the cluster down, simply because a developer changed the Ingress config and set the host to a wildcard in the Ingress resource.
Lesson learned: Prevent users from specifying host as “*”
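A rule like this can be enforced automatically before a manifest reaches the cluster. The following is a minimal sketch in Python, assuming the Ingress manifest has already been parsed into a dict. The function name `wildcard_ingress_rules` and the sample hosts are hypothetical. Treating a missing host as a catch-all goes slightly beyond the lesson above: Kubernetes applies a rule with no host to all inbound HTTP traffic, so it is flagged here too.

```python
def wildcard_ingress_rules(manifest):
    """Return the indexes of Ingress rules that match every host."""
    if manifest.get("kind") != "Ingress":
        return []
    rules = (manifest.get("spec") or {}).get("rules") or []
    flagged = []
    for index, rule in enumerate(rules):
        host = rule.get("host")
        # "*" is the case from the story; an empty or missing host is
        # also a catch-all, so it is flagged as well (an assumption of
        # this sketch, not part of the original lesson).
        if host in (None, "", "*"):
            flagged.append(index)
    return flagged


# Hypothetical manifest: the second rule would capture all traffic.
ingress = {
    "kind": "Ingress",
    "spec": {
        "rules": [
            {"host": "shop.example.com"},
            {"host": "*"},
        ]
    },
}
print(wildcard_ingress_rules(ingress))  # prints [1]
```

A check of this kind could run in CI or in an admission webhook; which one fits depends on how manifests reach your clusters.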
Learn more about this mistake by Target.
In another incident at Target, one failing cron job created 4,320 pods in 3 days, all of them constantly “restarting”. The cluster accumulated a few hundred CPUs during that time, resulting in an expensive bill from GCP.
This is an example of a default config not being the best config: the default for CronJob is `concurrencyPolicy: Allow`, so when a job fails, the next scheduled job doesn't replace it and instead runs alongside it.
Lesson learned: verify concurrencyPolicy is always set to either “Forbid” or “Replace”
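As a rough illustration of that verification, here is a small Python sketch that assumes the CronJob manifest is already parsed into a dict. The function name `has_safe_concurrency_policy` and the sample manifests are hypothetical.

```python
SAFE_POLICIES = ("Forbid", "Replace")


def has_safe_concurrency_policy(cronjob):
    """True if the CronJob explicitly sets Forbid or Replace."""
    spec = cronjob.get("spec") or {}
    # A missing field means the Kubernetes default, "Allow", applies.
    return spec.get("concurrencyPolicy", "Allow") in SAFE_POLICIES


# Hypothetical manifests for illustration.
default_policy = {"kind": "CronJob", "spec": {"schedule": "*/1 * * * *"}}
forbid_policy = {
    "kind": "CronJob",
    "spec": {"schedule": "*/1 * * * *", "concurrencyPolicy": "Forbid"},
}
print(has_safe_concurrency_policy(default_policy))  # prints False
print(has_safe_concurrency_policy(forbid_policy))   # prints True
```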
Learn more about this mistake by Target.
Zalando is Europe's leading online fashion commerce company with over 6,000 employees. They're running 100 Kubernetes clusters in total serving 1,100 developers.
Zalando’s API server was down due to out of memory issues. The pod explosion was caused by a very innocent manifest. They used a CronJob with the following configs:
By themselves, these settings are fine. In fact, as you may have noticed, they avoid the misconfiguration mistake Target made in story #2. However, they were not placed correctly in the Kubernetes manifest...
This caused the concurrency policy to be ignored, so the CronJob had no limits and kept spawning pods that were never cleaned up in the API server.
Lesson learned: Always ensure that Kubernetes YAML structure is valid. Ideally through automated checks.
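One narrow form of such an automated check is sketched below in Python: it looks for `concurrencyPolicy` anywhere in a parsed CronJob manifest and reports every location other than `spec.concurrencyPolicy`, where the CronJob API expects it. The function names are hypothetical, and the nesting used in the sample manifest is an assumption for illustration only; it does not claim to reproduce the placement in the actual incident. A general schema validator would cover far more cases than this single-field check.

```python
def find_key_paths(node, key, path=()):
    """Return every path (as a tuple) at which `key` occurs in `node`."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(path + (name,))
            found.extend(find_key_paths(value, key, path + (name,)))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_key_paths(item, key, path + (index,)))
    return found


def misplaced_concurrency_policy(cronjob):
    """Return paths where concurrencyPolicy appears in the wrong place."""
    expected = ("spec", "concurrencyPolicy")
    return [p for p in find_key_paths(cronjob, "concurrencyPolicy") if p != expected]


# Hypothetical manifest: the field is nested one level too deep.
cronjob = {
    "kind": "CronJob",
    "spec": {
        "schedule": "*/5 * * * *",
        "jobTemplate": {"spec": {"concurrencyPolicy": "Forbid"}},
    },
}
print(misplaced_concurrency_policy(cronjob))
# prints [('spec', 'jobTemplate', 'spec', 'concurrencyPolicy')]
```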
Learn more about this mistake by Zalando.
Datadog is a leading monitoring software company, but even they weren't immune to Kubernetes misconfigurations.
The DevOps team was hearing that jobs weren't starting, and upon closer look they saw error state pods related to image pulls. The team could see the number of image pulls sharply increased and sustained for several hours.
The image pulls were failing with 429 error message (too many requests), which indicated they were being rate limited by their image registry.
As it turned out:
Effectively, the bad code caused a crash loop that, combined with `imagePullPolicy: Always`, amounted to a DDoS attack (from 3 NAT addresses) on their image registry provider.
Compounding the problem, developers were allowed to use floating tags like “latest” on their images.
Lesson learned: don’t set “imagePullPolicy: Always” and don’t allow floating tags for images (“latest”) on any object type.
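Both parts of this lesson can be checked mechanically. The sketch below is written in Python and assumes manifests are parsed into dicts. It walks any object type looking for `containers` and `initContainers` lists, then flags an explicit `imagePullPolicy: Always` and any image with no tag or the `latest` tag. All function names, the registry host, and the image names are hypothetical.

```python
def is_floating_tag(image):
    """True if the image reference has no tag or uses "latest"."""
    if "@" in image:
        # Pinned by digest, e.g. name@sha256:...
        return False
    last_segment = image.rsplit("/", 1)[-1]  # skip any registry host:port
    if ":" not in last_segment:
        return True  # no tag means "latest" is implied
    return last_segment.rsplit(":", 1)[1] == "latest"


def iter_containers(node):
    """Yield container dicts found anywhere in a parsed manifest."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("containers", "initContainers") and isinstance(value, list):
                yield from (c for c in value if isinstance(c, dict))
            else:
                yield from iter_containers(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_containers(item)


def image_findings(manifest):
    """Return (container name, problem) pairs for the whole manifest."""
    findings = []
    for container in iter_containers(manifest):
        name = container.get("name", "<unnamed>")
        if container.get("imagePullPolicy") == "Always":
            findings.append((name, "imagePullPolicy is Always"))
        if is_floating_tag(container.get("image", "")):
            findings.append((name, "floating image tag"))
    return findings


# Hypothetical Deployment for illustration.
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "registry.example.com:5000/web:latest",
         "imagePullPolicy": "Always"},
        {"name": "sidecar", "image": "registry.example.com:5000/sidecar:1.4.2"},
    ]}}},
}
print(image_findings(deployment))
# prints [('web', 'imagePullPolicy is Always'), ('web', 'floating image tag')]
```

Note that when `imagePullPolicy` is omitted and the tag is `latest` or absent, Kubernetes defaults the policy to `Always`, so the floating-tag check also catches that implicit case.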
Learn more about this mistake by Datadog.
Kubernetes is popular not just with large enterprises or established tech startups, but also with smaller startups (and small autonomous units inside large engineering organizations) because it “automagically” handles infrastructure for the developers.
In the fifth story, we look at the mistake made by Blue Matador, a small startup making monitoring software, and how Kubernetes’ self-healing capabilities helped them.
The DevOps engineer noticed there were system OOM events on their production nodes, preceded by high memory usage.
Upon closer inspection, the pods in question were their fluentd-sumologic pods hosting SumoLogic, a third-party application whose containers are memory hogs.
It turned out that because resource requests and limits were not configured on those pods, Kubernetes didn't stop them from taking up all the memory on the node, which caused a system OOM event.
Lesson learned: Specify resource requests and limits, especially for third-party services and apps.
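A simple way to catch this before deployment is to list containers that lack either block. The following Python sketch assumes it is given a parsed pod spec (the dict that holds `containers`). The function name `containers_missing_resources` and the container names, images, and values are hypothetical.

```python
def containers_missing_resources(pod_spec):
    """Return names of containers lacking resource requests or limits."""
    missing = []
    for container in pod_spec.get("containers") or []:
        resources = container.get("resources") or {}
        # Both blocks must be present and non-empty to pass.
        if not resources.get("requests") or not resources.get("limits"):
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical pod spec: only the first container is fully configured.
pod_spec = {
    "containers": [
        {
            "name": "app",
            "image": "app:1.0.0",
            "resources": {
                "requests": {"memory": "128Mi", "cpu": "100m"},
                "limits": {"memory": "256Mi", "cpu": "500m"},
            },
        },
        {"name": "log-forwarder", "image": "log-forwarder:2.3.1"},
    ]
}
print(containers_missing_resources(pod_spec))  # prints ['log-forwarder']
```

This only checks that the blocks exist; choosing sensible values still requires observing the workload's actual memory and CPU usage.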
Learn more about this mistake by Blue Matador.
As mentioned, these 5 real-world examples are just the tip of the iceberg.
Getting Kubernetes configurations right is a worthy endeavor for any engineering team looking to help developers autonomously self-serve infrastructure without risking production and security issues.
Contribute to Kubernetes Failure Stories on GitHub.