EKS best practices you should know


My DevOps journey kicked off when we started to develop Datree - an open-source CLI tool that helps DevOps engineers prevent Kubernetes misconfigurations from reaching production. One year later, seeking best practices and more ways to prevent misconfigurations has become my way of life.

In this article, we’ll explore some of the EKS best practices that I discovered, and learn how we can validate our custom resources against these best practices.

1. Ensure resource limits are configured the same as requests

Description: Pods are scheduled based on requests only - a Pod will be placed on a node only if the node's free capacity allows the Pod's spec.resources.requests. Limits are not factored into scheduling, but they help protect a node from a single Pod running away with all of its resources due to an error or bug.

When Pods reach their CPU limit they are not evicted, just throttled. However, if they try to exceed their memory limit they will be OOM-killed and evicted.

When Pods on a node attempt to use more resources than are available, the node has to give priority to one Pod over another. To make this decision, every Pod is assigned a Quality of Service (QoS) class. Whenever requests != limits, the container's QoS is reduced from Guaranteed to Burstable, making it more likely to be evicted under node pressure.

If requests < limits and the node is anywhere close to its reserved capacity, the node becomes overcommitted, which can lead to Pods attempting to use more resources than are actually available.


Correctly sized requests are particularly important when using a node auto-scaling solution like Karpenter or Cluster Autoscaler, because these tools decide how much capacity to provision based on your workloads' requests.
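For illustration, here is a minimal container spec (the names and numbers are only examples) where requests equal limits, which gives the Pod the Guaranteed QoS class:

apiVersion: v1
kind: Pod
metadata:
  name: web-server   # hypothetical name, for illustration only
spec:
  containers:
  - name: web-server
    image: nginx:1.25
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m      # limits equal to requests -> Guaranteed QoS
        memory: 256Mi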


2. Restrict the use of hostPath, or if it is necessary, allow only specific prefixes and make the volume read-only

Description: A hostPath is a volume that mounts a file or directory from the host node's filesystem into a Pod. By default, Pods that run as root have write access to the filesystem exposed by hostPath. This could allow an attacker to modify kubelet settings, create symbolic links to directories or files not directly exposed by the hostPath (e.g. /etc/shadow), install SSH keys, read secrets mounted to the host, and do other malicious things.

Usually, applications don't need to use hostPath, but it is sometimes necessary - for example when running cAdvisor in a container (which uses a hostPath of /sys), or when a given hostPath directory must exist before the Pod runs, so the Pod first checks whether that directory exists and only then starts. One scenario for this is a Pod that needs to communicate with an external system that produces/consumes data for the application, such as a configuration directory or logs under /var/log/anyApp. Another use case is an app that listens on a Unix socket.
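If you do have one of the use cases above, a minimal sketch (the paths and names are only examples) is to mount just the host directory you need and mark it read-only:

apiVersion: v1
kind: Pod
metadata:
  name: log-reader   # hypothetical example
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/anyApp
      readOnly: true               # mount the hostPath volume read-only
  volumes:
  - name: app-logs
    hostPath:
      path: /var/log/anyApp        # restrict to the specific prefix you need
      type: Directory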

An important note here! Some may suggest that needing persistent storage is a use case for mounting hostPath volumes, but it is not recommended at all. It's true that hostPath volumes were the first form of persistent storage: when a Pod gets deleted, the hostPath volume's contents don't, so if the next Pod points to the same hostPath volume it will see whatever was left behind by the previous Pod - but only if it's scheduled to the same node. This isn't recommended for production clusters. Instead, a cluster administrator should provision a network-backed resource such as an Amazon EBS volume.


3. Limit capabilities needed by a container

Description: Linux kernel capabilities are a set of privileges. By default, Docker runs containers with only a small subset of capabilities, and that set can be modified - we can drop capabilities (using --cap-drop) to harden our containers, or add capabilities (using --cap-add) if needed. It's recommended to restrict any added capabilities to the following allowed values: Undefined/nil, AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT.
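In a Pod spec, the equivalent of --cap-drop/--cap-add is the container's securityContext.capabilities. A minimal sketch (names are hypothetical) that drops everything and adds back only what the workload strictly needs:

apiVersion: v1
kind: Pod
metadata:
  name: capabilities-demo   # hypothetical example
spec:
  containers:
  - name: app
    image: nginx:1.25
    securityContext:
      capabilities:
        drop: ["ALL"]                # start from an empty capability set
        add: ["NET_BIND_SERVICE"]    # add back individual capabilities only when required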


4. Ensure capacity in each AZ when using EBS volumes

We should create one node group in each AZ, so there is always enough capacity available to run pods that cannot be scheduled in other AZs.

In EKS, worker nodes automatically get the failure-domain.beta.kubernetes.io/zone label (on newer versions, topology.kubernetes.io/zone), which contains the name of the AZ, and you can use node selectors to schedule a Pod in a particular AZ. So, if you use Amazon EBS to provide Persistent Volumes, the Kubernetes scheduler knows which AZ a worker node is located in and will always schedule a Pod that requires an EBS volume in the same AZ as the volume. However, if no worker nodes are available in the AZ where the volume is located, the Pod cannot be scheduled.
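For example, a Pod that consumes an EBS-backed PersistentVolumeClaim can be pinned to the volume's AZ with a node selector on the zone label (the zone, claim name, and other names here are assumptions for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: ebs-consumer   # hypothetical example
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a   # schedule only in the volume's AZ
  containers:
  - name: app
    image: nginx:1.25
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: ebs-claim   # hypothetical PVC backed by an EBS volume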


5. seccomp profile must not be explicitly set to Unconfined

Description: In general, seccomp prevents containers from executing unauthorized kernel requests. Running Kubernetes containers/Pods with seccomp=unconfined means one less isolation layer protecting your cluster, and it is advised against by the security community. No container in your cluster should run as seccomp=unconfined, especially in production environments.

A container is simply a process running on a computer; it runs in user space, shares the kernel with other containers, and makes kernel requests known as system calls. Without the kernel, containers can't actually do anything. As a result, Linux kernel developers introduced a powerful security feature called Seccomp-BPF, which allows restricting the syscalls a process can make by creating a special filter (the seccomp-BPF filter). Filters are compiled from seccomp profiles - syscall whitelists - and are implemented with eBPF bytecode (an instruction set interpreted by the kernel). Over time, seccomp has been integrated into container runtimes and orchestration tools like Docker and Kubernetes.

Docker uses seccomp filters and ships its own default profile that compiles down to a seccomp-BPF filter; in Kubernetes terms this is the RuntimeDefault profile, which is suitable for most workloads. However, when we run Kubernetes, it replaces this default with Unconfined, which does not restrict any system call. This means that all Pods that do not specify a seccomp profile automatically run with seccomp=unconfined (newer Kubernetes versions can be configured to use RuntimeDefault as the default via the kubelet's SeccompDefault feature).
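To explicitly opt a Pod back into the runtime's default filter, you can set the seccompProfile in its securityContext - a minimal sketch (names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo   # hypothetical example
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault   # use the container runtime's default seccomp profile
  containers:
  - name: app
    image: nginx:1.25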


6. Schedule Deployment replicas across nodes

Description: Running multiple replicas won't be very useful if all the replicas are running on the same node and that node becomes unavailable. This is where Pod affinity and anti-affinity come to the rescue!

Pod affinity/anti-affinity settings allow a Pod to specify which group of Pods it can (or cannot) be placed next to. In practice, they are a collection of rules that constrain which nodes a Pod is eligible to be scheduled on, based on the labels of Pods already running there. When applied, the Kubernetes scheduler uses these rules to decide where a Pod should be placed.

The manifest below tells the Kubernetes scheduler to prefer placing the Pods on separate nodes and AZs.

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  replicas: 4
  selector:
    ...
  template:
    ...
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - web-server
              topologyKey: topology.kubernetes.io/zone
          - weight: 99
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - web-server
              topologyKey: kubernetes.io/hostname
      containers:
      - ...


7. Spread worker nodes and workloads across multiple AZs

Description: It's recommended to spread nodes and Pods across multiple AZs. If you're using K8s version 1.18+, you can use Pod Topology Spread Constraints, which control how Pods are spread across your cluster among failure domains such as regions, zones, nodes, and other user-defined topology domains. For example, the following manifest spreads Pods across AZs if possible:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          whenUnsatisfiable: ScheduleAnyway
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: web-server

(Bonus) Ensure the Pod's topologySpreadConstraints are set, preferably with whenUnsatisfiable: ScheduleAnyway

One of the Pod Topology Spread Constraints settings is whenUnsatisfiable, which tells the scheduler how to deal with Pods that don't satisfy their spread constraints - whether to schedule them or not. Setting whenUnsatisfiable to DoNotSchedule will cause Pods to be unschedulable if the topology spread constraint can't be fulfilled. It should only be used if it's preferable for Pods not to run at all rather than violate the topology spread constraint.


8. Prevent Windows containers from running privileged

Description: Windows HostProcess containers enable you to run containerized workloads on a Windows host. These containers operate as normal processes but have access to the host network namespace, storage, and devices when given the appropriate user privileges. With HostProcess containers, users can package and distribute management operations and functionalities that require host access.

Since Windows Pods offer the ability to run HostProcess containers, which enable privileged access to the Windows node, it's recommended to disallow privileged access to the host by ensuring that the hostProcess fields (securityContext.windowsOptions.hostProcess on the Pod and on each of its containers) are left unset (Undefined/nil) or set to false.
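A hedged sketch of what this looks like in practice (the image and names are assumptions), with the field left set to false at both the Pod and container level:

apiVersion: v1
kind: Pod
metadata:
  name: windows-app   # hypothetical example
spec:
  nodeSelector:
    kubernetes.io/os: windows
  securityContext:
    windowsOptions:
      hostProcess: false   # do not run as a HostProcess container
  containers:
  - name: app
    image: mcr.microsoft.com/windows/servercore:ltsc2022
    securityContext:
      windowsOptions:
        hostProcess: false   # the container-level field must not be true either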

9. Ensure setting the SELinux type is restricted, and setting a custom SELinux user or role option is forbidden

Description: SELinux defines access controls for the applications, processes, and files on a system. It uses security policies, which are a set of rules that tell SELinux what can or can’t be accessed, to enforce the access allowed by a policy.

It's recommended to make sure that seLinuxOptions.type is set only to one of the following allowed values: Undefined/"", container_t, container_init_t, container_kvm_t, and that seLinuxOptions.user and seLinuxOptions.role are left unset (Undefined/"").
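As a sketch, here is a container restricted to one of the allowed SELinux types, with user and role left unset (all names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo   # hypothetical example
spec:
  containers:
  - name: app
    image: nginx:1.25
    securityContext:
      seLinuxOptions:
        type: container_t   # one of the allowed types; user and role stay unset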

The way SELinux works is that when an application or process makes a request to access an object, like a file, SELinux checks for access based on cached permissions. If SELinux is unable to make a decision based on the cached permissions, it sends the request to the security server. The security server checks the security context of the app or process and of the file; the security context is applied from the SELinux policy database, and permission is then granted or denied. If permission is denied, an "avc: denied" message will be available in /var/log/messages. The access permissions are cached in an access vector cache (AVC).


So... now What?

I’m a GitOps believer: I believe that every Kubernetes resource should be handled exactly like your source code, especially if you are using Helm/Kustomize. So, the way I see it, we should automatically check our resources on every code change.

You can write your policies using languages like Rego or JSON Schema and use tools like OPA Conftest or other validators to scan and validate your resources on every change. Additionally, if you have a single GitOps repository, Argo plays a great role in providing a centralized place to develop and version-control your policies.

However, writing policies might be a pretty challenging task on its own, especially with Rego.

Another way would be to look for tools like Datree, which already come with predefined policies, YAML schema validation, and best practices for Kubernetes and Argo.

How Datree works

The Datree CLI runs automatic checks on every resource that exists in a given path. After the check is completed, Datree displays a detailed output of any violation or misconfiguration it finds, with guidelines on how to fix it.

Scan your cluster with Datree

Scan on every apply with kubectl

kubectl datree test -- -n argocd

You can use the Datree kubectl plugin to validate your resources after deployments, get ready for future version upgrades and monitor the overall compliance of your cluster.

Automatically enforce best practices on every cluster change

Datree offers an open source cluster integration that allows you to validate your resources against built-in/user-defined rules upon pushing them into a cluster, by using an admission webhook.

The webhook catches CREATE, UPDATE, and EDIT events on every apply and verifies that the applied resources comply with the best practices. If any misconfiguration is found, the webhook rejects the request and the resource is not applied.

Datree admission-webhook GitHub

Use Datree to shift left

Disclaimer #2: I’m a big shift-left believer. I believe that the sooner you identify a problem, the less likely it is to take your production down.

The way I see it, one of the biggest challenges of shifting Kubernetes knowledge left is providing guidance and support and communicating the policies. The best way to face this issue is to provide clear guidelines on every failure. This is why one of my favorite features in Datree is the “Message On Fail” - the “how to fix” guideline (which can be modified) that exists for every rule.

Scan your manifests in the CI

In general, Datree can be used in CI, as a local testing library, or even as a pre-commit hook. To use Datree, you first need to install the CLI on your machine and then execute it with the following command:

datree test <path>

As I mentioned above, the way the CLI works is that it runs automatic checks on every resource that exists in the given path. Under the hood, each automatic check includes three steps:

(1) YAML validation: verifies that the file is a valid YAML file.

(2) Kubernetes schema validation: verifies that the file is a valid Kubernetes/Argo resource

(3) Policy check: verifies that the file is compliant with your Kubernetes policy (Datree built-in rules by default).
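For example, a minimal CI job could install the CLI and run the same command against your manifests on every pull request. This is only a sketch - the workflow name, trigger, and manifest path are assumptions, and the install command is the one documented by the project:

# .github/workflows/datree.yml (hypothetical example)
name: policy-check
on: [pull_request]
jobs:
  datree:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install the Datree CLI
        run: curl https://get.datree.io | /bin/bash
      - name: Validate Kubernetes manifests
        run: datree test ./manifests/*.yaml   # adjust the path to where your manifests live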

- Datree CLI on GitHub

Summary

In my opinion, governing policies are only the beginning of achieving reliability, security, and stability for your Kubernetes cluster. I was surprised to find out that centralized policy management might also be a key solution for resolving the DevOps vs Development deadlock once and for all.

Check out the Datree open source project - I highly encourage you to review the code and submit a PR, and don’t hesitate to reach out 😊


🍿 Techworld with Nana: How to enforce Kubernetes best practices and prevent misconfigurations from reaching production. Watch now.
