Fixing Kubernetes OOMKilled Pods That Restart Without Warning
Your pod restarts at 2 AM, the logs vanish, and all you see in kubectl get pods is a climbing restart count and the status OOMKilled. There's no stack trace, no error message, just a dead container and a very unhappy on-call rotation. This guide walks you through exactly how to find the leak, set the right limits, and keep the restarts from coming back.
What you'll learn
- How to read the signals Kubernetes leaves behind after an OOMKill
- How to distinguish a misconfigured memory limit from a genuine memory leak
- How to profile memory usage inside a running container
- How to set resource requests and limits that won't strangle or bankrupt your workload
- Guardrails to add so you catch the next problem before Kubernetes does
Prerequisites
You'll need kubectl configured against your cluster, and ideally access to your cluster's metrics server or a Prometheus/Grafana stack. The examples use a generic Linux container, but the concepts apply to any runtime.
Understanding What OOMKilled Actually Means
OOMKilled is not a Kubernetes concept; it's a Linux kernel concept. When a container's memory usage crosses the limit set by its cgroup, the kernel's Out-Of-Memory (OOM) killer steps in and terminates the process with signal 9 (SIGKILL). Kubernetes notices the exit code, records the reason as OOMKilled, and restarts the pod according to its restart policy.
Because the kill happens at the kernel level, your application has no chance to flush logs or write a crash report. That's why the silence feels so jarring. The good news is that Kubernetes and the Linux kernel both leave breadcrumbs you can follow.
Reading the Post-Mortem Evidence
Start by looking at what Kubernetes recorded about the terminated container. The --previous flag on kubectl logs fetches the logs from the last container run before the current one:
kubectl logs <pod-name> --previous -n <namespace>
If the process had time to write anything before the kill, you'll see it here. Combine this with kubectl describe pod:
kubectl describe pod <pod-name> -n <namespace>
Look for the Last State block under the container section. It will tell you the exit code (137 for OOMKill, which is 128 + 9) and the reason. You'll also see the memory limit that was configured at the time of the crash.
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
  Started:    Mon, 10 Jun 2024 03:12:44 +0000
  Finished:   Mon, 10 Jun 2024 03:13:01 +0000
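The 128 + 9 arithmetic behind exit code 137 generalizes to other signals. As a minimal sketch, assuming the usual shell convention that codes above 128 mean "terminated by signal (code - 128)", a small helper can decode whatever exit code shows up in Last State. The function name describe_exit_code is hypothetical, not part of any Kubernetes tooling.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            # Look up the symbolic name, e.g. 9 -> SIGKILL
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited on its own with status {exit_code}"


print(describe_exit_code(137))
```

On a Linux machine this reports SIGKILL for 137 and SIGTERM for 143. Keep in mind that 137 only tells you the process received SIGKILL; it is the Reason field that confirms the OOM killer was responsible.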
If you have the metrics server installed, check recent memory usage across your pods:
kubectl top pods -n <namespace> --sort-by=memory
This won't show you historical data, but it tells you which pods are currently memory-heavy and gives you a baseline to compare against your configured limits.
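To make the comparison against limits concrete, here is a rough sketch that parses the text printed by kubectl top pods and flags pods that are close to a limit. It assumes the usual three-column layout (NAME, CPU(cores), MEMORY(bytes)), and for simplicity it assumes every pod shares one memory limit. The sample output, pod names, and the pods_near_limit helper are all made up for illustration.

```python
# Conversion factors from Kubernetes binary suffixes to MiB
UNITS_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}


def to_mib(quantity: str) -> float:
    """Convert a quantity such as '540Mi' to a number of MiB."""
    for suffix, factor in UNITS_MIB.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def pods_near_limit(top_output: str, limit_mib: float, threshold: float = 0.8):
    """Return (pod, fraction_of_limit) for pods at or above the threshold."""
    risky = []
    # Skip the header row, then read name / cpu / memory columns
    for line in top_output.strip().splitlines()[1:]:
        name, _cpu, memory = line.split()
        used = to_mib(memory)
        if used >= threshold * limit_mib:
            risky.append((name, used / limit_mib))
    return risky


# Hypothetical output, pasted from a terminal
SAMPLE = """
NAME        CPU(cores)   MEMORY(bytes)
api-7d9f    120m         540Mi
worker-x2   40m          210Mi
"""

for pod, fraction in pods_near_limit(SAMPLE, limit_mib=600):
    print(f"{pod} is at {fraction:.0%} of its limit")
```

In a real cluster each pod has its own limit, so you would look that up per pod rather than passing a single number.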
Distinguishing a Limit Too Low From a Genuine Leak
These two problems look identical from the outside but require completely different fixes. Setting the limit higher on a leaking container just delays the next crash.
Symptom: the limit is just too low
The application's memory usage is stable, predictable, and legitimate. It needs 600 Mi to run, but the limit is set to 256 Mi. The pod crashes almost immediately after startup or under any meaningful load. The fix is straightforward: raise the limit to match the real working set.
Symptom: memory grows over time
If you watch the pod's memory in Grafana (or via repeated kubectl top calls) and the usage climbs steadily over hours or days before crashing, you have a leak. Common causes include:
- Unbounded in-memory caches that never evict entries
- Event listeners or goroutines that are registered but never cleaned up
- Third-party libraries that hold references the GC can't collect
- Large objects being built up in response to each request without release
Raising the limit here only buys time. You need to profile the application.
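If you have a series of memory readings, a quick trend line helps tell the two cases apart. The sketch below fits a least-squares slope to (hours elapsed, MiB used) samples and estimates the time left before the limit is hit. The sample numbers and function names are hypothetical; a slope near zero suggests a stable working set, while a steady positive slope points toward a leak.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, MiB) samples, in MiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def hours_until_limit(samples, limit_mib):
    """Estimate hours until the limit is reached, or None if usage is not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None
    latest = samples[-1][1]
    return max(0.0, (limit_mib - latest) / slope)


# Hypothetical hourly readings: usage climbs 10 MiB every hour
readings = [(0, 300), (1, 310), (2, 320), (3, 330)]
print(growth_per_hour(readings), hours_until_limit(readings, limit_mib=600))
```

A straight-line fit is a deliberate simplification. Real workloads have daily cycles and garbage-collection sawtooths, so look at several days of data before trusting the estimate.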
Profiling Memory Inside a Running Container
Getting a memory profile depends on your runtime. Here are the practical approaches for the most common stacks.
Python
If you can add a debug endpoint, tracemalloc is built into the standard library and requires no additional dependencies:
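import tracemalloc

# Start tracking allocations as early as possible in the process
tracemalloc.start()

# ... later, when you want a snapshot:
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

# Print the ten source lines holding the most memory
for stat in top_stats[:10]:
    print(stat)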
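A single snapshot shows what is big; a leak shows up more clearly as the difference between two snapshots. As a minimal sketch, tracemalloc's Snapshot.compare_to can rank source lines by how much their allocations grew. The list comprehension here is only a stand-in for whatever suspect workload runs between the two snapshots.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for the suspect workload: roughly 1 MB that is never released
retained = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()

# Rank source lines by how much their allocated size changed
growth = after.compare_to(before, "lineno")
for stat in growth[:5]:
    print(stat)
```

In a service, you might take the first snapshot after warm-up and the second an hour later, then look for lines whose size keeps growing between successive comparisons.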
For a running pod without a debug endpoint, kubectl exec into the container and use py-spy, if it's available, to see what the process is doing (it samples call stacks rather than allocations), or check /proc/<pid>/smaps for a raw breakdown of memory segments.
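The smaps file is long and repetitive, so it helps to aggregate it. The following sketch sums the Rss field per mapping, assuming the standard layout where each mapping starts with an address-range header line followed by "Key: value kB" lines. The sample text is fabricated for illustration, and rss_by_mapping is a hypothetical helper.

```python
import re
from collections import Counter

# A mapping header starts with a hex address range, e.g. "7f2a10000000-7f2a14000000 "
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")


def rss_by_mapping(smaps_text: str) -> Counter:
    """Sum resident memory (kB) per mapping name from smaps-formatted text."""
    totals = Counter()
    current = "[anonymous]"
    for line in smaps_text.splitlines():
        if HEADER.match(line):
            # Fields: address, perms, offset, dev, inode, optional pathname
            fields = line.split(maxsplit=5)
            current = fields[5] if len(fields) > 5 else "[anonymous]"
        elif line.startswith("Rss:"):
            totals[current] += int(line.split()[1])
    return totals


# Made-up excerpt in the shape of /proc/<pid>/smaps
SAMPLE = """\
55d0c0a00000-55d0c0c00000 r-xp 00000000 08:01 1234 /usr/bin/python3.11
Size:               2048 kB
Rss:                1500 kB
7f2a10000000-7f2a14000000 rw-p 00000000 00:00 0
Size:              65536 kB
Rss:               40960 kB
55d0c1e00000-55d0c1e21000 rw-p 00000000 00:00 0 [heap]
Rss:                 512 kB
"""

for name, kb in rss_by_mapping(SAMPLE).most_common(3):
    print(f"{kb:>8} kB  {name}")
```

Large anonymous mappings usually correspond to the interpreter's own heap arenas, while large file-backed mappings point at shared libraries or memory-mapped data.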
Node.js
Trigger a heap snapshot via a debug endpoint or the --inspect flag. If you're using a framework like Fastify or Express, a dedicated /debug/heap route that calls v8.writeHeapSnapshot() is the cleanest approach for a staging environment.
Go
The net/http/pprof package exposes a /debug/pprof/heap endpoint you can enable at startup. Fetch the profile and analyze it with go tool pprof locally.
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.out
go tool pprof heap.out
JVM (Java, Kotlin, Scala)
Use kubectl exec to run jmap -histo <pid> inside the container for a quick class histogram, or trigger a full heap dump with jmap -dump:live,format=b,file=<path> <pid> and copy it out with kubectl cp.
Setting Resource Requests and Limits Correctly
The single most common source of OOMKilled pods is limits set by guesswork. Here's a systematic approach.
Requests vs. limits: the key distinction
A request is what Kubernetes reserves on the node during scheduling. A limit is the hard ceiling the kernel enforces. If your request is 256 Mi and your limit is 512 Mi, Kubernetes will schedule the pod on a node with at least 256 Mi of allocatable memory not already claimed by other pods' requests, but the container can burst up to 512 Mi. If it exceeds 512 Mi, the OOM killer fires.
Setting requests too low means you'll land on packed nodes and get killed when neighbors also burst. Setting limits too low means you kill your own workload under normal conditions.
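The gap between what the scheduler checks and what containers may actually use can be shown with a little arithmetic. This is a deliberately simplified model: it assumes the scheduler only compares summed memory requests against the node's allocatable memory, whereas the real scheduler weighs many other factors. The function names and node size are hypothetical.

```python
def fits_on_node(allocatable_mib, existing_requests_mib, new_request_mib):
    """True if the new pod's request fits alongside the requests already placed."""
    return sum(existing_requests_mib) + new_request_mib <= allocatable_mib


def overcommit_ratio(allocatable_mib, limits_mib):
    """How far the summed limits exceed what the node can actually provide."""
    return sum(limits_mib) / allocatable_mib


# A hypothetical node with 1024 MiB allocatable and three pods requesting 256 MiB each
print(fits_on_node(1024, [256, 256, 256], 256))       # the fourth pod still fits by request
print(overcommit_ratio(1024, [512, 512, 512, 512]))   # but the limits add up to twice the node
```

In this example all four pods are scheduled because their requests fit, yet if every one bursts to its 512 Mi limit the node needs twice the memory it has. That is the situation in which a pod with a low request gets killed because of its neighbors.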
A baseline to start from
Run the application under realistic load in staging, watch its memory ceiling with kubectl top or Prometheus, then set your limit at roughly 20-30% above that observed ceiling. Set your request at roughly 70-80% of the limit so the scheduler has an accurate picture of your workload's needs.
resources:
  requests:
    memory: "400Mi"
    cpu: "250m"
  limits:
    memory: "600Mi"
    cpu: "500m"
Do not copy these numbers. Measure your own workload.
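The rule of thumb above is easy to turn into arithmetic. As a sketch, the helper below takes an observed peak in MiB and applies a headroom and request fraction picked from the middle of the suggested ranges (25% and 75%). The function name and defaults are assumptions for illustration, and the 480 MiB peak is a made-up measurement.

```python
import math


def suggest_memory(peak_mib, headroom=0.25, request_fraction=0.75):
    """Derive memory request/limit strings from an observed peak, rounding up."""
    limit = math.ceil(peak_mib * (1 + headroom))
    request = math.ceil(limit * request_fraction)
    return {
        "requests": {"memory": f"{request}Mi"},
        "limits": {"memory": f"{limit}Mi"},
    }


# Hypothetical measurement: the service peaked at 480 MiB under staging load
print(suggest_memory(480))
```

Treat the output as a starting point and revisit it after the workload has run in production for a while, since staging load rarely matches real traffic exactly.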
Burstable vs. Guaranteed QoS classes
When every container's requests equal its limits for both CPU and memory, Kubernetes assigns the pod the Guaranteed QoS class, which makes it the last to be evicted under node memory pressure. When limits are higher than requests, the pod is Burstable. For critical services, consider matching requests to limits to get the Guaranteed class.
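The classification rules can be sketched as a small function. This is a simplified model of the documented behavior, not Kubernetes' own code: it compares quantities as plain strings (so "500m" and "0.5" would wrongly count as different), and it takes containers as plain dictionaries shaped like the resources block of a pod spec. The qos_class name is hypothetical.

```python
def qos_class(containers):
    """Classify a pod from a list of {'requests': {...}, 'limits': {...}} dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


matched = [{"requests": {"cpu": "500m", "memory": "600Mi"},
            "limits": {"cpu": "500m", "memory": "600Mi"}}]
bursty = [{"requests": {"cpu": "250m", "memory": "400Mi"},
           "limits": {"cpu": "500m", "memory": "600Mi"}}]
print(qos_class(matched), qos_class(bursty), qos_class([{}]))
```

To see what Kubernetes actually assigned, check the QoS Class field in kubectl describe pod rather than relying on a model like this one.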
Common Pitfalls That Catch People Out
- JVM off-heap memory: The JVM's heap is bounded by -Xmx, but the JVM also allocates off-heap memory for metaspace, thread stacks, and native libraries. Many teams set -Xmx equal to the container limit and then wonder why OOMKill still happens. Leave at least 25% of the limit outside the heap.
- Sidecar containers count against the pod: If you have a logging or tracing sidecar, its memory usage counts toward the node's allocation. Make sure you've added resource specs to every container in your pod spec, not just the main one.
- Init containers don't run concurrently, but their peak usage still matters during startup. A heavy init container can trigger an OOMKill before your main container even starts.
- Limits without requests on shared clusters: On a multi-tenant cluster, omitting requests means the scheduler can't make good decisions. Your pod may land on an already-pressured node and get killed not by its own usage but by the node running out of memory.
- Forgetting to account for page cache: Linux aggressively uses free memory as a page cache. The cgroup memory accounting includes page cache in some kernel versions, which can make your container appear to use more memory than your application actually does. Check /sys/fs/cgroup/memory/memory.stat inside the container for a breakdown (on cgroup v2 hosts, the file is /sys/fs/cgroup/memory.stat).
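The JVM heap guidance in the list above comes down to one calculation. As a minimal sketch, this helper reserves a fraction of the container limit for non-heap memory and prints the matching -Xmx flag. The 25% default mirrors the guidance above; the function name is hypothetical, and the right fraction for your service depends on thread count and native library use.

```python
def jvm_heap_flag(container_limit_mib, non_heap_fraction=0.25):
    """Build an -Xmx flag that leaves part of the container limit outside the heap."""
    if not 0 <= non_heap_fraction < 1:
        raise ValueError("non_heap_fraction must be in [0, 1)")
    # Truncate so the heap never rounds up past the intended share
    heap_mib = int(container_limit_mib * (1 - non_heap_fraction))
    return f"-Xmx{heap_mib}m"


# For a 600 MiB container limit, keep 150 MiB for metaspace, stacks, and native memory
print(jvm_heap_flag(600))
```

Recent JVMs can also size the heap relative to the container limit with -XX:MaxRAMPercentage, which avoids hard-coding a number that drifts out of sync with the pod spec.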
Adding Observability So You Catch It Earlier
Reacting to OOMKills after they happen is painful. A few proactive measures cut the mean time to detection significantly.
If you're running Prometheus, the container_memory_working_set_bytes metric is the most relevant one to alert on: it excludes cache and reflects what the kernel actually considers