DevOps Start

Posted on Jun 16 • Originally published at devopsstart.com

Fix CrashLoopBackOff in Kubernetes Pods

#kubernetes #debugging #pods

What CrashLoopBackOff Means

CrashLoopBackOff is not an error on its own. It is a status that tells you the container in a Pod started, exited, and Kubernetes is now waiting before it tries again. The kubelet restarts a failing container with an exponential backoff delay (10s, 20s, 40s, and so on, capped at 5 minutes). The "BackOff" part is the wait; the "CrashLoop" part is the repeated exit.
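To make the wait times concrete, here is a minimal sketch that prints the delay sequence described above. The function name backoff_schedule is made up for illustration, and the code only reproduces the doubling-with-cap pattern; it is not how the kubelet is implemented.

```shell
# Print the first N restart delays in seconds: start at 10, double after
# each restart, and cap at 300 (5 minutes).
backoff_schedule() {
  count=$1
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$count" ]; do
    # Append the current delay, space-separated.
    out="$out${out:+ }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
  echo "$out"
}
```

Calling backoff_schedule 7 shows that the cap is reached on the sixth restart, so a Pod that has been looping for a while sits in the wait state far longer than it spends running.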

The key point: the container is doing exactly what you told it to do, then terminating. Your job is to find out why the process exits. The status itself never tells you the cause, so do not waste time staring at it. Go straight to the logs and events.

You can confirm the restart pattern with:

kubectl get pods -A | grep CrashLoop
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'

A restart count that keeps climbing confirms the loop. Early restarts come seconds apart; once the backoff reaches its cap, expect roughly one every five minutes. The official Pod lifecycle documentation explains how the kubelet drives these restart states.

Diagnose the Root Cause

Work through these three signals in order. One of them almost always points at the cause.

Container logs

Start with the current and previous container logs. The previous logs are critical because the instance that crashed is already gone, and its replacement may not have logged anything yet:

kubectl logs <pod-name>
kubectl logs <pod-name> --previous
kubectl logs <pod-name> -c <container-name> --previous

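The --previous flag shows output from the last crashed instance. Use -c when the Pod has more than one container; without it, kubectl picks the container named in the kubectl.kubernetes.io/default-container annotation, or the first container if that annotation is absent.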
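In a multi-container Pod it is easy to check the wrong container. The sketch below loops over every container in the Pod spec and prints the tail of its previous logs. The function name previous_logs_all and the KUBECTL override variable are assumptions for illustration; init containers are not covered.

```shell
# Print the last 50 lines from the previous instance of each container.
# KUBECTL can point at a wrapper; it defaults to plain kubectl.
previous_logs_all() {
  pod=$1
  for c in $("${KUBECTL:-kubectl}" get pod "$pod" -o jsonpath='{.spec.containers[*].name}'); do
    echo "== $c =="
    # --previous fails for a container that has never restarted; keep going.
    "${KUBECTL:-kubectl}" logs "$pod" -c "$c" --previous --tail=50 || true
  done
}
```

Run it as previous_logs_all <pod-name>. The container whose section contains a stack trace or a fatal message is usually the one driving the loop.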

Pod events and state

kubectl describe surfaces scheduling problems, image pull failures, probe failures, and the exact exit reason:

kubectl describe pod <pod-name>

Read the Events section at the bottom and the Last State block under the container status. Last State shows the Reason (for example Error or OOMKilled) and the Exit Code.

Exit codes

The exit code narrows the cause quickly:

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

Common values:

0: the process exited cleanly. The container ran a short task and finished. Use a Job, not a Deployment, or keep the main process running.
1: a generic application error. Check the logs.
137: the process was killed by SIGKILL (128 + 9), usually because it was OOMKilled or because it ignored SIGTERM past the termination grace period after a failed liveness probe.
139: a segmentation fault (SIGSEGV) inside the binary.
143: terminated by SIGTERM during shutdown.
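Exit codes above 128 follow the convention of 128 plus the signal number, which is where 137, 139, and 143 come from. A small sketch that decodes a code this way is shown below; the function name describe_exit_code is hypothetical, and it names only the three signals listed above.

```shell
# Translate a container exit code into a short description.
# Codes above 128 are treated as 128 + signal number.
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL)" ;;
      11) echo "signal 11 (SIGSEGV)" ;;
      15) echo "signal 15 (SIGTERM)" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    echo "application error $code"
  fi
}
```

Feed it the value returned by the jsonpath query above, for example describe_exit_code 137.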

Common Causes and Fixes

Bad command or entrypoint

If there are no application logs at all, the container often never ran your code. A wrong command, args, or a binary that is not on the image PATH produces an immediate exit. Check what the manifest overrides:

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].command}'
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].args}'

Fix the command and args in the manifest, or correct the ENTRYPOINT/CMD in the Dockerfile, then rebuild.

Missing config or secret

A container that crashes on startup looking for an environment variable or mounted file is usually missing a ConfigMap or Secret. The Pod events will show CreateContainerConfigError if the reference itself is broken:

kubectl get configmap,secret -n <namespace>
kubectl describe pod <pod-name> | grep -A5 Events

Create the missing object or fix the name in envFrom, valueFrom, or the volume reference.

Failing liveness or readiness probe

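A liveness probe that fails repeatedly restarts the container, which looks identical to a crash loop. A failing readiness probe does not restart anything; it only removes the Pod from Service endpoints, so if the restart count is climbing, focus on the liveness probe. Look for Liveness probe failed in the events. The usual causes are a probe path that does not exist, a port mismatch, or an initialDelaySeconds that is too short for a slow-starting app.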

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

For apps with long warmups, add a startupProbe so the liveness probe does not fire until startup completes.
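When sizing a startupProbe, the window it gives the app is roughly failureThreshold multiplied by periodSeconds. The sketch below is just that arithmetic; the function name startup_budget and the example values are assumptions for illustration, not settings taken from this article.

```shell
# Approximate number of seconds a startupProbe allows before the kubelet
# gives up and restarts the container: failureThreshold * periodSeconds.
# Ignores initialDelaySeconds and per-probe timeouts.
startup_budget() {
  failure_threshold=$1
  period_seconds=$2
  echo $((failure_threshold * period_seconds))
}

# Example: a startupProbe with failureThreshold: 30 and periodSeconds: 10
# would be checked as: startup_budget 30 10
```

Pick values so the result comfortably exceeds the slowest startup you have observed; the liveness probe takes over only after the startup probe succeeds once.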

OOMKilled or resource limits

Exit code 137 with Reason: OOMKilled means the container exceeded its memory limit and the kernel killed it. The diagnosis flow is close enough to a standalone OOM kill that the step-by-step OOMKilled guide is worth following when memory is the trigger. Confirm the reason first:

kubectl describe pod <pod-name> | grep -i oom

Raise resources.limits.memory, or fix the leak in the app. Set a requests value too so the scheduler places the Pod on a node with enough memory:

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

Dependency not ready

If the container exits because a database or upstream service is unreachable at boot, do not let it crash. Use an initContainer to wait for the dependency, or add retry logic in the app. An init container that blocks until the dependency answers keeps the main container from looping:

initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db 5432; do sleep 2; done']
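The until loop above waits forever if the dependency never comes up. A bounded variant is sketched below, assuming you would rather have the init container fail visibly after a fixed number of attempts. The function name wait_for and the attempt counts are hypothetical.

```shell
# Retry a command until it succeeds or the attempts run out.
# usage: wait_for <max-attempts> <sleep-seconds> <command...>
wait_for() {
  attempts=$1
  pause=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    # Run the check; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    sleep "$pause"
  done
  echo "gave up after $attempts attempts: $*" >&2
  return 1
}
```

Inside the init container this would be called as wait_for 60 2 nc -z db 5432. When it gives up, the init container exits non-zero and the failure shows up in the Pod events instead of the Pod sitting in Init indefinitely.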

Image issues

A wrong tag, a private registry without credentials, or a corrupt image shows as ImagePullBackOff or ErrImagePull rather than CrashLoopBackOff, but the two often get confused. Verify the image and pull secret:

kubectl describe pod <pod-name> | grep -i image

Fix the tag, or attach an imagePullSecrets entry to the Pod or ServiceAccount.

Verify the Fix

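After editing the manifest, apply it. A change to the Pod template starts a new rollout on its own; run rollout restart only when the fix lives outside the template, such as an updated ConfigMap or Secret: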

kubectl apply -f deployment.yaml
kubectl rollout restart deployment/<deployment-name>
kubectl rollout status deployment/<deployment-name>

Watch the new Pods reach Running and confirm the restart count stays flat:

kubectl get pods -w
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'

A restart count that holds steady for several minutes means the loop is broken. For a deeper walkthrough of the root causes behind a stuck container, see the CrashLoopBackOff root-cause breakdown.
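Checking the count by hand a few minutes apart works, but it is easy to script. The sketch below samples the restart count twice and reports whether it moved. The function name restart_count_stable and the KUBECTL override variable are assumptions for illustration, and like the query above it looks only at the first container.

```shell
# Sample the restart count twice and succeed only if it did not change.
# usage: restart_count_stable <pod-name> <seconds-between-samples>
restart_count_stable() {
  pod=$1
  wait_seconds=$2
  query='{.status.containerStatuses[0].restartCount}'
  before=$("${KUBECTL:-kubectl}" get pod "$pod" -o jsonpath="$query")
  sleep "$wait_seconds"
  after=$("${KUBECTL:-kubectl}" get pod "$pod" -o jsonpath="$query")
  echo "restarts: $before -> $after"
  # Exit status 0 when the count held steady, non-zero otherwise.
  [ "$before" = "$after" ]
}
```

Use an interval longer than the five-minute backoff cap, for example restart_count_stable <pod-name> 600, so a Pod that is merely waiting out a backoff is not mistaken for a healthy one.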

Prevent It

Set both requests and limits for CPU and memory so the scheduler and kernel behave predictably.
Add a startupProbe for slow-starting apps and keep liveness probes lenient.
Test the image locally with docker run before deploying so entrypoint and command errors surface early.
Use initContainers to gate startup on real dependencies instead of crashing.
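Ship a real /healthz endpoint that checks the app can actually serve requests, not just that the process exists, and keep dependency checks in the readiness probe so an upstream outage does not trigger restarts.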
Pin image tags to digests in production so a re-tagged image cannot break a running Deployment.
