When a pod cannot be scheduled or has containers that are waiting, the pod will be stuck in the Pending state. Common reasons for a container to be waiting include an image that cannot be pulled (ErrImagePull, ImagePullBackOff), an application that keeps crashing (CrashLoopBackOff), or an error in the container's configuration (CreateContainerConfigError).
When troubleshooting a waiting container, make sure the spec for its pod is defined correctly. Use kubectl describe pod <pod name> to get detailed information on how and when the containers in that pod changed states.
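As a quick sketch, assuming a pod with the hypothetical name my-app, these commands surface the waiting reason and recent state changes:

```
# Show events and per-container state transitions for the pod
kubectl describe pod my-app

# Print just the current state of each container (waiting, running, or terminated)
kubectl get pod my-app -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# List recent events for the pod, which usually include the waiting reason
kubectl get events --field-selector involvedObject.name=my-app
```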
When a container runs out of memory, or OOMs, it is killed and restarted according to the pod's restart policy. With the default restart policy, Kubernetes will eventually back off on restarting the container if it restarts many times in a short time span.
To investigate a container that is going OOM, check that the memory request and limit in the pod spec are high enough for the running application. You can also debug your application's memory usage to figure out whether there is a slow memory leak, or whether there are other ways to reduce the container's memory usage.
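A minimal sketch for confirming the OOM kill and comparing current usage to the configured values, again assuming a pod named my-app; kubectl top requires the metrics-server addon to be installed:

```
# Confirm the last restart was caused by the OOM killer
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Expected output if the container was OOM killed: OOMKilled

# Check current memory usage (requires metrics-server)
kubectl top pod my-app

# Compare with the memory request and limit in the pod spec
kubectl get pod my-app -o jsonpath='{.spec.containers[0].resources}'
```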
In most cases, a container is expected to be long-lived. A restarting container can indicate problems with memory (see the Out of Memory section), CPU usage, or an application that exits prematurely.
If a container is being restarted because of CPU usage, try increasing the CPU request and limit in the pod spec. Remember that 1000m equals one virtual CPU on most providers. If the container does not always need a lot of CPU but has a bursty workload, you can set the request to a smaller value (e.g. 100m) and the limit higher (e.g. 500m) so the container can use the CPU it needs during bursts without permanently reserving a large amount of schedulable CPU.
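A rough sketch of such a bursty workload, using a hypothetical pod name and image, would look like this:

```
# Apply a pod whose CPU request is small but whose limit allows bursts
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: bursty-worker
spec:
  containers:
  - name: worker
    image: my-registry/worker:latest   # hypothetical image
    resources:
      requests:
        cpu: 100m        # what the scheduler reserves on the node
        memory: 128Mi
      limits:
        cpu: 500m        # allows short bursts up to half a core
        memory: 256Mi
EOF
```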
To debug an application exiting prematurely, it can be helpful to override the command the container is started with. You can set the command to something like sleep 10000 so that you can connect to the container with kubectl exec -it <pod name> -c <container name> -- /bin/sh, run the application manually, and check its exit code.
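One way to do this without editing the original controller, assuming a hypothetical image my-registry/worker:latest and entrypoint /app/run, is to start a throwaway pod:

```
# Start a debug pod that sleeps instead of running the application
kubectl run debug-worker --image=my-registry/worker:latest --restart=Never --command -- sleep 10000

# Open a shell inside it, run the application by hand, and inspect the exit code
kubectl exec -it debug-worker -- /bin/sh
# inside the container:
#   /app/run        # hypothetical entrypoint
#   echo $?         # a non-zero exit code explains why the container kept restarting

# Clean up when finished
kubectl delete pod debug-worker
```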
Blue Matador will detect when a pod could not be scheduled by checking for events in the Kubernetes cluster. A pod may be unschedulable for several reasons: no node may have enough free CPU or memory to satisfy the pod's requests, the pod's node selector or affinity rules may not match any node, node taints may not be tolerated by the pod, or a required volume may be restricted to a zone with no eligible nodes.
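A short sketch for surfacing the scheduler's reasoning, assuming the pending pod is named my-app:

```
# The Events section explains why the scheduler could not place the pod
kubectl describe pod my-app

# Or list FailedScheduling events across the namespace
kubectl get events --field-selector reason=FailedScheduling

# Check how much allocatable CPU and memory each node has left
kubectl describe nodes | grep -A5 "Allocated resources"
```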
In rare cases, it is possible for a pod to get stuck in the terminating state. This is detected by finding any pods where every container has been terminated, but the pod is still running. Usually, this is caused when a node in the cluster gets taken out of service abruptly, and the cluster scheduler and controller-manager do not clean up all of the pods on that node.
Solving this issue is as simple as manually deleting the pod using kubectl delete pod <pod name>.
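If a normal delete hangs because the node is gone, the pod object can be removed without waiting for the kubelet; this is a sketch, and forcing deletion should only be done when you are sure the node will not come back:

```
# Find pods stuck in Terminating
kubectl get pods --all-namespaces | grep Terminating

# Try a normal delete first
kubectl delete pod <pod name>

# If the pod remains, remove it immediately without waiting for the kubelet
kubectl delete pod <pod name> --grace-period=0 --force
```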
A PreStop hook can be used to execute commands in your pod before it is terminated. Your pod may save state, update settings, or signal other pods using PreStop hooks. Blue Matador detects when a PreStop hook fails and creates an anomaly with the relevant error messages, which should indicate why the hook failed. If the message indicates that the grace period for the pod was exceeded, you may need to specify a grace period higher than the default of 30 seconds by setting terminationGracePeriodSeconds in your pod controller.
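A minimal sketch of such a pod, with a hypothetical image and cleanup script, and the grace period raised above the default:

```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: prestop-test                   # stand-alone copy used only for debugging
spec:
  terminationGracePeriodSeconds: 60    # raised from the default 30 seconds
  containers:
  - name: app
    image: my-registry/app:latest      # hypothetical image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/app/save-state.sh"]   # hypothetical cleanup script
EOF
```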
Since this event usually happens during a deployment, you will want to debug it another way to avoid service disruptions. This can be accomplished by running a stand-alone pod with the same configuration, such as the one above, and then deleting it. Pods that exit normally or are Completed do not cause PreStop hooks to be called; only an external request to terminate the pod will cause the hook to execute.
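Continuing the sketch above, deleting the stand-alone pod sends it a termination signal, which runs the PreStop hook; the pod's events show whether the hook failed:

```
# Deleting the pod triggers the PreStop hook before the container is stopped
kubectl delete pod prestop-test

# FailedPreStopHook events, if any, contain the error message from the hook
kubectl get events --field-selector involvedObject.name=prestop-test
```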