    Azure Kubernetes Service (AKS) is a managed Kubernetes service for running containerized applications. This guide explains how to diagnose issues that commonly occur with AKS, identifies their likely causes, and recommends steps to resolve them.

    We monitor the following metrics:

    • azure.aks.cluster.health
    • azure.aks.node_pool_nodes.count
    • azure.aks.cluster_autoscaler.health
    • azure.aks.node_pool_max_nodes.count
    • azure.aks.node_pool_cpu_usage.percent
    • azure.aks.node_pool_disk_usage.percent
    • azure.aks.node_pool_memory_working_set.percent

    Blue Matador continuously monitors these metrics on your AKS clusters, creating Events whenever thresholds are exceeded or significant changes occur. 

    Common Errors when Monitoring Azure Kubernetes Service


    The issues below commonly affect the reliability and performance of AKS clusters. Each is listed with steps to diagnose and resolve it.

    Node Issues


    Possible Causes

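    • Node Not Ready
      • Confirm that the node is registered with the cluster and that its kubelet is running.
      • Check for network connectivity issues between the node and the control plane.
      • Verify that the node has sufficient CPU, memory, and disk to operate. Resource pressure is reported in the node's conditions (MemoryPressure, DiskPressure, PIDPressure).
    • Node Resource Exhaustion
      • Monitor CPU, memory, and disk usage on nodes (azure.aks.node_pool_cpu_usage.percent, azure.aks.node_pool_memory_working_set.percent, azure.aks.node_pool_disk_usage.percent).
      • Add nodes to the cluster to distribute the load.
      • Optimize workloads to reduce resource consumption.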
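
    The node checks above can be sketched in Python. This is a minimal illustration, not part of Blue Matador: it assumes you have already parsed the output of `kubectl get nodes -o json` into a dictionary, and the function names and sample node names are hypothetical.

```python
# Hypothetical sample shaped like `kubectl get nodes -o json`, trimmed to the fields used below.
SAMPLE_NODES = {
    "items": [
        {
            "metadata": {"name": "aks-pool1-0001"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "False"},
                {"type": "Ready", "status": "False", "reason": "KubeletNotReady"},
            ]},
        },
        {
            "metadata": {"name": "aks-pool1-0002"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "True"},
                {"type": "Ready", "status": "True", "reason": "KubeletReady"},
            ]},
        },
    ]
}

PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def find_unready_nodes(node_list):
    """Return (name, status, reason) for each node whose Ready condition is not "True"."""
    unready = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            unready.append((name, "Unknown", "no Ready condition reported"))
        elif ready.get("status") != "True":
            unready.append((name, ready.get("status", "Unknown"), ready.get("reason", "")))
    return unready


def find_pressure_conditions(node_list):
    """Return (name, condition type) for each resource-pressure condition that is active."""
    under_pressure = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") in PRESSURE_TYPES and cond.get("status") == "True":
                under_pressure.append((name, cond["type"]))
    return under_pressure


print(find_unready_nodes(SAMPLE_NODES))
print(find_pressure_conditions(SAMPLE_NODES))
```

    To run this against a live cluster, replace SAMPLE_NODES with `json.loads(...)` applied to the real `kubectl get nodes -o json` output.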

    Performance Degradation


    Possible Causes

    • High CPU or Memory Usage
      • Monitor CPU and memory usage at both the node and pod levels.
      • Optimize application code and configuration to reduce resource usage.
      • Scale out the application by adding more replicas.
    • Network Latency
      • Review network policies and network configuration for rules that block or misroute traffic.
      • Monitor network traffic and investigate any bottlenecks.
      • Ensure that services are properly configured for load balancing.
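
    As a rough illustration of the usage checks above, the sketch below converts Kubernetes resource quantities (such as "950m" CPU or "3584Mi" memory) into numbers and reports nodes whose usage is at or above a threshold. It assumes you supply per-node usage and allocatable values yourself, for example from the metrics API and from each node's `status.allocatable`. The function names, the 80% default threshold, and the sample values are hypothetical, and only a subset of quantity suffixes is handled.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes: binary (Ki..Ti), fractional (n, u, m), decimal (k..T).
_SUFFIXES = {
    "Ki": Decimal(2**10), "Mi": Decimal(2**20), "Gi": Decimal(2**30), "Ti": Decimal(2**40),
    "n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3"),
    "k": Decimal(10**3), "M": Decimal(10**6), "G": Decimal(10**9), "T": Decimal(10**12),
}


def parse_quantity(text):
    """Convert a quantity string such as "500m" or "4Gi" to a Decimal in base units."""
    # Try two-character suffixes first so "Mi" is not mistaken for a shorter suffix.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def usage_percent(used, allocatable):
    """Return used / allocatable as a percentage."""
    return float(parse_quantity(used) / parse_quantity(allocatable) * 100)


def hot_nodes(usage_by_node, allocatable_by_node, threshold=80.0):
    """Return (node, resource, percent) for each resource at or above the threshold."""
    hot = []
    for name, usage in usage_by_node.items():
        alloc = allocatable_by_node[name]
        for resource in ("cpu", "memory"):
            pct = usage_percent(usage[resource], alloc[resource])
            if pct >= threshold:
                hot.append((name, resource, round(pct, 1)))
    return hot


# Hypothetical sample values for two nodes.
SAMPLE_USAGE = {
    "aks-pool1-0001": {"cpu": "1710m", "memory": "2Gi"},
    "aks-pool1-0002": {"cpu": "950m", "memory": "3584Mi"},
}
SAMPLE_ALLOCATABLE = {
    "aks-pool1-0001": {"cpu": "1900m", "memory": "4Gi"},
    "aks-pool1-0002": {"cpu": "1900m", "memory": "4Gi"},
}

print(hot_nodes(SAMPLE_USAGE, SAMPLE_ALLOCATABLE))
```

    The same comparison works at the pod level if you pass pod usage and the pod's resource limits instead of node values.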

    Pod Issues


    Possible Causes

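    • Pod Pending
      • Check for resource requests that are larger than any node can satisfy. The scheduler places pods based on their requests, not their limits.
      • Ensure there are sufficient unreserved resources available in the cluster.
      • Check node selectors, affinity rules, and taints and tolerations that could prevent the pod from being scheduled.
    • Pod CrashLoopBackOff
      • Examine pod logs, including logs from the previous container instance, to identify the cause of the crashes.
      • Verify that all dependencies and configurations are correctly set.
      • Adjust resource requests and limits to prevent resource starvation. A container that exceeds its memory limit is terminated with the reason OOMKilled.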
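
    The pod checks above can be sketched as follows. This is an illustrative example, assuming the output of `kubectl get pods -o json` has already been parsed into a dictionary. The function name, pod names, and sample messages are hypothetical.

```python
# Hypothetical sample shaped like `kubectl get pods -o json`, trimmed to the fields used below.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "web-7d9f"},
            "status": {
                "phase": "Pending",
                "conditions": [{
                    "type": "PodScheduled",
                    "status": "False",
                    "reason": "Unschedulable",
                    "message": "0/3 nodes are available: 3 Insufficient cpu.",
                }],
            },
        },
        {
            "metadata": {"name": "worker-5c2a"},
            "status": {
                "phase": "Running",
                "containerStatuses": [{
                    "name": "worker",
                    "restartCount": 7,
                    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                }],
            },
        },
    ]
}


def diagnose_pods(pod_list):
    """Return (pod, problem, detail) for unschedulable pods and crash-looping containers."""
    findings = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})

        # A Pending pod that could not be scheduled carries the scheduler's explanation.
        if status.get("phase") == "Pending":
            for cond in status.get("conditions", []):
                if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                    detail = cond.get("message") or cond.get("reason", "")
                    findings.append((name, "Pending", detail))

        # A crash-looping container is in the waiting state with reason CrashLoopBackOff.
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                last = cs.get("lastState", {}).get("terminated") or {}
                detail = (
                    f"container {cs['name']} restarted {cs.get('restartCount', 0)} times, "
                    f"last exit reason: {last.get('reason', 'unknown')}"
                )
                findings.append((name, "CrashLoopBackOff", detail))
    return findings


for finding in diagnose_pods(SAMPLE_PODS):
    print(finding)
```

    A last exit reason of OOMKilled points toward raising the memory limit or reducing memory use, while a scheduler message about insufficient CPU or memory points toward lowering requests or adding nodes.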

    Following these troubleshooting steps and monitoring your Azure Kubernetes Service clusters with Blue Matador helps you catch performance and availability issues early and keep your Kubernetes infrastructure reliable and efficient.
