Backend errors occur when a connection cannot be established between a load balancer and the hosts that traffic is routed to. In the case of Classic load balancers, this is measured by the BackendConnectionErrors metric. For Application load balancers, it is measured by the TargetConnectionErrorCount metric.
When a backend connection fails, the load balancer retries the request against a backend instance or target. A high backend error count therefore adds latency for your clients and can cause requests to fail. Because each failed connection attempt is counted and the load balancer retries failed connections, the rate of backend errors can exceed the request rate. For Classic load balancers, the number of backend connection errors also includes any errors related to the health check.
The primary cause of connection errors is an instance that is overloaded and rejecting new connections. If the increase in connection errors coincides with an increase in the request rate, then overload is the likely culprit.
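One rough way to test whether errors track load is to correlate the two metric series over the same time window. The sketch below computes a Pearson correlation in plain Python. The sample numbers are hypothetical and stand in for per-minute datapoints exported from your monitoring system; the function name is an illustration, not part of any AWS tooling.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A flat series carries no trend to correlate against.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute samples, for illustration only.
request_counts = [100, 120, 400, 900, 950, 300]
connection_errors = [0, 1, 15, 80, 95, 5]

r = pearson(request_counts, connection_errors)
print(f"correlation between requests and connection errors: {r:.2f}")
```

A value close to 1 suggests the errors rise and fall with traffic, which points toward overload. A value near 0 suggests looking at the other causes described here, such as a closed port or a blocked security group.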
Another common cause of connection errors is that the load balancer’s listener routes traffic to a port that the instance is not listening on. This can happen if the process listening on the expected port dies unexpectedly, or if a firewall or security group does not allow access on the port. If the health check port is different from the listener port, this can occur even when the health checks succeed.
First, make sure that the backend instances are reachable. If you can access some of the instances via SSH, you can confirm that a process is listening on the expected ports by running netstat -tulp (run it as root to see the process names for sockets owned by other users). Remember to check every health check port and every listener port on the instance.
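If you would rather test the ports from another machine in the same network, a plain TCP connect attempt tells you whether something accepts connections. The following is a minimal sketch using Python's standard socket module; the function names and the result labels are choices made for this example.

```python
import socket


def check_port(host, port, timeout=2.0):
    """Attempt a TCP connection and return 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all, which often means a firewall or security group drops the traffic.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def check_ports(host, ports, timeout=2.0):
    """Check several ports on one host and return a {port: result} mapping."""
    return {port: check_port(host, port, timeout) for port in ports}
```

For example, check_ports("10.0.1.23", [80, 8080]) would test a listener port and a health check port on one instance; the address and ports here are placeholders. A "refused" result usually means the process is not running, while a "timeout" result is more typical of blocked traffic.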
Next, check whether the load balancer is allowed to reach the instances on the listener and health check ports by reviewing the security groups attached to the instances. Ensure that the inbound rules for the instances allow traffic from the load balancer's security group on each of those ports.
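The inbound rule check can be sketched as a small function. It assumes the rules are supplied in the shape of the IpPermissions list that EC2's DescribeSecurityGroups returns, and it only considers rules that reference a source security group, so CIDR-based rules are ignored. The group IDs and the function name are hypothetical.

```python
def allows_group_on_port(ip_permissions, source_group_id, port, protocol="tcp"):
    """Return True if any inbound rule admits `source_group_id` on `port`.

    Assumes `ip_permissions` follows the shape of EC2's IpPermissions list.
    CIDR-based rules (IpRanges) are deliberately ignored in this sketch.
    """
    for rule in ip_permissions:
        rule_protocol = rule.get("IpProtocol")
        if rule_protocol not in (protocol, "-1"):
            continue
        # Rules for all protocols ("-1") carry no port range and match every port.
        if rule_protocol != "-1":
            if not (rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)):
                continue
        for pair in rule.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == source_group_id:
                return True
    return False


# Hypothetical inbound rules for an instance security group.
instance_rules = [
    {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
     "UserIdGroupPairs": [{"GroupId": "sg-lb-example"}], "IpRanges": []},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "UserIdGroupPairs": [], "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
]

for port in (80, 8080):
    allowed = allows_group_on_port(instance_rules, "sg-lb-example", port)
    print(f"port {port}: {'allowed' if allowed else 'NOT allowed'} from load balancer group")
```

Running the check for every listener port and every health check port makes it easier to spot the case where the health check port is open but the listener port is not.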
Finally, if all instances are listening on the correct ports, and the load balancer is allowed to reach the instances on those ports, you can investigate further by using curl to send requests to specific instances. Once you have pinpointed which instances are causing the errors, check the web server access logs on those instances or take a packet capture to investigate why the TCP connections are being closed. In the packet capture, look for RST packets. If the instances are simply overloaded with requests, try adding more instances behind the load balancer until they can handle the required amount of traffic.
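When there are many instances to test, the per-instance requests can be scripted instead of run by hand with curl. The sketch below uses Python's urllib from the standard library, bypasses any configured HTTP proxy so that the request goes straight to the instance, and classifies each outcome. The function name and the result labels are choices made for this example.

```python
import socket
import urllib.error
import urllib.request

# Ignore proxy settings from the environment so requests reach the instance directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=3.0):
    """Send one GET request and classify the outcome as a short string."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return f"http {response.status}"
    except urllib.error.HTTPError as exc:
        # The instance answered, but with a 4xx or 5xx status.
        return f"http {exc.code}"
    except urllib.error.URLError as exc:
        reason = exc.reason
        if isinstance(reason, ConnectionRefusedError):
            return "refused"
        if isinstance(reason, ConnectionResetError):
            return "reset"
        if isinstance(reason, socket.timeout):
            return "timeout"
        return f"error: {reason}"
    except ConnectionResetError:
        # The connection was closed or reset while waiting for the response.
        return "reset"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

Calling probe("http://10.0.1.23:80/") for each instance address (the address shown is a placeholder) gives a quick picture of which instances refuse, reset, or time out. Instances that report "reset" are good candidates for the packet capture, since a reset corresponds to the RST packets mentioned above.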