Elastic Load Balancers route traffic to your application. You can generally expect a steady stream of requests to your load balancers, and even some 400s and 500s are normal. However, the rate at which your load balancers serve requests or produce 400s and 500s is a good indicator of application health. An anomaly in these metrics can signal problems before they become apparent in other parts of the application. Blue Matador automatically detects these anomalies and alerts you about them.
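If you want to pull these metrics yourself, a minimal sketch with boto3 might look like the following. The region and the load balancer dimension value (`app/my-alb/1234567890abcdef`) are placeholders; Classic load balancers use the `AWS/ELB` namespace with a `LoadBalancerName` dimension instead.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder dimension value -- use the "app/<name>/<id>" portion of your ALB's ARN.
ALB_DIMENSION = "app/my-alb/1234567890abcdef"

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=3)

for metric in ("RequestCount", "HTTPCode_ELB_5XX_Count"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName=metric,
        Dimensions=[{"Name": "LoadBalancer", "Value": ALB_DIMENSION}],
        StartTime=start,
        EndTime=end,
        Period=300,          # 5-minute buckets
        Statistics=["Sum"],  # both metrics are counts, so Sum is the meaningful statistic
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Sum"])
```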
A healthy application should see a relatively stable request rate. An anomalous increase or decrease in request count can signal a malfunctioning application. Possible causes of changes in request count include:
If the request count increased for legitimate reasons (more users or a new feature), you may need to register additional targets with your load balancer to handle the increased load.
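For an Application load balancer, adding capacity means registering more targets with the target group behind the listener. A minimal sketch, assuming a hypothetical target group ARN and instance ID:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Placeholder ARN and instance ID -- substitute your own target group and instances.
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/0123456789abcdef",
    Targets=[{"Id": "i-0123456789abcdef0"}],
)
```

In practice, if your targets are managed by an Auto Scaling group attached to the target group, raising the group's desired capacity accomplishes the same thing.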
When the rate of 4xx response codes increases, the problem most likely lies with a client making requests to your ELBs. Possible causes include:
When the rate of 5xx response codes increases, the problem is most likely a bug in your server-side code. The increase can often be tied to a specific code release, so your release schedule is the first place to look for clues about what went wrong. Other possible causes include:
Access logs are very helpful when diagnosing issues with ELBs. By default, ELB does not collect access logs, but it can be configured to send them to S3. You can then configure your log management tool (or download the files and use grep) to look for endpoints that are causing problems.
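Enabling access logs on an Application load balancer is a matter of setting a few load balancer attributes. A sketch with boto3, where the load balancer ARN, bucket name, and prefix are placeholders and the bucket must have a policy that allows Elastic Load Balancing to write to it:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef",
    Attributes=[
        {"Key": "access_logs.s3.enabled", "Value": "true"},
        {"Key": "access_logs.s3.bucket", "Value": "my-elb-access-logs"},  # placeholder bucket
        {"Key": "access_logs.s3.prefix", "Value": "my-alb"},              # optional prefix
    ],
)
```

Classic load balancers expose the same setting through the AccessLog attribute in the elb (not elbv2) API.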
For Classic load balancers, the Latency metric in CloudWatch measures the time it takes for a registered instance to send response headers after receiving a request from the load balancer. For Application load balancers, the equivalent CloudWatch metric is TargetResponseTime.
An increase in latency can indicate a performance issue with your application. If traffic patterns for your load balancer have not changed significantly, check whether a downstream service such as a database is experiencing high latency and propagating that delay to your web servers. If you have seen an increase in traffic, your instances may be overloaded, and adding capacity behind the load balancer may help. For low-traffic load balancers, it is also possible that the average latency is skewed by a few requests that take a very long time.
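Because a handful of slow requests can skew the average, percentile statistics are often more telling. Here is a sketch that pulls p50 and p99 of TargetResponseTime for an Application load balancer; the dimension value is a placeholder, and for a Classic load balancer you would use the `AWS/ELB` namespace and the Latency metric instead.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=3)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=300,
    ExtendedStatistics=["p50", "p99"],  # percentiles cannot be combined with plain Statistics in one call
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    stats = point["ExtendedStatistics"]
    print(point["Timestamp"], "p50=%.3fs" % stats["p50"], "p99=%.3fs" % stats["p99"])
```

If the p50 is steady but the p99 climbs, a small subset of requests (often a specific endpoint or downstream call) is responsible rather than the application as a whole.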
For Application load balancers, the ProcessedBytes metric measures the total number of bytes going in and out of the load balancer. A change in this metric can be caused by two things:
Anomalies in bytes processed are mostly useful for correlating with other issues.
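For example, pulling ProcessedBytes alongside RequestCount over the same window makes it easy to see whether a spike comes from more requests or simply larger payloads. A sketch, with the dimension value again a placeholder:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
dimensions = [{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}]  # placeholder

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=3)

series = {}
for metric in ("ProcessedBytes", "RequestCount"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    series[metric] = {p["Timestamp"]: p["Sum"] for p in resp["Datapoints"]}

# Average bytes per request in each 5-minute bucket; a jump here with a flat
# request count suggests larger payloads rather than more traffic.
for ts in sorted(series["ProcessedBytes"]):
    requests = series["RequestCount"].get(ts)
    if requests:
        print(ts, "bytes/request = %.0f" % (series["ProcessedBytes"][ts] / requests))
```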