This documentation provides guidance on monitoring Step Functions and outlines possible error causes along with recommended resolutions. Step Functions is AWS's visual workflow service that allows companies to build distributed applications. Through Step Functions, organizations can use multiple AWS Lambda functions to automate processes, orchestrate serverless applications and microservices, and create data and machine learning pipelines.
We monitor for the following CloudWatch metrics:
- ActivitiesStarted
- ActivitiesTimedOut
- ExecutionsStarted
- ExecutionsTimedOut
- LambdaFunctionsStarted
- LambdaFunctionsTimedOut
- ConsumedCapacity
- ProvisionedBucketSize
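All of these metrics live in the AWS/States CloudWatch namespace. As a minimal sketch of how one of them can be pulled programmatically with boto3 (the state machine ARN is a placeholder; this illustrates the CloudWatch API, not how Blue Matador itself collects data):

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder ARN; substitute your own state machine.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:example"

now = datetime.datetime.now(datetime.timezone.utc)

# Pull hourly ExecutionsStarted datapoints for the last day.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/States",
    MetricName="ExecutionsStarted",
    Dimensions=[{"Name": "StateMachineArn", "Value": STATE_MACHINE_ARN}],
    StartTime=now - datetime.timedelta(days=1),
    EndTime=now,
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```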
Execution Time Monitor
The interval, in milliseconds, between the time an execution starts and the time it ends. Blue Matador looks for outliers in this metric.
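For intuition, here is a minimal sketch of one way outliers in ExecutionTime could be flagged, using a simple z-score over a week of hourly averages (the ARN is a placeholder, and this is only an illustration, not Blue Matador's actual detection algorithm):

```python
import datetime
import statistics

import boto3

cloudwatch = boto3.client("cloudwatch")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:example"

now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/States",
    MetricName="ExecutionTime",
    Dimensions=[{"Name": "StateMachineArn", "Value": STATE_MACHINE_ARN}],
    StartTime=now - datetime.timedelta(days=7),
    EndTime=now,
    Period=3600,
    Statistics=["Average"],
)

samples = [p["Average"] for p in resp["Datapoints"]]
if len(samples) >= 2:
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # Flag datapoints more than 3 standard deviations above the mean.
    outliers = [s for s in samples if stdev > 0 and (s - mean) / stdev > 3]
    print(f"{len(outliers)} outlier datapoint(s) out of {len(samples)}")
```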
Prolonged Execution Time
Possible Causes
- Large Payloads
- Optimize your workflow by minimizing the size of input and output payloads; see the payload-trimming sketch after this list.
- Network Latency
- Investigate potential network issues between your Step Functions and other AWS services. Optimize the placement of resources for reduced latency.
- Resource Constraints
- Check the resource allocation for your Step Functions. Increase the memory or adjust the timeout settings if necessary.
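One common way to shrink payloads is to filter task output in the state machine definition itself, using the Amazon States Language fields ResultSelector, ResultPath, and OutputPath. A minimal sketch, expressed as a Python dict for readability (the function name and field names inside the payload are illustrative):

```python
import json

# A Task state that keeps only the fields downstream states actually need.
# "ResultSelector" picks fields out of the raw Lambda response, and
# "OutputPath" drops everything else from the state's output.
trimmed_task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "FunctionName": "example-function",  # illustrative name
        "Payload.$": "$",
    },
    "ResultSelector": {
        "status.$": "$.Payload.status",
        "recordId.$": "$.Payload.recordId",
    },
    "ResultPath": "$.taskResult",
    "OutputPath": "$.taskResult",
    "End": True,
}

print(json.dumps(trimmed_task, indent=2))
```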
Executions Failed Monitor
Monitoring and addressing anomalies in failed executions of AWS Step Functions is crucial for maintaining the reliability and resilience of your serverless workflows.
High Number of Failed Executions
Possible Causes
- Permissions Issues
- Verify that the IAM roles associated with your Step Functions have the necessary permissions to access resources and perform required actions. The sketch after this list shows how to read the error and cause from a failed execution's history, which often reveals permissions errors directly.
- Invalid Input
- Review the input provided to your Step Functions. Ensure that it adheres to the expected format and values.
- Faulty State Logic
- Review the logic within the states where failures are occurring. Correct any issues with state transitions or logic that might lead to failures.
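To determine which of these causes produced a given failure, the execution history records the terminal error and cause. A minimal diagnostic sketch using boto3 (the execution ARN is a placeholder):

```python
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN; substitute a failed execution of your own.
EXECUTION_ARN = "arn:aws:states:us-east-1:123456789012:execution:example:run-1"

# Walk the history newest-first and print the first failure event found.
history = sfn.get_execution_history(
    executionArn=EXECUTION_ARN, reverseOrder=True, maxResults=50
)
for event in history["events"]:
    details = event.get("executionFailedEventDetails") or event.get(
        "taskFailedEventDetails"
    )
    if details:
        print("error:", details.get("error"))
        print("cause:", details.get("cause"))
        break
```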
State Transition Throttling Quota Monitor
Step Functions imposes a quota on the number of state transitions, which depends on your bucket size and refill rate per second.
AWS commonly limits the number of requests or operations you can perform on certain services within a specific time frame, and this limit is often expressed as a rate, such as requests per second. For instance, if you're using AWS API Gateway, there might be a rate limit on the number of requests your API can handle per second.
If the quota is reached, executions begin to be throttled. To monitor this throttling, we detect whether the quota has been reached using the ConsumedCapacity and ProvisionedBucketSize metrics.
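The bucket size and refill rate behave like a classic token bucket. A minimal simulation to make the mechanics concrete (the capacity and refill numbers below are illustrative, not your account's actual quota values):

```python
import time


class TokenBucket:
    """Illustrative token bucket: capacity = bucket size, refill_rate = tokens/second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(
            self.capacity, self.tokens + (now - self.last) * self.refill_rate
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # this transition would be throttled


bucket = TokenBucket(capacity=5000, refill_rate=1500)
allowed = sum(bucket.allow() for _ in range(10000))
print(f"{allowed} of 10000 back-to-back transitions allowed before throttling")
```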
Excessive State Transition Throttling
Possible Causes
- Quota Exhaustion
- Review your state transition quotas and adjust them based on the workload requirements. Consider requesting a quota increase if the current limits are consistently reached; see the Service Quotas sketch after this list.
- Rapid State Transitions
- Analyze your workflow for patterns of rapid state transitions. Optimize your workflow logic to reduce unnecessary or rapid state transitions.
- Bucket Size Mismatch
- Ensure that the configured bucket size aligns with the actual workload. Adjust the bucket size to accommodate the expected number of state transitions.
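Quota increases for Step Functions go through AWS Service Quotas. A minimal sketch that lists the Step Functions quotas and their current values so you can identify the one to raise (this assumes the Step Functions service code is "states"):

```python
import boto3

quotas = boto3.client("service-quotas")

# Page through every quota registered for the Step Functions service code.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="states"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaCode"]}: {quota["QuotaName"]} = {quota["Value"]}')
```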
Inconsistent State Transition Throttling
Possible Causes
- Variable Workloads
- Review your application's usage patterns. If the workload varies significantly, adjust the state transition quotas dynamically to accommodate changing demands; the retry-with-backoff sketch after this list can also help absorb short bursts.
- Rate Limiting Misconfigurations
- Check for misconfigurations in your state machine definitions. Verify that the rate limits are set appropriately for each state to avoid unnecessary throttling.
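When throttling is intermittent, callers can smooth it out by retrying throttled StartExecution calls with exponential backoff. A minimal sketch (the ARN is a placeholder, and the specific throttling error codes checked are assumptions; verify them against your SDK version):

```python
import json
import random
import time

import boto3
from botocore.exceptions import ClientError

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:example"


def start_with_backoff(payload: dict, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                input=json.dumps(payload),
            )
        except ClientError as err:
            code = err.response["Error"]["Code"]
            # Assumed throttling-related error codes; anything else is re-raised.
            if code not in ("ThrottlingException", "ExecutionLimitExceeded"):
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus noise.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("start_execution still throttled after retries")
```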