Amazon Elasticsearch exposes the cluster’s overall health in many ways. Blue Matador monitors several indicators for a cluster’s overall health and helps you correlate issues and minimize downtime of your Elasticsearch cluster.
Health status is expressed by three colors: green, yellow, and red. A green status means that all primary shards and their replicas are allocated to nodes. A yellow status means that all primary shards are allocated to nodes, but some replicas are not. A red status means at least one primary shard is not allocated to any node.
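You can read this status directly from the cluster health API as a quick check alongside CloudWatch. Below is a minimal sketch in Python, assuming a placeholder domain endpoint of https://my-domain.us-east-1.es.amazonaws.com and that your client can reach the domain without request signing (if your access policy requires IAM-signed requests, you would need to sign them, for example with requests-aws4auth):

```python
import requests

# Placeholder endpoint for your Amazon Elasticsearch domain
ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

# _cluster/health reports the overall status plus shard-level counters
health = requests.get(f"{ES_ENDPOINT}/_cluster/health").json()

print(health["status"])                         # "green", "yellow", or "red"
print(health["unassigned_shards"])              # shard copies not allocated to any node
print(health["active_shards_percent_as_number"])
```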
A common cause of a yellow status is not having enough nodes in the cluster for the primary and replica shards. For example, if you had a 3-node cluster and created an index with 1 primary shard and 3 replicas, your cluster would be in a yellow state. This is because the primary shard can be allocated, but only 2 of the 3 replicas can be allocated.
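To make that example concrete, the sketch below (using the same placeholder endpoint) creates an index with 1 primary shard and 3 replicas, which would leave a 3-node cluster yellow, and then lowers the replica count to 2 so every copy can find a node:

```python
import requests

ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder

# 1 primary + 3 replicas = 4 copies of the shard, which cannot all be placed
# on a 3-node cluster (copies of the same shard never share a node)
requests.put(
    f"{ES_ENDPOINT}/my-index",
    json={"settings": {"number_of_shards": 1, "number_of_replicas": 3}},
)

# Reducing replicas to 2 lets every copy be allocated and clears the yellow status
requests.put(
    f"{ES_ENDPOINT}/my-index/_settings",
    json={"index": {"number_of_replicas": 2}},
)
```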
A red cluster status can be a more severe issue. While a cluster is red, automatic snapshots will not take place. A red cluster is ultimately caused by an index that does not have all of its primary shards allocated. Two Elasticsearch APIs are extremely useful when debugging a red cluster: GET /_cluster/allocation/explain and GET /_cat/indices?v. These will help you identify the problematic indices, which you can then delete or reconfigure; alternatively, you can add or replace nodes in your cluster, or restore from a snapshot. The proper action depends on the data contained in the index and your tolerance for how long the cluster can stay red.
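A rough sketch of those two calls follows, again against a placeholder endpoint; the health=red filter on _cat/indices narrows the listing to the indices keeping the cluster red:

```python
import requests

ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder

# Explains why the first unassigned shard it finds cannot be allocated
explain = requests.get(f"{ES_ENDPOINT}/_cluster/allocation/explain").json()
print(explain.get("index"), explain.get("unassigned_info", {}).get("reason"))

# Lists only red indices so you can see which ones are missing primary shards
print(requests.get(f"{ES_ENDPOINT}/_cat/indices?v&health=red").text)
```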
An Elasticsearch cluster that is blocking writes is almost always correlated with another issue in the cluster. A cluster in this state is blocking the creation of new indices or documents for all or part of the cluster. There are two common scenarios in which this can occur: low available storage space and high JVM pressure. In general, use the steps above for a red cluster to figure out which indices may be blocking writes, and ensure your master nodes are healthy.
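One specific thing worth checking in the low-storage case is whether the disk-based read_only_allow_delete block has been applied to your indices, which Elasticsearch sets when a node crosses its flood-stage disk watermark. A hedged sketch, assuming you have already freed space or added storage before clearing the block:

```python
import requests

ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder

# Look for index.blocks.read_only_allow_delete=true, the block Elasticsearch
# applies when a node runs critically low on disk space
settings = requests.get(f"{ES_ENDPOINT}/_all/_settings?flat_settings=true").json()
blocked = [
    name for name, body in settings.items()
    if body["settings"].get("index.blocks.read_only_allow_delete") == "true"
]
print("Indices blocking writes:", blocked)

# After freeing or adding storage, clear the block by setting it back to null
requests.put(
    f"{ES_ENDPOINT}/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)
```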
When your Elasticsearch domain reports fewer nodes in the Nodes CloudWatch metric than are configured for a significant amount of time, your cluster may be unhealthy. This can occur if a node fails or as a result of a configuration change to the cluster.
Since Amazon Elasticsearch can take a considerable amount of time to apply changes to a domain, the node count may not match your expectations while changes are being applied. You might see a node count higher than what you've configured while data transfers from the old nodes to the new ones, or fewer nodes if some nodes failed to start correctly. Enable error logs in your domain to troubleshoot any persistent issues with node count, or open a support ticket with AWS.
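A sketch of pulling the Nodes metric with boto3 follows, assuming a placeholder domain named my-domain; AWS/ES metrics are dimensioned by DomainName and ClientId (your AWS account ID):

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="Nodes",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},   # placeholder domain
        {"Name": "ClientId", "Value": "123456789012"},  # placeholder account ID
    ],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)

# Compare the minimum node count over the window with what you configured
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```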
Master reachability indicates whether the master node is responsive. It is measured by the MasterReachableFromNode CloudWatch metric; a value of 0 indicates that requests to /_cluster/health/ from a node are failing. When using dedicated master nodes, a new master should be elected and the old one replaced automatically. If the issue persists, you can open a support ticket with AWS.
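If you want CloudWatch itself to flag this condition, here is a hedged sketch of an alarm on that metric (the alarm name, domain, account ID, and SNS topic are placeholders you would substitute):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when the minimum of MasterReachableFromNode drops below 1,
# i.e. at least one node could not reach the elected master
cloudwatch.put_metric_alarm(
    AlarmName="es-master-unreachable",                  # placeholder name
    Namespace="AWS/ES",
    MetricName="MasterReachableFromNode",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},   # placeholder domain
        {"Name": "ClientId", "Value": "123456789012"},  # placeholder account ID
    ],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:es-alerts"],  # placeholder topic
)
```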
Amazon Elasticsearch supports data encryption using KMS keys. If your cluster has KMS enabled, there are two ways in which your data could become inaccessible.
If your KMS key has been disabled, then the KMSKeyError metric in CloudWatch indicates that your data can no longer be read by the cluster. Re-enabling the key will return your cluster to normal.
If your KMS key has been deleted, then the KMSKeyInaccessible metric in CloudWatch indicates that the data in the domain is permanently unavailable. The only way to recover from this state is to restore a manual snapshot of your Elasticsearch data into a new domain.
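For the disabled-key case, re-enabling is a single KMS call. A sketch, assuming a placeholder key alias of alias/my-es-key and IAM permission to manage that key:

```python
import boto3

kms = boto3.client("kms", region_name="us-east-1")

KEY_ID = "alias/my-es-key"  # placeholder alias or key ID

# KeyState reports whether the key is Enabled, Disabled, or PendingDeletion
state = kms.describe_key(KeyId=KEY_ID)["KeyMetadata"]["KeyState"]
print("Key state:", state)

if state == "Disabled":
    # Re-enabling the key lets the domain read its data again (the KMSKeyError case)
    kms.enable_key(KeyId=KEY_ID)
elif state == "PendingDeletion":
    # Deletion can still be cancelled during the waiting period; once the key is
    # actually deleted, only a manual snapshot restored to a new domain can save the data
    kms.cancel_key_deletion(KeyId=KEY_ID)
```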