This project lets you deploy an EKS cluster in your AWS account and trigger various failure modes through a test client, to demonstrate how DevOps Guru works with a Kubernetes cluster.
In order to operate this test harness you will need the following:
- A computer with a Unix-like operating system (GNU/Linux or macOS) and a shell (bash, dash, zsh)
- An AWS account onboarded to Amazon DevOps Guru in one of the supported regions
- Gradle
- Python 3.6+ with the pip utility
- Docker
- kubectl
- eksctl
- AWS CLI v2 (v1 is not supported)
- Helm
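The prerequisites above can be checked before provisioning with a short shell sketch like the one below. The function name missing_tools is hypothetical and not part of this repository, and the binary names in the usage comment (python3, pip3, aws) are assumptions about how the tools are installed on your system.

```shell
# Print the name of each given command that is not found on PATH.
# Hypothetical helper, not part of the repository.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (binary names are assumptions; adjust to your setup):
#   missing_tools gradle python3 pip3 docker kubectl eksctl aws helm
```

An empty result means every listed tool was found. The sketch does not verify versions, so check the output of `aws --version` separately to confirm that AWS CLI v2 is installed.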
To provision the cluster and install all the necessary components:
- Authenticate to your AWS account using credentials that have permission to create and modify resources.
aws configure
- Run the bootstrap script in the root folder of the repository.
./bootstrap.sh
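Before running the bootstrap script, it can help to confirm that the AWS CLI is actually authenticated. The following is a minimal sketch that relies only on the standard `aws sts get-caller-identity` command; the function name check_aws_identity is hypothetical and not part of this repository.

```shell
# Return success only if the AWS CLI can resolve the current identity.
# Hypothetical helper, not part of the repository.
check_aws_identity() {
    if aws sts get-caller-identity >/dev/null 2>&1; then
        return 0
    fi
    echo "AWS credentials are missing or invalid; run 'aws configure' first." >&2
    return 1
}

# Example usage:
#   check_aws_identity && ./bootstrap.sh
```

This only confirms that the credentials resolve to an identity. It does not check whether that identity has enough permissions to provision the cluster.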
If you would like to inspect the contents of the deployed EKS cluster, start the kubectl proxy using the script in the root of the repository:
./start_proxy.sh
This will allow you to view:
To stop the proxy process, run:
./stop_proxy.sh
To get an access token for the Kubernetes dashboard, run:
./get_dashboard_token.sh
Before running tests, make sure that your cluster has been running for at least 60 minutes, so that DevOps Guru has a chance to ingest and index all the metrics.
To run test cases, make sure you have a Python 3.6+ interpreter installed, then run:
./run_test.sh <test_name>
Currently supported test scenarios:
- alb_4xx - triggers a series of 4XX errors in the test API, producing ApplicationELB HTTPCode_Target_4XX_Count Anomalous insights in DevOps Guru. Keep in mind that the insight can take 15-20 minutes to appear.
- alb_5xx - triggers a series of 5XX errors in the test API, producing ApplicationELB HTTPCode_Target_5XX_Count Anomalous insights in DevOps Guru. Keep in mind that the insight can take 15-20 minutes to appear.
- stop_instance - stops one of the underlying EC2 instances in the EKS node group, producing the ContainerInsights cluster_failed_node_count Anomalous In Stack eksctl-DevOpsGuruTestCluster-cluster insight in DevOps Guru.
- restart_instance - restarts all the underlying EC2 instances in the EKS node group, ending the anomaly caused by stop_instance.
- enable_cpu_stress_test - enables CPU stress test mode, which brings overall cluster CPU utilization above 90%. After 30 minutes, this produces an anomaly that does not create a separate insight but is shown as part of the alb_5xx, alb_4xx and stop_instance insights. Before enabling this mode, make sure that the cluster has been running for at least 60 minutes to establish a utilization baseline.
- disable_cpu_stress_test - disables the CPU stress test mode enabled by enable_cpu_stress_test.
- trigger_pod_crash - installs a misconfigured deployment that induces a rolling pod crash due to a failing probe, to demonstrate pod_number_of_container_restarts insights.
- disable_pod_crash - restores the normal deployment configuration after trigger_pod_crash.
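The scenario names listed above can be validated before they are passed to run_test.sh, which helps catch typos early. This is a minimal sketch: the function name is_supported_test is hypothetical, and nothing is assumed about how run_test.sh itself handles unknown names.

```shell
# Succeed only if the argument is one of the scenario names listed above.
# Hypothetical helper, not part of the repository.
is_supported_test() {
    case "$1" in
        alb_4xx | alb_5xx) return 0 ;;
        stop_instance | restart_instance) return 0 ;;
        enable_cpu_stress_test | disable_cpu_stress_test) return 0 ;;
        trigger_pod_crash | disable_pod_crash) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage:
#   is_supported_test "$name" && ./run_test.sh "$name"
```

The list is hard-coded here, so it needs updating if scenarios are added or removed.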
Anomalous metric values can be confirmed in the CloudWatch console, and the anomalies produced by DevOps Guru can be seen in the DevOps Guru console.
To clean up test harness resources from your account, run:
./cleanup.sh
If the cleanup script fails, you can attempt to manually delete the CloudFormation stack named eksctl-DevOpsGuruTestCluster-cluster.
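Manual deletion can be done from the CloudFormation console or from the command line. The sketch below uses the standard `aws cloudformation delete-stack` and `aws cloudformation wait stack-delete-complete` commands; the function name delete_stack_and_wait is hypothetical and not part of this repository.

```shell
# Delete a CloudFormation stack and block until the deletion completes.
# Hypothetical helper, not part of the repository.
delete_stack_and_wait() {
    aws cloudformation delete-stack --stack-name "$1" &&
        aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Example usage:
#   delete_stack_and_wait eksctl-DevOpsGuruTestCluster-cluster
```

The commands run against the region and credentials currently configured for the AWS CLI, so make sure these match the ones used for provisioning.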