Skip to content

Latest commit

 

History

History
78 lines (61 loc) · 4.11 KB

README.md

File metadata and controls

78 lines (61 loc) · 4.11 KB

DevOps Guru EKS Test Harness

This project allows one to deploy an EKS cluster in their account and trigger various failure modes via a test client, in order to demonstrate functionality of DevOps Guru in a context of Kubernetes cluster.

Requirements

In order to operate this test harness you will need the following:

Installing the harness

In order to provision the cluster and install all the necessary elements:

  • Authenticate into your AWS account using credentials that have mutating permissions.
aws configure
  • Run the bootstrap script in the root folder of the repository.
./bootstrap.sh

Inspecting the cluster

If you would like to inspect the content of deployed EKS cluster, start kubectl proxy via the script in the root of the repository

./start_proxy.sh

This will allow you to view:

In order to stop the proxy process, run

./stop_proxy.sh

In order to get access token for Kubernetes dashboard, run

./get_dashboard_token.sh

Running tests

Before running tests, please make sure that your cluster has been running for at least 60 minutes, to give DevOps Guru a chance to ingest and index all the metrics.

In order to run test cases, make sure you have Python 3.6+ interpreter installed and run:

./run_test.sh <test_name>

Currently supported tests scenarios:

  • alb_4xx - triggers a series of 4XX errors in test API, producing ApplicationELB HTTPCode_Target_4XX_Count Anomalous insights in DevOps Guru. Please keep in mind, that this can take up to 15-20 minutes to trigger.
  • alb_5xx triggers a series of 5XX errors in test API, producing ApplicationELB HTTPCode_Target_5XX_Count Anomalous insights in DevOps Guru. Please keep in mind, that this can take up to 15-20 minutes to trigger.
  • stop_instance - stops one of underlying EC2 instances in EKS node group, producing ContainerInsights cluster_failed_node_count Anomalous In Stack eksctl-DevOpsGuruTestCluster-cluster insight in DevOps Guru.
  • restart_instance - restarts all the underlying EC2 instances in EKS node group, ending the anomaly caused by stop_instance.
  • enable_cpu_stress_test - enables CPU stress test mode, which brings overall cluster CPU utilization to above 90%. After 30 minutes, this produces an anomaly, which does not produce a separate insight, but will be shown as a part of alb_5xx, alb_4xx and stop_instance insights. Before enabling this mode, make sure that the cluster has been running for at least 60 minutes to establish baseline for utilization.
  • disable_cpu_stress_test - disables CPU stress test mode mentioned in enable_cpu_stress_test
  • trigger_pod_crash - installs a misconfigured deployment that induces a rolling pod crash due to a failing probe to demonstrate pod_number_of_container_restarts insights
  • disable_pod_crash - restores normal deployment configuration after trigger_pod_crash

Anomalous metric values can be confirmed via CloudWatch console, and DevOps Guru produced anomalies can be seen in DevOps Guru console.

Cleaning up test resources

In order to clean up test harness resources from your account you can run:

./cleanup.sh

In case the cleanup script fails, you can attempt manual deletion of CloudFormation stack names eksctl-DevOpsGuruTestCluster-cluster.