Chaos and resiliency testing tool for Kubernetes and OpenShift
Kraken injects deliberate failures into Kubernetes/OpenShift clusters to check whether they are resilient to those failures.
$ pip3 install -r requirements.txt
Set the scenarios to inject, and tunings such as the duration to wait between scenarios, in the config file located at config/config.yaml. Kraken uses the powerfulseal tool for pod-based scenarios. A sample config looks like:
kraken:
    kubeconfig_path: /root/.kube/config                    # Path to kubeconfig
    scenarios:                                             # List of policies/chaos scenarios to load
        - scenarios/etcd.yml
        - scenarios/openshift-kube-apiserver.yml
        - scenarios/openshift-apiserver.yml
    node_scenarios:                                        # List of chaos node scenarios to load
        - scenarios/node_scenarios_example.yml
tunings:
    wait_duration: 60                                      # Duration to wait between each chaos scenario
$ python3 run_kraken.py --config <config_file_location>
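To illustrate how the `scenarios` list and the `wait_duration` tuning fit together, here is a minimal sketch of a run loop. It is not Kraken's actual implementation: `run_scenarios`, the `inject` hook, and the injectable `sleep` are hypothetical names, and the sketch assumes `tunings` sits at the top level of the config and that the wait happens only between scenarios.

```python
import time
from typing import Callable, Dict, List

# Mirrors the sample config above; the nesting (tunings at top level) is an assumption.
SAMPLE_CONFIG: Dict = {
    "kraken": {
        "kubeconfig_path": "/root/.kube/config",
        "scenarios": [
            "scenarios/etcd.yml",
            "scenarios/openshift-kube-apiserver.yml",
            "scenarios/openshift-apiserver.yml",
        ],
    },
    "tunings": {"wait_duration": 60},
}


def run_scenarios(
    config: Dict,
    inject: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Inject each configured scenario in order, waiting between injections.

    `inject` is a hypothetical hook that performs the actual injection for one
    scenario file; `sleep` is injectable so the loop can be exercised quickly.
    """
    scenarios = config.get("kraken", {}).get("scenarios", [])
    wait_duration = config.get("tunings", {}).get("wait_duration", 60)
    injected: List[str] = []
    for index, path in enumerate(scenarios):
        if index > 0:
            # Give the cluster time to settle before the next scenario.
            sleep(wait_duration)
        inject(path)
        injected.append(path)
    return injected
```

Passing a no-op `sleep` (for example `lambda seconds: None`) makes it possible to dry-run the ordering without waiting.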
Assuming that Docker 17.05 or later (with multi-stage build support) is installed on the host, run:
$ docker pull quay.io/openshift-scale/kraken:latest
$ docker run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config -v <path_to_kraken_config>:/root/kraken/config/config.yaml -d quay.io/openshift-scale/kraken:latest
$ docker logs -f kraken
Similarly, podman can be used to achieve the same:
$ podman pull quay.io/openshift-scale/kraken
$ podman run --name=kraken --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_kraken_config>:/root/kraken/config/config.yaml:Z -d quay.io/openshift-scale/kraken:latest
$ podman logs -f kraken
If you want to build your own Kraken image, see here
The report is generated in the run directory and records each chaos scenario injection along with its timestamp.
Cerberus can be used to monitor the cluster under test, and Kraken can consume the aggregated go/no-go signal it generates to determine pass/fail. This ensures that the Kubernetes/OpenShift environment is healthy at the cluster level, not just at the level of the targeted components. It is highly recommended to turn on the Cerberus health check feature available in Kraken after installing and setting up Cerberus. To do that, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the config file.
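As a rough illustration of consuming the go/no-go signal, the sketch below polls a URL and maps the response to a boolean. It assumes the endpoint returns the literal text `True` or `False` as its body; the function names `parse_cerberus_signal` and `cluster_is_healthy` are hypothetical and not part of Kraken or Cerberus.

```python
from urllib.request import urlopen


def parse_cerberus_signal(body: bytes) -> bool:
    """Interpret a response body as go (True) or no-go (False).

    Assumption: the signal is published as the literal text "True" or "False".
    Anything other than "True" is treated as no-go.
    """
    return body.decode("utf-8").strip() == "True"


def cluster_is_healthy(cerberus_url: str, timeout: float = 10.0) -> bool:
    """Fetch the signal from cerberus_url and return the parsed result."""
    with urlopen(cerberus_url, timeout=timeout) as response:
        return parse_cerberus_signal(response.read())
```

Treating every unexpected body as no-go is a deliberately conservative choice: a chaos run should not be reported as passing when the health signal cannot be read.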
Kraken currently supports only pod-based and node-based scenarios; more will be added soon.
The following node chaos scenarios are supported:
- node_start_scenario: scenario to start the node instance.
- node_stop_scenario: scenario to stop the node instance.
- node_stop_start_scenario: scenario to stop and then start the node instance.
- node_termination_scenario: scenario to terminate the node instance.
- node_reboot_scenario: scenario to reboot the node instance.
- stop_kubelet_scenario: scenario to stop the kubelet of the node instance.
- stop_start_kubelet_scenario: scenario to stop and start the kubelet of the node instance.
- node_crash_scenario: scenario to crash the node instance.
NOTE: If the node doesn't recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
NOTE: node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported only on AWS as of now.
NOTE: With AWS as the cloud type, make sure the AWS CLI is installed.
Node scenarios can be injected by listing the node scenario config files under the node_scenarios option in the Kraken config. Refer to the node_scenarios_example config file.
node_scenarios:
  - actions:                                                        # node chaos scenarios to be injected
    - node_stop_start_scenario
    - stop_start_kubelet_scenario
    - node_crash_scenario
    node_name:                                                      # node on which scenario has to be injected
    label_selector: node-role.kubernetes.io/worker                  # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
    instance_kill_count: 1                                          # number of times to inject each scenario under actions
    timeout: 120                                                    # duration to wait for completion of node scenario injection
    cloud_type: aws                                                 # cloud type on which Kubernetes/OpenShift runs
  - actions:
    - node_reboot_scenario
    node_name:
    label_selector: node-role.kubernetes.io/infra
    instance_kill_count: 1
    timeout: 120
    cloud_type: aws
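One way to read the config above is as a plan: each action is injected instance_kill_count times against either the named node or a node matching the label selector. The sketch below expands a parsed config into such a plan and rejects AWS-only actions on other cloud types. It is an illustration under those assumptions rather than Kraken's own code; `expand_node_scenarios` and `AWS_ONLY_ACTIONS` are hypothetical names, with the AWS-only set taken from the note above.

```python
from typing import Dict, List, Tuple

# Actions the notes above list as supported only on AWS.
AWS_ONLY_ACTIONS = {
    "node_start_scenario",
    "node_stop_scenario",
    "node_stop_start_scenario",
    "node_termination_scenario",
    "node_reboot_scenario",
    "stop_start_kubelet_scenario",
}

# The example config above, as it would look once parsed from YAML.
SAMPLE_NODE_CONFIG: Dict = {
    "node_scenarios": [
        {
            "actions": [
                "node_stop_start_scenario",
                "stop_start_kubelet_scenario",
                "node_crash_scenario",
            ],
            "node_name": None,
            "label_selector": "node-role.kubernetes.io/worker",
            "instance_kill_count": 1,
            "timeout": 120,
            "cloud_type": "aws",
        },
        {
            "actions": ["node_reboot_scenario"],
            "node_name": None,
            "label_selector": "node-role.kubernetes.io/infra",
            "instance_kill_count": 1,
            "timeout": 120,
            "cloud_type": "aws",
        },
    ]
}


def expand_node_scenarios(config: Dict) -> List[Tuple[str, str, str]]:
    """Return (action, target_kind, target_value) tuples in injection order."""
    plan: List[Tuple[str, str, str]] = []
    for scenario in config.get("node_scenarios", []):
        # node_name takes precedence; label_selector is the fallback.
        if scenario.get("node_name"):
            target = ("node_name", scenario["node_name"])
        elif scenario.get("label_selector"):
            target = ("label_selector", scenario["label_selector"])
        else:
            raise ValueError("scenario needs node_name or label_selector")
        count = scenario.get("instance_kill_count", 1)
        cloud_type = scenario.get("cloud_type")
        for action in scenario.get("actions", []):
            if action in AWS_ONLY_ACTIONS and cloud_type != "aws":
                raise ValueError(f"{action} is supported only on AWS")
            # Repeat the action once per requested injection.
            plan.extend([(action, target[0], target[1])] * count)
    return plan
```

Validating the whole config before injecting anything avoids a run that fails halfway through because a later scenario is unsupported on the chosen cloud type.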
The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today. Adding a new pod-based scenario is as simple as adding a new config under the scenarios directory and listing it in the Kraken config.
Component | Description | Working |
---|---|---|
Etcd | Kills one or more etcd replicas a specified number of times in a loop | ✔️ |
Kube ApiServer | Kills one or more kube-apiserver replicas a specified number of times in a loop | ✔️ |
ApiServer | Kills one or more apiserver replicas a specified number of times in a loop | ✔️ |
Prometheus | Kills one or more prometheus replicas a specified number of times in a loop | ✔️ |