Chaos engineering on Amazon EKS using AWS Fault Injection Simulator (FIS)

Vardan Sharma
5 min read · Jul 6, 2023



What is Chaos Engineering

Chaos engineering improves the fault tolerance of a system by intentionally causing failures in a running system to expose its weaknesses, then fixing those weaknesses so the system can withstand real failures.

Kubernetes, a cornerstone of cloud-native technology, is designed with an emphasis on how quickly a system can recover from failure, and it ships with automatic recovery capabilities such as self-healing. But unless you catalogue your failure patterns and the corresponding recovery patterns (automatic recovery, manual recovery, and so on), your system will not withstand real production failures.

AWS FIS

AWS Fault Injection Simulator is a fully managed service for running fault injection experiments on AWS that makes it easier to improve an application’s performance, observability, and resiliency.

The basic workflow consists of only two steps:

(1) Creating an experiment template
(2) Executing an experiment
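The same two steps can be driven from the AWS CLI. The sketch below is purely illustrative: the role ARN, account ID, and template ID are placeholders, and the targets and actions sections are left empty here (the console walkthroughs later in this post fill in the real values). The AWS calls are guarded so the sketch is a harmless no-op on a machine without the CLI configured.

```shell
#!/bin/bash
# Illustrative two-step FIS workflow via the AWS CLI.
# The role ARN / account ID / template ID below are placeholders.
cat > fis-template.json <<'EOF'
{
  "description": "Terminate 40% of EKS node group instances",
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  "stopConditions": [{"source": "none"}],
  "targets": {},
  "actions": {}
}
EOF

# (1) Create the experiment template, then (2) start an experiment from it.
# "|| true" keeps the sketch non-fatal where the CLI or credentials are absent.
if command -v aws >/dev/null 2>&1; then
  aws fis create-experiment-template --cli-input-json file://fis-template.json || true
  aws fis start-experiment --experiment-template-id "<template-id>" || true
fi
```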

In this blog, we perform the following experiments:

(1) Terminate node group instances

(2) Delete application pods

Prerequisites:

  • A running Amazon EKS cluster with Cluster Autoscaler enabled and the AmazonSSMManagedInstanceCore AWS managed policy attached to the node IAM role.
  • kubectl locally installed to interact with the Amazon EKS cluster.
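Before starting, it helps to confirm that kubectl can actually reach the cluster. A quick sanity-check sketch; the cluster name and region are hypothetical placeholders you would replace with your own, and the AWS calls are guarded so this is a no-op without the CLI installed:

```shell
#!/bin/bash
# Hypothetical values -- substitute your own cluster name and region.
CLUSTER_NAME="my-eks-cluster"
AWS_REGION="us-east-1"

# Point kubectl at the cluster, then list worker nodes to confirm access.
if command -v aws >/dev/null 2>&1; then
  aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION" || true
  kubectl get nodes || true
fi
```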

Create IAM Role

  1. Navigate to IAM. Services > IAM.
  2. Click Roles and then click Create role.
  3. Select AWS service. Fault Injection Simulator (FIS) does not appear as a selectable service option, so select EC2 for now; we will add a trust policy for FIS later. Click Next.
  4. Click Policies in left pane and click on Create policy.
  5. Select the JSON tab. Copy and paste the following policy and click Next.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FISPermissions",
      "Effect": "Allow",
      "Action": [
        "fis:*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ReadOnlyActions",
      "Effect": "Allow",
      "Action": [
        "ssm:Describe*",
        "ssm:Get*",
        "ssm:List*",
        "ec2:DescribeInstances",
        "rds:DescribeDBClusters",
        "ecs:DescribeClusters",
        "ecs:ListContainerInstances",
        "eks:DescribeNodegroup",
        "cloudwatch:DescribeAlarms",
        "iam:ListRoles"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PermissionsToCreateServiceLinkedRole",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "iam:AWSServiceName": "fis.amazonaws.com"
        }
      }
    }
  ]
}

6. Click Next.
7. On the Review policy page, enter a name for your policy and click Create policy.

8. Click Create policy again, select the JSON tab, copy and paste the following policy, and click Next.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowFISExperimentRoleReadOnly",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ecs:DescribeClusters",
        "ecs:ListContainerInstances",
        "eks:DescribeNodegroup",
        "iam:ListRoles",
        "rds:DescribeDBInstances",
        "rds:DescribeDBClusters",
        "ssm:ListCommands"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowFISExperimentRoleEC2Actions",
      "Effect": "Allow",
      "Action": [
        "ec2:RebootInstances",
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*"
    },
    {
      "Sid": "AllowFISExperimentRoleECSActions",
      "Effect": "Allow",
      "Action": [
        "ecs:UpdateContainerInstancesState",
        "ecs:ListContainerInstances"
      ],
      "Resource": "arn:aws:ecs:*:*:container-instance/*"
    },
    {
      "Sid": "AllowFISExperimentRoleEKSActions",
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*"
    },
    {
      "Sid": "AllowFISExperimentRoleFISActions",
      "Effect": "Allow",
      "Action": [
        "fis:InjectApiInternalError",
        "fis:InjectApiThrottleError",
        "fis:InjectApiUnavailableError"
      ],
      "Resource": "arn:*:fis:*:*:experiment/*"
    },
    {
      "Sid": "AllowFISExperimentRoleRDSReboot",
      "Effect": "Allow",
      "Action": [
        "rds:RebootDBInstance"
      ],
      "Resource": "arn:aws:rds:*:*:db:*"
    },
    {
      "Sid": "AllowFISExperimentRoleRDSFailOver",
      "Effect": "Allow",
      "Action": [
        "rds:FailoverDBCluster"
      ],
      "Resource": "arn:aws:rds:*:*:cluster:*"
    },
    {
      "Sid": "AllowFISExperimentRoleSSMSendCommand",
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:ssm:*:*:document/*"
      ]
    },
    {
      "Sid": "AllowFISExperimentRoleSSMCancelCommand",
      "Effect": "Allow",
      "Action": [
        "ssm:CancelCommand"
      ],
      "Resource": "*"
    }
  ]
}

9. Go back to the Create role tab in your browser. Click the refresh button on the page.
10. Search for and select your newly created policies and click Next.
11. Click Next.
12. On the Review page, enter a role name and click Create role.
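Because we selected EC2 as the trusted service in step 3, the new role's trust relationship must be edited so that FIS can assume it. On the role's Trust relationships tab, a trust policy along these lines should work (fis.amazonaws.com is the same service principal used in the condition of the first policy above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```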

Experiment 1: Terminate node group instances

  1. Navigate to AWS Fault Injection Simulator. Services > AWS FIS
  2. Click Create experiment template.
  3. IAM role: Select the IAM role we created
  4. Click Add action.
  5. Action type: aws:eks:terminate-nodegroup-instances
  6. For Target, choose Nodegroups-Target-1.
  7. For instanceTerminationPercentage, enter 40.
  8. Click Save.

9. Choose Edit target.

10. For Resource type, choose aws:eks:nodegroup.

11. For Target method, select Resource IDs.

12. For Resource IDs, enter resource ID of your nodegroups.

13. Choose Save.

14. Choose Create experiment template.

15. Check the cluster nodes with kubectl get nodes.

16. On the AWS FIS console, navigate to the experiment template we created.

17. On the Actions menu, choose Start.

18. Enter start in the field.

19. Choose Start experiment.

Check the status of the cluster worker nodes. Adding a new node to the cluster takes some time, but after a while you will see that Amazon EKS has launched new instances to replace the terminated ones.
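To watch the recovery, you can poll the node list and count Ready nodes until the replacement instances have joined. A small helper sketch (the function name is my own invention):

```shell
#!/bin/bash
# count_ready: counts rows whose STATUS column reads "Ready" in
# `kubectl get nodes --no-headers` output fed on stdin.
count_ready() {
  awk '$2 == "Ready"' | wc -l
}

# Against a live cluster you would run something like:
#   kubectl get nodes --no-headers | count_ready
```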

Experiment 2: Delete application pods

  1. On the Systems Manager console, choose Documents.
  2. On the Create document menu, choose Command or Session.
  3. For Name, enter a name.
  4. In the Content section, enter the following code:
---
description: |
  ### Document name - Delete Pod

  ## What does this document do?
  Delete Pod in a specific namespace via kubectl

  ## Input Parameters
  * Cluster: (Required)
  * Namespace: (Required)
  * InstallDependencies: If set to True, Systems Manager installs the required dependencies on the target instances. (default True)

  ## Output Parameters
  None.

schemaVersion: '2.2'
parameters:
  Cluster:
    type: String
    description: '(Required) Specify the cluster name'
  Namespace:
    type: String
    description: '(Required) Specify the target Namespace'
  InstallDependencies:
    type: String
    description: 'If set to True, Systems Manager installs the required dependencies on the target instances (default: True)'
    default: 'True'
    allowedValues:
      - 'True'
      - 'False'
mainSteps:
  - action: aws:runShellScript
    name: InstallDependencies
    precondition:
      StringEquals:
        - platformType
        - Linux
    description: |
      ## Parameter: InstallDependencies
      If set to True, this step installs the required dependency via the operating system's repository.
    inputs:
      runCommand:
        - |
          #!/bin/bash
          if [[ "{{ InstallDependencies }}" == True ]] ; then
            if [[ "$( which kubectl 2>/dev/null )" ]] ; then echo Dependency is already installed. ; exit ; fi
            echo "Installing required dependencies"
            sudo mkdir -p $HOME/bin && cd $HOME/bin
            sudo curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.20.4/2021-04-12/bin/linux/amd64/kubectl
            sudo chmod +x ./kubectl
            export PATH=$PATH:$HOME/bin
          fi
  - action: aws:runShellScript
    name: ExecuteKubectlDeletePod
    precondition:
      StringEquals:
        - platformType
        - Linux
    description: |
      ## Parameters: Cluster, Namespace
      This step terminates the first pod found in the provided namespace.
    inputs:
      maxAttempts: 1
      runCommand:
        - |
          if [ -z "{{ Cluster }}" ] ; then echo Cluster not specified && exit; fi
          if [ -z "{{ Namespace }}" ] ; then echo Namespace not specified && exit; fi
          pgrep kubectl && echo Another kubectl command is already running, exiting... && exit
          EC2_REGION=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | grep region | awk -F\" '{print $4}')
          aws eks --region $EC2_REGION update-kubeconfig --name {{ Cluster }} --kubeconfig /home/ssm-user/.kube/config
          echo Running kubectl command...
          TARGET_POD=$(kubectl --kubeconfig /home/ssm-user/.kube/config get pods -n {{ Namespace }} -o jsonpath={.items[0].metadata.name})
          echo "TARGET_POD: $TARGET_POD"
          kubectl --kubeconfig /home/ssm-user/.kube/config delete pod $TARGET_POD -n {{ Namespace }} --grace-period=0 --force
          echo Finished kubectl delete pod command.

5. Choose Create document.

6. Create a new experiment template on the AWS FIS console.

7. For Action type, choose aws:ssm:send-command.

8. For documentARN, enter arn:aws:ssm:<region>:<accountId>:document/{Name-Step-3}.

9. For documentParameters, enter:

{"Cluster":"{cluster-name}", "Namespace":"{namespace}", "InstallDependencies":"True"}

10. Choose Save.

11. For our targets, choose Resource IDs and select the instance IDs of your worker nodes.

12. After you create the template successfully, start the experiment.
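As an alternative to steps 1 to 5 above, the document can also be registered from the CLI. A sketch, assuming the YAML above was saved locally as delete-pod.yml; the document name "DeletePod" is a placeholder, and the AWS call is guarded so the sketch is a no-op without the CLI configured:

```shell
#!/bin/bash
# Stand-in file for the full Delete Pod document shown earlier in this post;
# in practice you would save that YAML as delete-pod.yml instead.
printf 'schemaVersion: "2.2"\n' > delete-pod.yml

# Register the document with Systems Manager ("DeletePod" is a placeholder name).
if command -v aws >/dev/null 2>&1; then
  aws ssm create-document \
    --name "DeletePod" \
    --document-type "Command" \
    --document-format YAML \
    --content file://delete-pod.yml || true
fi
```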

When the experiment is complete, check your application pods; you will see that a new replica pod has been created to replace the deleted one.
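One quick way to confirm the replacement is to compare pod ages: the freshly created replica will be much younger than its siblings. A sketch, where the namespace value is a placeholder for the one you passed in documentParameters:

```shell
#!/bin/bash
# "default" is a placeholder -- use the namespace from documentParameters.
NAMESPACE="default"

# Sort pods by creation time; the newest pod at the bottom of the list is the
# replica the controller created after FIS deleted its predecessor.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n "$NAMESPACE" --sort-by=.metadata.creationTimestamp || true
fi
```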
