As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face rising challenges in effectively managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.
At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and continuous delivery (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow supervisor agent can integrate with individual agents that interface with specific observability signals and a CI/CD workflow to orchestrate and perform tasks based on user prompts.
In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents, one deriving insights from K8sGPT and another performing actions through the ArgoCD framework, you can build a comprehensive automation that identifies, analyzes, and resolves cluster issues with minimal human intervention.
Solution overview
The architecture consists of the following core components:
- Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context while routing user prompts to specialized agents, managing multistep operations and agent interactions
- Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events through K8sGPT's Analyze API for security issues, misconfigurations, and performance problems, providing remediation suggestions in natural language
- Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates
The following diagram illustrates the solution architecture.
Prerequisites
You need to have the following prerequisites in place:
Set up the Amazon EKS cluster with K8sGPT and ArgoCD
We start by installing and configuring the K8sGPT operator and the ArgoCD controller on the EKS cluster.
The K8sGPT operator enables AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.
ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what's defined in a Git repository.
The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD's GitOps engine. This integration means that when problems are detected (whether a misconfigured deployment, a resource constraint, or a scaling issue), the agent can automatically work with ArgoCD to apply the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a truly self-healing infrastructure.
- Create the necessary namespaces in Amazon EKS:
kubectl create ns helm-guestbook
kubectl create ns k8sgpt-operator-system
- Add the k8sgpt Helm repository and install the operator:
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install k8sgpt-operator k8sgpt/k8sgpt-operator --namespace k8sgpt-operator-system
- You can verify the installation by entering the following command:
kubectl get pods -n k8sgpt-operator-system
NAME                                                          READY   STATUS    RESTARTS   AGE
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0          1d
After the operator is deployed, you can configure a K8sGPT resource. This Custom Resource Definition (CRD) holds the large language model (LLM) configuration that assists in AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends for AI-powered analysis. For this post, we use Amazon Bedrock as the backend and Anthropic's Claude 3 as the LLM.
- You need to create the pod identity association that provides the EKS cluster access to other AWS services, in this case Amazon Bedrock:
eksctl create podidentityassociation --cluster PetSite --namespace k8sgpt-operator-system --service-account-name k8sgpt --role-name k8sgpt-app-eks-pod-identity-role --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess --region $AWS_REGION
- Configure the K8sGPT CRD:
cat << EOF > k8sgpt.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-bedrock
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: anthropic.claude-v3
    backend: amazonbedrock
    region: us-east-1
    credentials:
      secretRef:
        name: k8sgpt-secret
        namespace: k8sgpt-operator-system
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.3.48
EOF
kubectl apply -f k8sgpt.yaml
- Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:
kubectl get pods -n k8sgpt-operator-system
NAME                                                          READY   STATUS    RESTARTS      AGE
k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
- Now you can install the ArgoCD controller:
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
kubectl create namespace argocd
helm install argocd argo/argo-cd --namespace argocd --create-namespace
- Verify the ArgoCD installation:
kubectl get pods -n argocd
NAME                                                READY   STATUS    RESTARTS   AGE
argocd-application-controller-0                     1/1     Running   0          43d
argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
- Patch the argocd-server service to use an external load balancer:
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
- You can now access the ArgoCD UI at the following load balancer endpoint, using the credentials for the admin user:
kubectl get svc argocd-server -n argocd
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
- Retrieve the credentials for the ArgoCD UI:
export argocdpassword=`kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d`
echo ArgoCD admin password - $argocdpassword
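The retrieval above works because Kubernetes stores every Secret value base64-encoded under `.data`: the `jsonpath` expression extracts the encoded field, and `base64 -d` decodes it. A minimal local sketch of the same pipeline, using a placeholder value so no cluster is required:

```shell
# Local illustration of the Secret-decoding pipeline (no cluster needed).
# Kubernetes base64-encodes values under .data; kubectl's jsonpath output
# returns that encoded string, which base64 -d turns back into plaintext.
encoded=$(printf '%s' 'my-initial-password' | base64)   # stand-in for {.data.password}
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # prints: my-initial-password
```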
- Push the credentials to AWS Secrets Manager:
aws secretsmanager create-secret --name argocdcreds --description "Credentials for argocd" --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
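The `--secret-string` value must reach the AWS CLI as valid JSON, so the inner quotes have to survive shell expansion (escaped inside double quotes so that `$argocdpassword` still expands). A local sketch with a placeholder password that validates the payload before any AWS call:

```shell
# Sketch with a placeholder password: build the JSON payload for
# --secret-string, escaping the inner quotes so the shell keeps them
# while still expanding the variable.
argocdpassword='s3cr3t-example'   # placeholder; normally set in the previous step
secret_string="{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
echo "$secret_string"
# Validate that the payload is well-formed JSON before handing it to the CLI:
printf '%s' "$secret_string" | python3 -m json.tool > /dev/null && echo "valid JSON"
```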
- Configure a sample application in ArgoCD:
cat << EOF > argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: helm-guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/awsvikram/argocd-example-apps
    targetRevision: HEAD
    path: helm-guestbook
  destination:
    server: https://kubernetes.default.svc
    namespace: helm-guestbook
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
- Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:
kubectl apply -f argocd-application.yaml
- It takes some time for K8sGPT to analyze the newly created pods. To speed this up, restart the pods in the k8sgpt-operator-system namespace by entering the following command:
kubectl -n k8sgpt-operator-system rollout restart deploy
deployment.apps/k8sgpt-bedrock restarted
deployment.apps/k8sgpt-operator-controller-manager restarted
Set up the Amazon Bedrock agents for K8sGPT and ArgoCD
We use an AWS CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. When you deploy the CloudFormation template, you deploy a number of resources (costs will be incurred for the AWS resources used).
Use the following parameters for the CloudFormation template:
The stack creates the following AWS Lambda functions:
- LambdaK8sGPTAgent
- RestartRollBackApplicationArgoCD
- ArgocdIncreaseMemory
The stack creates the following Amazon Bedrock agents:
- ArgoCDAgent, with the following action groups:
  - argocd-rollback
  - argocd-restart
  - argocd-memory-management
- K8sGPTAgent, with the following action group:
  - k8s-cluster-operations
The stack provides the following outputs:
- ArgoCDAgentAliasId – The ID of the ArgoCD Amazon Bedrock agent alias
- K8sGPTAgentAliasId – The ID of the K8sGPT Amazon Bedrock agent alias
- CollaboratorAgentAliasId – The ID of the collaborator Amazon Bedrock agent alias, which has the ArgoCDAgent and K8sGPTAgent agents associated with it
- LambdaK8sGPTAgentRole – The AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function that handles interactions with the K8sGPT agent on the EKS cluster. You will need this role ARN at a later stage of the configuration process.
Assign appropriate permissions to enable the K8sGPT Amazon Bedrock agent to access the EKS cluster
To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using the Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function's execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy
to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze EKS cluster resources while maintaining the principle of least privilege.
- Create an access entry for the Lambda function's execution role:
export CFN_STACK_NAME=EKS-Troubleshooter
export EKS_CLUSTER=PetSite
export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
aws eks create-access-entry --cluster-name $EKS_CLUSTER --principal-arn $K8SGPT_LAMBDA_ROLE
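The `--query` argument in the describe-stacks call is a JMESPath expression that picks the `LambdaK8sGPTAgentRole` value out of the stack outputs. The same selection can be sanity-checked locally against a sample payload (the ARN below is a hypothetical stand-in) without calling AWS:

```shell
# Local stand-in for the describe-stacks call: the embedded python3 performs
# the same selection the JMESPath --query expression does.
cat > /tmp/stacks-sample.json << 'EOF'
{"Stacks": [{"Outputs": [
  {"OutputKey": "CollaboratorAgentAliasId", "OutputValue": "ALIAS123"},
  {"OutputKey": "LambdaK8sGPTAgentRole", "OutputValue": "arn:aws:iam::111122223333:role/k8sgpt-demo"}
]}]}
EOF
python3 - << 'PY'
import json
with open('/tmp/stacks-sample.json') as f:
    stacks = json.load(f)
outputs = stacks["Stacks"][0]["Outputs"]
# Equivalent of: Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue
role = next(o["OutputValue"] for o in outputs
            if o["OutputKey"] == "LambdaK8sGPTAgentRole")
print(role)   # prints the hypothetical role ARN
PY
```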
- Associate the EKS view policy with the access entry:
aws eks associate-access-policy --cluster-name $EKS_CLUSTER --principal-arn $K8SGPT_LAMBDA_ROLE --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy --access-scope type=cluster
- Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, choose Agents, as shown in the following screenshot.
Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow
Now, let's test the solution. We explore the following two scenarios:
- The collaborator agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
- The collaborator agent coordinates with the ArgoCD agent to provide a response
The collaborator agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
In this section, we examine a down alert for a sample application called memory-demo. We're interested in the root cause of the issue. We use the following prompt: "We received a down alert for the memory-demo app. Help us with the root cause of the issue."
The agent not only identified the root cause, but went one step further and suggested a fix for the error, which in this case is increasing the memory resources allocated to the application.
The collaborator agent coordinates with the ArgoCD agent to provide a response
For this scenario, we continue from the previous prompt. We believe the application wasn't given enough memory, and that it should be increased to fully fix the issue. We can also tell that the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.
Let's now proceed to increase the memory, as shown in the following screenshot.
The collaborator agent interacted with the argocd_operations Amazon Bedrock agent and was able to successfully increase the memory. The same can be confirmed in the ArgoCD UI.
Clean up
If you decide to stop using the solution, complete the following steps:
- To delete the associated resources deployed using AWS CloudFormation:
  - On the AWS CloudFormation console, choose Stacks in the navigation pane.
  - Locate the stack you created during the deployment process (you assigned a name to it).
  - Select the stack and choose Delete.
- Delete the EKS cluster if you created one specifically for this implementation.
Conclusion
By orchestrating multiple Amazon Bedrock agents, we've demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation showcases what becomes possible when you combine specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it's important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.
As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization's specific needs.
To learn more about Amazon Bedrock, refer to the following resources:
About the authors
Vikram Venkataraman is a Principal Specialist Solutions Architect at Amazon Web Services (AWS). He helps customers modernize, scale, and adopt best practices for their containerized workloads. With the emergence of generative AI, Vikram has been actively working with customers to leverage AWS AI/ML services to solve complex operational challenges, streamline monitoring workflows, and enhance incident response through intelligent automation.
Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about monitoring and observability, cloud financial management, and generative AI. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.
Sudheer Sangunni is a Senior Technical Account Manager at AWS Enterprise Support. With his extensive expertise in the AWS Cloud and big data, Sudheer plays a pivotal role in helping customers enhance their monitoring and observability capabilities within AWS offerings.
Vikrant Choudhary is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in healthcare and life sciences. With over 15 years of experience in cloud solutions and enterprise architecture, he helps businesses accelerate their digital transformation initiatives. In his current role, Vikrant partners with customers to architect and implement innovative solutions, from cloud migrations and application modernization to emerging technologies such as generative AI, driving successful business outcomes through cloud adoption.