As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face rising challenges in effectively managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.
At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and continuous delivery (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow supervisor agent can integrate with individual agents that interface with specific observability signals and a CI/CD workflow to orchestrate and perform tasks based on user prompts.
In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents, one deriving insights from K8sGPT and another performing actions through the ArgoCD framework, you can build a comprehensive automation that identifies, analyzes, and resolves cluster issues with minimal human intervention.
Solution overview
The architecture consists of the following core components:
- Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context while routing user prompts to specialized agents, managing multistep operations and agent interactions
- Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events through K8sGPT's Analyze API for security issues, misconfigurations, and performance problems, providing remediation suggestions in natural language
- Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates
The following diagram illustrates the solution architecture.
Prerequisites
You need to have the following prerequisites in place:
Set up the Amazon EKS cluster with K8sGPT and ArgoCD
We start by installing and configuring the K8sGPT operator and the ArgoCD controller on the EKS cluster.
The K8sGPT operator enables AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.
ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what's defined in a Git repository.
The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD's GitOps engine. This integration means that when problems are detected (whether a misconfigured deployment, a resource constraint, or a scaling issue), the agent can automatically work with ArgoCD to apply the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a truly self-healing infrastructure.
- Create the necessary namespaces in Amazon EKS:
kubectl create ns helm-guestbook
kubectl create ns k8sgpt-operator-system
- Add the k8sgpt Helm repository and install the operator:
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install k8sgpt-operator k8sgpt/k8sgpt-operator --namespace k8sgpt-operator-system
- You can verify the installation by entering the following command:
kubectl get pods -n k8sgpt-operator-system
NAME                                                          READY   STATUS    RESTARTS   AGE
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0          1d
After the operator is deployed, you can configure a K8sGPT resource. This Custom Resource Definition (CRD) holds the large language model (LLM) configuration that assists in AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends for AI-powered analysis. For this post, we use Amazon Bedrock as the backend and Anthropic's Claude 3 as the LLM.
- You need to create the pod identity association that provides the EKS cluster access to other AWS services, in this case Amazon Bedrock:
eksctl create podidentityassociation --cluster PetSite --namespace k8sgpt-operator-system --service-account-name k8sgpt --role-name k8sgpt-app-eks-pod-identity-role --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess --region $AWS_REGION
- Configure the K8sGPT CRD:
cat << EOF > k8sgpt.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-bedrock
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: anthropic.claude-v3
    backend: amazonbedrock
    region: us-east-1
    credentials:
      secretRef:
        name: k8sgpt-secret
        namespace: k8sgpt-operator-system
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.3.48
EOF
kubectl apply -f k8sgpt.yaml
- Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:
kubectl get pods -n k8sgpt-operator-system
NAME                                                          READY   STATUS    RESTARTS      AGE
k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
- Now you can install the ArgoCD controller:
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
kubectl create namespace argocd
helm install argocd argo/argo-cd --namespace argocd --create-namespace
- Verify the ArgoCD installation:
kubectl get pods -n argocd
NAME                                                READY   STATUS    RESTARTS   AGE
argocd-application-controller-0                     1/1     Running   0          43d
argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
- Patch the argocd-server service to use an external load balancer:
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
- You can now access the ArgoCD UI at the following load balancer endpoint, using the credentials for the admin user:
kubectl get svc argocd-server -n argocd
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
- Retrieve the credentials for the ArgoCD UI:
export argocdpassword=`kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d`
echo ArgoCD admin password - $argocdpassword
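The retrieval above works because Kubernetes stores every Secret value base64-encoded under `.data`: the `jsonpath` expression extracts the encoded field, and `base64 -d` decodes it. A minimal local sketch of the same pipeline, using a placeholder value so no cluster is required:

```shell
# Local illustration of the Secret-decoding pipeline (no cluster needed).
# Kubernetes base64-encodes values under .data; kubectl's jsonpath output
# returns that encoded string, which base64 -d turns back into plaintext.
encoded=$(printf '%s' 'my-initial-password' | base64)   # stand-in for {.data.password}
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # prints: my-initial-password
```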
- Push the credentials to AWS Secrets Manager:
aws secretsmanager create-secret --name argocdcreds --description "Credentials for argocd" --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
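The `--secret-string` value must reach the AWS CLI as valid JSON, so the inner quotes have to survive shell expansion (escaped inside double quotes so that `$argocdpassword` still expands). A local sketch with a placeholder password that validates the payload before any AWS call:

```shell
# Sketch with a placeholder password: build the JSON payload for
# --secret-string, escaping the inner quotes so the shell keeps them
# while still expanding the variable.
argocdpassword='s3cr3t-example'   # placeholder; normally set in the previous step
secret_string="{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
echo "$secret_string"
# Validate that the payload is well-formed JSON before handing it to the CLI:
printf '%s' "$secret_string" | python3 -m json.tool > /dev/null && echo "valid JSON"
```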
- Configure a sample application in ArgoCD:
cat << EOF > argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: helm-guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/awsvikram/argocd-example-apps
    targetRevision: HEAD
    path: helm-guestbook
  destination:
    server: https://kubernetes.default.svc
    namespace: helm-guestbook
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
- Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:
kubectl apply -f argocd-application.yaml
- It takes some time for K8sGPT to analyze the newly created pods. To speed this up, restart the pods in the k8sgpt-operator-system namespace by entering the following command:
kubectl -n k8sgpt-operator-system rollout restart deploy
deployment.apps/k8sgpt-bedrock restarted
deployment.apps/k8sgpt-operator-controller-manager restarted
Set up the Amazon Bedrock agents for K8sGPT and ArgoCD
We use an AWS CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. When you deploy the CloudFormation template, you deploy a number of resources (costs will be incurred for the AWS resources used).
Use the following parameters for the CloudFormation template:
The stack creates the following AWS Lambda functions:
- LambdaK8sGPTAgent
- RestartRollBackApplicationArgoCD
- ArgocdIncreaseMemory
The stack creates the following Amazon Bedrock agents:
- ArgoCDAgent, with the following action groups:
  - argocd-rollback
  - argocd-restart
  - argocd-memory-management
- K8sGPTAgent, with the following action group:
  - k8s-cluster-operations
The stack provides the following outputs:
- ArgoCDAgentAliasId – The ID of the ArgoCD Amazon Bedrock agent alias
- K8sGPTAgentAliasId – The ID of the K8sGPT Amazon Bedrock agent alias
- CollaboratorAgentAliasId – The ID of the collaborator Amazon Bedrock agent alias, which has the ArgoCDAgent and K8sGPTAgent agents associated with it
- LambdaK8sGPTAgentRole – The AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function that handles interactions with the K8sGPT agent on the EKS cluster. You will need this role ARN at a later stage of the configuration process.
Assign appropriate permissions to enable the K8sGPT Amazon Bedrock agent to access the EKS cluster
To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using the Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function's execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy
to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze EKS cluster resources while maintaining the principle of least privilege.
- Create an access entry for the Lambda function's execution role:
export CFN_STACK_NAME=EKS-Troubleshooter
export EKS_CLUSTER=PetSite
export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
aws eks create-access-entry --cluster-name $EKS_CLUSTER --principal-arn $K8SGPT_LAMBDA_ROLE
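The `--query` argument in the describe-stacks call is a JMESPath expression that picks the `LambdaK8sGPTAgentRole` value out of the stack outputs. The same selection can be sanity-checked locally against a sample payload (the ARN below is a hypothetical stand-in) without calling AWS:

```shell
# Local stand-in for the describe-stacks call: the embedded python3 performs
# the same selection the JMESPath --query expression does.
cat > /tmp/stacks-sample.json << 'EOF'
{"Stacks": [{"Outputs": [
  {"OutputKey": "CollaboratorAgentAliasId", "OutputValue": "ALIAS123"},
  {"OutputKey": "LambdaK8sGPTAgentRole", "OutputValue": "arn:aws:iam::111122223333:role/k8sgpt-demo"}
]}]}
EOF
python3 - << 'PY'
import json
with open('/tmp/stacks-sample.json') as f:
    stacks = json.load(f)
outputs = stacks["Stacks"][0]["Outputs"]
# Equivalent of: Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue
role = next(o["OutputValue"] for o in outputs
            if o["OutputKey"] == "LambdaK8sGPTAgentRole")
print(role)   # prints the hypothetical role ARN
PY
```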
- Associate the EKS view policy with the access entry:
aws eks associate-access-policy --cluster-name $EKS_CLUSTER --principal-arn $K8SGPT_LAMBDA_ROLE --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy --access-scope type=cluster
- Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, choose Agents, as shown in the following screenshot.
Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow
Now, let's test the solution. We explore the following two scenarios:
- The collaborator agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
- The collaborator agent coordinates with the ArgoCD agent to provide a response
The collaborator agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
In this section, we examine a down alert for a sample application called memory-demo. We're interested in the root cause of the issue. We use the following prompt: "We received a down alert for the memory-demo app. Help us with the root cause of the issue."
The agent not only identified the root cause, but went one step further and suggested a fix for the error, which in this case is increasing the memory resources allocated to the application.
The collaborator agent coordinates with the ArgoCD agent to provide a response
For this scenario, we continue from the previous prompt. We believe the application wasn't given enough memory, and that it should be increased to fully fix the issue. We can also tell that the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.
Let's now proceed to increase the memory, as shown in the following screenshot.
The collaborator agent interacted with the argocd_operations Amazon Bedrock agent and was able to successfully increase the memory. The same can be confirmed in the ArgoCD UI.
Clean up
If you decide to stop using the solution, complete the following steps:
- To delete the associated resources deployed using AWS CloudFormation:
  - On the AWS CloudFormation console, choose Stacks in the navigation pane.
  - Locate the stack you created during the deployment process (you assigned a name to it).
  - Select the stack and choose Delete.
- Delete the EKS cluster if you created one specifically for this implementation.
Conclusion
By orchestrating multiple Amazon Bedrock agents, we've demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation showcases what becomes possible when you combine specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it's important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.
As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization's specific needs.
To learn more about Amazon Bedrock, refer to the following resources:
About the authors
Vikram Venkataraman is a Principal Specialist Solutions Architect at Amazon Web Services (AWS). He helps customers modernize, scale, and adopt best practices for their containerized workloads. With the emergence of generative AI, Vikram has been actively working with customers to leverage AWS AI/ML services to solve complex operational challenges, streamline monitoring workflows, and enhance incident response through intelligent automation.
Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about monitoring and observability, cloud financial management, and generative AI. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.
Sudheer Sangunni is a Senior Technical Account Manager at AWS Enterprise Support. With his extensive expertise in the AWS Cloud and big data, Sudheer plays a pivotal role in helping customers enhance their monitoring and observability capabilities within AWS offerings.
Vikrant Choudhary is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in healthcare and life sciences. With over 15 years of experience in cloud solutions and enterprise architecture, he helps businesses accelerate their digital transformation initiatives. In his current role, Vikrant partners with customers to architect and implement innovative solutions, from cloud migrations and application modernization to emerging technologies such as generative AI, driving successful business outcomes through cloud adoption.