Automationscribe.com

Automate Amazon EKS troubleshooting using an Amazon Bedrock agentic workflow

by admin
April 17, 2025
in Artificial Intelligence


As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face increasing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.

At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and deployment (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow manager agent can integrate with individual agents that interface with individual observability signals and a CI/CD workflow to orchestrate and perform tasks based on the user prompt.

In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents (deriving insights from K8sGPT and performing actions through the ArgoCD framework), you can build a comprehensive automation that identifies, analyzes, and resolves cluster issues with minimal human intervention.

Solution overview

The architecture consists of the following core components:

  • Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context while routing user prompts to specialized agents, managing multistep operations and agent interactions
  • Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events through K8sGPT’s Analyze API for security issues, misconfigurations, and performance problems, providing remediation suggestions in natural language
  • Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates

The following diagram illustrates the solution architecture.

Architecture Diagram

Prerequisites

You need to have the following prerequisites in place:

Set up the Amazon EKS cluster with K8sGPT and ArgoCD

We start by installing and configuring the K8sGPT operator and ArgoCD controller on the EKS cluster.

The K8sGPT operator enables AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.

The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This integration means that when problems are detected (whether it’s a misconfigured deployment, a resource constraint, or a scaling issue), the agent can automatically integrate with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a self-healing infrastructure.

  1. Create the necessary namespaces in Amazon EKS:
    kubectl create ns helm-guestbook
    kubectl create ns k8sgpt-operator-system
  2. Add the k8sgpt Helm repository and install the operator:
    helm repo add k8sgpt https://charts.k8sgpt.ai/
    helm repo update
    helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
      --namespace k8sgpt-operator-system
  3. You can verify the installation by entering the following command:
    kubectl get pods -n k8sgpt-operator-system
    
    NAME                                                          READY   STATUS    RESTARTS  AGE
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0         1d
    

After the operator is deployed, you can configure a K8sGPT resource. This Custom Resource Definition (CRD) holds the large language model (LLM) configuration that assists in AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends for AI-powered analysis. For this post, we use Amazon Bedrock as the backend and Anthropic’s Claude V3 as the LLM.

  1. Create the pod identity that gives the EKS cluster access to other AWS services, including Amazon Bedrock:
    eksctl create podidentityassociation --cluster PetSite --namespace k8sgpt-operator-system --service-account-name k8sgpt --role-name k8sgpt-app-eks-pod-identity-role --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess --region $AWS_REGION
  2. Configure the K8sGPT CRD:
    cat << EOF > k8sgpt.yaml
    apiVersion: core.k8sgpt.ai/v1alpha1
    kind: K8sGPT
    metadata:
      name: k8sgpt-bedrock
      namespace: k8sgpt-operator-system
    spec:
      ai:
        enabled: true
        model: anthropic.claude-v3
        backend: amazonbedrock
        region: us-east-1
        credentials:
          secretRef:
            name: k8sgpt-secret
            namespace: k8sgpt-operator-system
      noCache: false
      repository: ghcr.io/k8sgpt-ai/k8sgpt
      version: v0.3.48
    EOF
    
    kubectl apply -f k8sgpt.yaml
    
  3. Validate the settings to confirm that the k8sgpt-bedrock pod is running successfully:
    kubectl get pods -n k8sgpt-operator-system
    NAME                                                          READY   STATUS    RESTARTS      AGE
    k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
    
  4. Now you can configure the ArgoCD controller:
    helm repo add argo https://argoproj.github.io/argo-helm
    helm repo update
    kubectl create namespace argocd
    helm install argocd argo/argo-cd \
      --namespace argocd \
      --create-namespace
  5. Verify the ArgoCD installation:
    kubectl get pods -n argocd
    NAME                                                READY   STATUS    RESTARTS   AGE
    argocd-application-controller-0                     1/1     Running   0          43d
    argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
    argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
    argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
    argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
    argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
    argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
  6. Patch the argocd service to have an external load balancer:
    kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
  7. You can now access the ArgoCD UI with the following load balancer endpoint and the credentials for the admin user:
    kubectl get svc argocd-server -n argocd
    NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
    argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
  8. Retrieve the credentials for the ArgoCD UI:
    export argocdpassword=`kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath="{.data.password}" | base64 -d`
    
    echo ArgoCD admin password - $argocdpassword
  9. Push the credentials to AWS Secrets Manager:
    aws secretsmanager create-secret \
    --name argocdcreds \
    --description "Credentials for argocd" \
    --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
  10. Configure a sample application in ArgoCD:
    cat << EOF > argocd-application.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: helm-guestbook
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/awsvikram/argocd-example-apps
        targetRevision: HEAD
        path: helm-guestbook
      destination:
        server: https://kubernetes.default.svc
        namespace: helm-guestbook
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
    EOF
  11. Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:
    kubectl apply -f argocd-application.yaml

    ArgoCD Application

  12. It takes some time for K8sGPT to analyze the newly created pods. To trigger the analysis immediately, restart the pods created in the k8sgpt-operator-system namespace by entering the following command:
    kubectl -n k8sgpt-operator-system rollout restart deploy
    
    deployment.apps/k8sgpt-bedrock restarted
    deployment.apps/k8sgpt-operator-controller-manager restarted
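
Once the pods are back, it can help to look at what K8sGPT actually found. The following is a minimal sketch, not part of the original walkthrough, that condenses the JSON produced by `kubectl get results -n k8sgpt-operator-system -o json` into one line per finding. It assumes the operator stores findings as Result custom resources with `spec.kind`, `spec.name`, and `spec.error[].text` fields; those field names may differ between operator versions, and the helper names `summarize_results` and `fetch_results` are hypothetical.

```python
import json
import subprocess


def summarize_results(result_list: dict) -> list:
    """Turn a `kubectl get results -o json` payload into readable lines."""
    lines = []
    for item in result_list.get("items", []):
        spec = item.get("spec", {})
        # Assumption: each entry in spec.error is a dict with a "text" field.
        errors = "; ".join(e.get("text", "") for e in spec.get("error", []))
        lines.append(f"{spec.get('kind', '?')} {spec.get('name', '?')}: {errors}")
    return lines


def fetch_results(namespace: str = "k8sgpt-operator-system") -> dict:
    """Shell out to kubectl; requires kubectl and access to the cluster."""
    completed = subprocess.run(
        ["kubectl", "get", "results", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

With cluster access, `print("\n".join(summarize_results(fetch_results())))` would list the current findings; without it, `summarize_results` can still be exercised on a saved JSON file.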

Set up the Amazon Bedrock agents for K8sGPT and ArgoCD

We use a CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. When you deploy the CloudFormation template, you deploy several resources (costs will be incurred for the AWS resources used).

Use the following parameters for the CloudFormation template:

The stack creates the following AWS Lambda functions:

  • -LambdaK8sGPTAgent-
  • -RestartRollBackApplicationArgoCD-
  • -ArgocdIncreaseMemory-

The stack creates the following Amazon Bedrock agents:

  • ArgoCDAgent, with the following action groups:
    1. argocd-rollback
    2. argocd-restart
    3. argocd-memory-management
  • K8sGPTAgent, with the following action group:
    1. k8s-cluster-operations
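
The post does not show the Lambda code behind these action groups. As a rough illustration of how an action group hands a request to a Lambda function, here is a minimal sketch of a handler for an action group defined with function details. The event and response shapes follow the Amazon Bedrock Agents Lambda contract as we understand it; the parameter name `application_name` and the reply text are hypothetical, and a real handler would call the ArgoCD API instead of returning a placeholder message.

```python
def lambda_handler(event, context):
    """Sketch of a Bedrock agent action group handler (function-details style)."""
    # The agent passes parameters as a list of {"name", "type", "value"} dicts.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    # Hypothetical parameter; the real action groups may use different names.
    app = params.get("application_name", "unknown")

    # Placeholder result. A real implementation would talk to ArgoCD here.
    body = f"Restart requested for application {app}"

    # Echo back the action group and function so the agent can match the reply.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": body}}
            },
        },
    }
```

If the action groups are instead defined with an OpenAPI schema, the response carries `apiPath`, `httpMethod`, and `httpStatusCode` rather than `function`, so treat the shape above as one of two possibilities.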

The stack outputs the following, with the following agents associated with it:

  1. ArgoCDAgent
  2. K8sGPTAgent
  • LambdaK8sGPTAgentRole, the AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function handling interactions with the K8sGPT agent on the EKS cluster. This role ARN is needed at a later stage of the configuration process.
  • K8sGPTAgentAliasId, ID of the K8sGPT Amazon Bedrock agent alias
  • ArgoCDAgentAliasId, ID of the ArgoCD Amazon Bedrock agent alias
  • CollaboratorAgentAliasId, ID of the collaborator Amazon Bedrock agent alias

Assign appropriate permissions to enable the K8sGPT Amazon Bedrock agent to access the EKS cluster

To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.

  1. Create an access entry for the Lambda function’s execution role:
    export CFN_STACK_NAME=EKS-Troubleshooter
    export EKS_CLUSTER=PetSite
    
    export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
    
    aws eks create-access-entry \
        --cluster-name $EKS_CLUSTER \
        --principal-arn $K8SGPT_LAMBDA_ROLE
  2. Associate the EKS view policy with the access entry:
    aws eks associate-access-policy \
        --cluster-name $EKS_CLUSTER \
        --principal-arn $K8SGPT_LAMBDA_ROLE \
        --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
        --access-scope type=cluster
  3. Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, choose Agents, as shown in the following screenshot.

Bedrock agents
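
For teams that prefer to script the access entry and view-policy association from steps 1 and 2 rather than run them by hand, the following is a minimal boto3 sketch of the same two calls. It assumes boto3 is installed and credentials are configured; the helper names `access_requests` and `grant_view_access` are hypothetical, and the role ARN would come from the LambdaK8sGPTAgentRole stack output.

```python
VIEW_POLICY_ARN = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy"


def access_requests(cluster: str, role_arn: str) -> tuple:
    """Build the keyword arguments for the two EKS access-management calls."""
    entry = {"clusterName": cluster, "principalArn": role_arn}
    policy = {
        **entry,
        "policyArn": VIEW_POLICY_ARN,
        # Cluster-wide scope, matching --access-scope type=cluster in the CLI.
        "accessScope": {"type": "cluster"},
    }
    return entry, policy


def grant_view_access(cluster: str, role_arn: str, region: str = "us-east-1") -> None:
    """Create the access entry, then attach the read-only view policy."""
    import boto3  # imported lazily so access_requests works without boto3

    eks = boto3.client("eks", region_name=region)
    entry, policy = access_requests(cluster, role_arn)
    eks.create_access_entry(**entry)
    eks.associate_access_policy(**policy)
```

Keeping the request construction separate from the API calls makes it easy to review exactly which policy and scope will be granted before anything is changed on the cluster.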

Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow

Now, test the solution. We explore the following two scenarios:

  1. The collaborator agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
  2. The collaborator agent coordinates with the ArgoCD agent to provide a response

Collaborator agent coordinates with K8sGPT agent to provide insights into the root cause of a pod failure

In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”
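
The prompt can be entered in the Amazon Bedrock console, but it can also be sent programmatically. The following is a minimal sketch using the `invoke_agent` operation of the boto3 `bedrock-agent-runtime` client. The agent ID and alias ID are placeholders you would supply yourself (the alias ID corresponds to the CollaboratorAgentAliasId stack output); the helper names are hypothetical, and error handling is omitted.

```python
import uuid
from typing import Iterable, Optional


def collect_completion(events: Iterable[dict]) -> str:
    """Join the text chunks of an invoke_agent event stream into one string."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        # Non-chunk events (for example trace events) are skipped.
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


def ask_agent(agent_id: str, alias_id: str, prompt: str,
              session_id: Optional[str] = None,
              region: str = "us-east-1") -> str:
    """Send one prompt to the collaborator agent and return its full reply."""
    import boto3  # imported lazily so collect_completion works without boto3

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        # Reusing a session ID lets a follow-up prompt build on earlier context.
        sessionId=session_id or str(uuid.uuid4()),
        inputText=prompt,
    )
    return collect_completion(response["completion"])
```

Passing the same `session_id` on a later call is what would let a follow-up request, such as asking for a memory increase, continue from this diagnosis.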

The agent not only identified the root cause, but went one step further and proposed a potential fix, which in this case is increasing the memory resources available to the application.

K8sgpt agent finding

Collaborator agent coordinates with ArgoCD agent to provide a response

For this scenario, we continue from the previous prompt. We believe the application wasn’t given enough memory, and it should be increased to permanently fix the issue. We can also tell that the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.

ArgoUI

Let’s now proceed to increase the memory, as shown in the following screenshot.

Interacting with agent to increase memory

The agent interacted with the argocd_operations Amazon Bedrock agent and successfully increased the memory. The same can be seen in the ArgoCD UI.

ArgoUI showing memory increase
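
The post does not describe how the new memory value is chosen. Purely as an illustration of the kind of calculation a memory-management action might perform, here is a hypothetical helper that multiplies a Kubernetes memory quantity with a binary suffix and normalizes the unit. It is an assumption for this sketch, not the logic of the ArgocdIncreaseMemory function.

```python
import re

_UNITS = ("Ki", "Mi", "Gi", "Ti")


def scale_memory(quantity: str, factor: int = 2) -> str:
    """Multiply a memory quantity such as '128Mi' by an integer factor."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", quantity)
    if not match:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    value = int(match.group(1)) * factor
    unit = match.group(2)
    # Promote to the next unit while the value is an exact multiple of 1024.
    while value % 1024 == 0 and unit != _UNITS[-1]:
        value //= 1024
        unit = _UNITS[_UNITS.index(unit) + 1]
    return f"{value}{unit}"
```

For example, `scale_memory("128Mi")` yields `256Mi`, and `scale_memory("512Mi")` yields `1Gi`. Only integer quantities with binary suffixes are handled; decimal suffixes such as `M` or `G` raise a `ValueError`.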

Cleanup

If you decide to stop using the solution, complete the following steps:

  1. To delete the associated resources deployed using AWS CloudFormation:
    1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
    2. Locate the stack you created during the deployment process (you assigned a name to it).
    3. Select the stack and choose Delete.
  2. Delete the EKS cluster if you created one specifically for this implementation.

Conclusion

By orchestrating multiple Amazon Bedrock agents, we’ve demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation showcases the powerful possibilities of combining specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it’s important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.

As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization’s specific needs.

To learn more about Amazon Bedrock, refer to the following resources:


About the authors

Vikram Venkataraman is a Principal Specialist Solutions Architect at Amazon Web Services (AWS). He helps customers modernize, scale, and adopt best practices for their containerized workloads. With the emergence of generative AI, Vikram has been actively working with customers to leverage AWS’s AI/ML services to solve complex operational challenges, streamline monitoring workflows, and enhance incident response through intelligent automation.

Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about monitoring and observability, cloud financial management, and generative AI domains. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.

Sudheer Sangunni is a Senior Technical Account Manager at AWS Enterprise Support. With his extensive expertise in the AWS Cloud and big data, Sudheer plays a pivotal role in assisting customers with enhancing their monitoring and observability capabilities within AWS offerings.

Vikrant Choudhary is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in healthcare and life sciences. With over 15 years of experience in cloud solutions and enterprise architecture, he helps businesses accelerate their digital transformation initiatives. In his current role, Vikrant partners with customers to architect and implement innovative solutions, from cloud migrations and application modernization to emerging technologies such as generative AI, driving successful business outcomes through cloud adoption.
