Chat-based assistants powered by Retrieval Augmented Generation (RAG) are transforming customer support, internal help desks, and enterprise search by delivering fast, accurate answers grounded in your own data. With RAG, you can use a ready-to-deploy foundation model (FM) and enrich it with your own data, making responses relevant and context-aware without the need for fine-tuning or retraining. Running these chat-based assistants on Amazon Elastic Kubernetes Service (Amazon EKS) gives you the flexibility to use a variety of FMs while retaining full control over your data and infrastructure.
Amazon EKS scales with your workload and is cost-efficient for both steady and fluctuating demand. Because EKS is certified Kubernetes-conformant, it's compatible with existing applications running on any standard Kubernetes environment, whether hosted in on-premises data centers or public clouds. For your data plane, you can take advantage of a wide range of compute options, including CPUs, GPUs, AWS purpose-built AI chips (AWS Inferentia and AWS Trainium), and Arm-based CPU architectures (AWS Graviton), to match performance and cost requirements. This flexibility makes Amazon EKS a great candidate for running heterogeneous workloads, because you can compose different compute substrates within the same cluster to optimize both performance and cost efficiency.
NVIDIA NIM microservices are containerized microservices that deploy and serve FMs, integrating with AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon EKS, and Amazon SageMaker. NIM microservices are distributed as Docker containers and are available through the NVIDIA NGC Catalog. Deploying GPU-accelerated models manually requires you to select and configure runtimes such as PyTorch or TensorFlow, set up inference servers such as Triton, implement model optimizations, and troubleshoot compatibility issues. This takes engineering time and expertise. NIM microservices eliminate this complexity by automating these technical decisions and configurations for you.
The NVIDIA NIM Operator is a Kubernetes management tool that facilitates the operation of model-serving components and services. It handles large language models (LLMs), embedders, and other model types through NVIDIA NIM microservices within Kubernetes environments. The Operator streamlines microservice management through three main custom resources. First, the NIMCache resource handles model downloading from NGC and persistence on network storage. This allows multiple microservice instances to share a single cached model, improving microservice startup time. Second, the NIMService resource manages individual NIM microservices, creating Kubernetes deployments within specified namespaces. Third, the NIMPipeline resource acts as an orchestrator for multiple NIM service resources, allowing coordinated management of groups of services. This architecture enables efficient operation and lifecycle management, with particular emphasis on reducing inference latency through model caching and on supporting automatic scaling.
NVIDIA NIM, coupled with the NVIDIA NIM Operator, provides a streamlined solution to the deployment complexities described above. In this post, we demonstrate the implementation of a practical RAG chat-based assistant using a comprehensive stack of modern technologies. The solution uses NVIDIA NIM for both LLM inference and text embedding services, with the NIM Operator handling their deployment and management. The architecture incorporates Amazon OpenSearch Serverless to store and query high-dimensional vector embeddings for similarity search.
The underlying Kubernetes infrastructure of the solution is provided by EKS Auto Mode, which supports GPU-accelerated Amazon Machine Images (AMIs) out of the box. These images include the NVIDIA device plugin, the NVIDIA container toolkit, precompiled NVIDIA kernel drivers, the Bottlerocket operating system, and Elastic Fabric Adapter (EFA) networking. You can use Auto Mode with accelerated AMIs to spin up GPU instances without manually installing and configuring GPU software components. Simply specify GPU-based instance types when creating Karpenter NodePools, and EKS Auto Mode will launch GPU-ready worker nodes to run your accelerated workloads.
Solution overview
The following architecture diagram shows how NVIDIA NIM microservices running on Amazon EKS Auto Mode power our RAG chat-based assistant solution. The design, shown in the following diagram, combines GPU-accelerated model serving with vector search in Amazon OpenSearch Serverless, using the NIM Operator to manage model deployment and caching through persistent Amazon Elastic File System (Amazon EFS) storage.
Solution diagram (numbers indicate steps in the solution walkthrough section)
The solution follows these high-level steps:
- Create an EKS cluster
- Set up Amazon OpenSearch Serverless
- Create an EFS file system and set up the necessary permissions
- Create a Karpenter GPU NodePool
- Install NVIDIA Node Feature Discovery (NFD) and the NIM Operator
- Create the nim-service namespace and NVIDIA secrets
- Create NIMCaches
- Create NIMServices
Solution walkthrough
In this section, we walk through the implementation of this RAG chat-based assistant solution step by step. We create an EKS cluster, configure Amazon OpenSearch Serverless and EFS storage, set up GPU-enabled nodes with Karpenter, deploy NVIDIA components for model serving, and finally integrate a chat-based assistant client using Gradio and LangChain. This end-to-end setup demonstrates how to combine LLM inference on Kubernetes with vector search capabilities, forming the foundation for a scalable, production-grade system once monitoring, auto scaling, and reliability features are added.
Prerequisites
To begin, make sure you have installed and set up the following required tools:
- AWS CLI (version aws-cli/2.27.11 or later)
- kubectl
- eksctl (use version v0.195.0 or later to support Auto Mode)
- Helm
These tools should be properly configured according to the Amazon EKS setup documentation.
Clone the reference repository and cd into its root folder:
Environment setup
You need an NGC API key to authenticate and download NIM models. To generate the key, you can sign up (for free) for the NVIDIA Developer Program and then follow the NVIDIA guidelines.
Next, set up a few environment variables (replace the values with your own information):
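The exact variable names depend on the reference repository; the following is a minimal sketch of the kind of values the rest of the walkthrough relies on (names such as CLUSTER_NAME and NGC_API_KEY are assumptions):

```bash
# Assumed variable names -- adjust to match the reference repository
export AWS_REGION=us-west-2             # AWS Region to deploy into
export CLUSTER_NAME=nim-rag-cluster     # EKS cluster name used throughout the walkthrough
export NGC_API_KEY=<your-ngc-api-key>   # key generated from the NVIDIA Developer Program
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
```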
Sample deployment
To deploy the solution, complete the steps in the following sections.
Create an EKS cluster
Deploy the EKS cluster using EKS Auto Mode with eksctl:
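A minimal sketch of what such an eksctl Auto Mode cluster definition could look like; the cluster name, Kubernetes version, add-on list, service account name, and IAM policy are assumptions, so use the configuration file in the reference repository as the authoritative version:

```bash
# Sketch only: values below are placeholders, not the repository's exact configuration
cat <<EOF > cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_REGION}
  version: "1.32"
autoModeConfig:
  enabled: true                  # EKS Auto Mode manages compute, storage, and networking
addons:
  - name: aws-efs-csi-driver     # needed later for the ReadWriteMany model cache volume
iam:
  podIdentityAssociations:
    - namespace: default
      serviceAccountName: chatbot-sa                                  # assumed name
      permissionPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonOpenSearchServiceFullAccess   # restrict in production
EOF
eksctl create cluster -f cluster.yaml
```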
Pod Identity Associations connect Kubernetes service accounts to AWS Identity and Access Management (IAM) roles, allowing pods to access AWS services securely. In this configuration, a service account is created and associated with an IAM role, granting it full permissions to OpenSearch Serverless (in a production environment, restrict privileges according to the principle of least privilege).
NIMCaches require volume AccessMode: ReadWriteMany. Amazon Elastic Block Store (Amazon EBS) volumes provided by EKS Auto Mode aren't suitable because they support ReadWriteOnce only and can't be mounted by multiple nodes. Storage options that support AccessMode: ReadWriteMany include Amazon EFS, as shown in this example, or Amazon FSx for Lustre, which offers higher performance for workloads with more demanding throughput or latency requirements.
The preceding command takes a few minutes to complete. When it finishes, eksctl configures your kubeconfig and points it to the new cluster. You can validate that the cluster is up and running and that the EFS add-on is installed by entering the following command:
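A sketch of the kind of checks you could run here (the add-on name is the standard aws-efs-csi-driver):

```bash
# Confirm the cluster is active and reachable from kubectl
aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION} \
  --query 'cluster.status' --output text
kubectl get nodes

# Confirm the EFS CSI driver add-on is installed
aws eks describe-addon --cluster-name ${CLUSTER_NAME} --region ${AWS_REGION} \
  --addon-name aws-efs-csi-driver --query 'addon.status' --output text
```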
Expected output:
Set up Amazon OpenSearch Serverless
A vector database stores and searches through numerical representations of text (embeddings). This component is essential in RAG chat-based assistant architectures because it finds information relevant to a user's question based on semantic similarity rather than exact keyword matches.
We use Amazon OpenSearch Service as the vector database. OpenSearch Service provides a managed solution for deploying, operating, and scaling OpenSearch clusters within the AWS Cloud. As part of this service, Amazon OpenSearch Serverless offers an on-demand configuration that automatically handles scaling to match your application's requirements.
First, using AWS PrivateLink, create a private connection between the cluster's Amazon Virtual Private Cloud (Amazon VPC) and Amazon OpenSearch Serverless. This keeps traffic within the AWS network and avoids public internet routing.
Enter the following commands to retrieve the cluster's virtual private cloud (VPC) ID, CIDR block range, and subnet IDs, and store them in corresponding environment variables:
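A sketch of one way to pull those values with the AWS CLI (the variable names are assumptions):

```bash
# VPC ID of the EKS cluster
export VPC_ID=$(aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION} \
  --query 'cluster.resourcesVpcConfig.vpcId' --output text)

# CIDR block of that VPC
export VPC_CIDR=$(aws ec2 describe-vpcs --vpc-ids ${VPC_ID} --region ${AWS_REGION} \
  --query 'Vpcs[0].CidrBlock' --output text)

# Subnet IDs in the VPC
export SUBNET_IDS=$(aws ec2 describe-subnets --region ${AWS_REGION} \
  --filters "Name=vpc-id,Values=${VPC_ID}" \
  --query 'Subnets[].SubnetId' --output text)
```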
Use the following code to create a security group for OpenSearch Serverless in the VPC, add an inbound rule to the security group allowing HTTPS traffic (port 443) from your VPC's CIDR range, and create an OpenSearch Serverless VPC endpoint associated with the subnets and security group:
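A sketch of those three steps (resource names such as opensearch-serverless-sg and chatbot-vpc-endpoint are placeholders):

```bash
# Security group for the OpenSearch Serverless VPC endpoint
export OSS_SG_ID=$(aws ec2 create-security-group --region ${AWS_REGION} \
  --group-name opensearch-serverless-sg --vpc-id ${VPC_ID} \
  --description "HTTPS access to OpenSearch Serverless" \
  --query 'GroupId' --output text)

# Allow HTTPS (443) from the VPC CIDR range
aws ec2 authorize-security-group-ingress --region ${AWS_REGION} \
  --group-id ${OSS_SG_ID} --protocol tcp --port 443 --cidr ${VPC_CIDR}

# OpenSearch Serverless VPC endpoint attached to the subnets and security group
export OSS_VPCE_ID=$(aws opensearchserverless create-vpc-endpoint --region ${AWS_REGION} \
  --name chatbot-vpc-endpoint \
  --vpc-id ${VPC_ID} \
  --subnet-ids ${SUBNET_IDS} \
  --security-group-ids ${OSS_SG_ID} \
  --query 'createVpcEndpointDetail.id' --output text)
```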
In the following steps, create an OpenSearch Serverless collection (a logical unit to store and organize documents); a combined sketch of the four commands appears after this list.
- Create an encryption policy for the collection:
- The network policy that restricts access to the collection so that it can only come through a specific VPC endpoint:
- The data policy that grants the chat-based assistant IAM role permissions to interact with indices in the collection:
- The OpenSearch collection itself:
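The following is a minimal sketch of those four calls, assuming a collection named chatbot-collection, the VPC endpoint ID captured earlier, and the chat-based assistant role ARN in CHATBOT_ROLE_ARN; the policy documents are simplified placeholders, so refer to the reference repository for the exact policies:

```bash
# ARN of the IAM role bound to the chatbot service account via Pod Identity (placeholder)
export CHATBOT_ROLE_ARN=arn:aws:iam::${AWS_ACCOUNT_ID}:role/<chatbot-role-name>

# Encryption policy using an AWS owned key (simplified example)
aws opensearchserverless create-security-policy --region ${AWS_REGION} \
  --name chatbot-encryption-policy --type encryption \
  --policy '{"Rules":[{"ResourceType":"collection","Resource":["collection/chatbot-collection"]}],"AWSOwnedKey":true}'

# Network policy restricting access to the VPC endpoint created earlier
aws opensearchserverless create-security-policy --region ${AWS_REGION} \
  --name chatbot-network-policy --type network \
  --policy '[{"Rules":[{"ResourceType":"collection","Resource":["collection/chatbot-collection"]}],"AllowFromPublic":false,"SourceVPCEs":["'${OSS_VPCE_ID}'"]}]'

# Data access policy for the chat-based assistant IAM role
aws opensearchserverless create-access-policy --region ${AWS_REGION} \
  --name chatbot-data-policy --type data \
  --policy '[{"Rules":[{"ResourceType":"index","Resource":["index/chatbot-collection/*"],"Permission":["aoss:*"]}],"Principal":["'${CHATBOT_ROLE_ARN}'"]}]'

# The vector search collection itself
aws opensearchserverless create-collection --region ${AWS_REGION} \
  --name chatbot-collection --type VECTORSEARCH
```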
Create an EFS file system and set up the necessary permissions
Create an EFS file system:
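For example, a minimal sketch (the tag name and throughput mode are assumptions):

```bash
# Create an encrypted EFS file system and capture its ID
export EFS_ID=$(aws efs create-file-system --region ${AWS_REGION} \
  --encrypted \
  --performance-mode generalPurpose \
  --throughput-mode elastic \
  --tags Key=Name,Value=nim-cache-efs \
  --query 'FileSystemId' --output text)
```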
EFS requires mount targets, which are VPC network endpoints that connect your EKS nodes to the EFS file system. These mount targets must be reachable from your EKS worker nodes, and access is managed using security groups.
- Run the following command to set up the mount targets and configure the required security group rules (a combined sketch of the steps in this list appears after the expected results):
- Create the StorageClass in Amazon EKS for Amazon EFS:
- Validate the EFS storage class:
These are the expected results:
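A sketch of these steps, assuming the EFS_ID, VPC_ID, VPC_CIDR, and SUBNET_IDS values captured earlier; the StorageClass uses the standard efs.csi.aws.com provisioner in dynamic provisioning mode and is named efs to match the rest of the walkthrough:

```bash
# Security group allowing NFS (2049) from the VPC, then one mount target per subnet
export EFS_SG_ID=$(aws ec2 create-security-group --region ${AWS_REGION} \
  --group-name efs-mount-sg --vpc-id ${VPC_ID} \
  --description "NFS access to EFS from EKS nodes" \
  --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --region ${AWS_REGION} \
  --group-id ${EFS_SG_ID} --protocol tcp --port 2049 --cidr ${VPC_CIDR}
for subnet in ${SUBNET_IDS}; do
  aws efs create-mount-target --region ${AWS_REGION} \
    --file-system-id ${EFS_ID} --subnet-id ${subnet} --security-groups ${EFS_SG_ID}
done

# StorageClass backed by the EFS CSI driver (dynamic provisioning with access points)
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: ${EFS_ID}
  directoryPerms: "700"
EOF

# Validate the storage class
kubectl get storageclass efs
```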
Create a Karpenter GPU NodePool
To create the Karpenter GPU NodePool, enter the following code:
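A minimal sketch of what such a NodePool could look like on EKS Auto Mode; the requirement keys, node class reference, and limits are assumptions based on the Auto Mode defaults, so check the repository manifest for the authoritative version:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-node-pool
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com     # EKS Auto Mode built-in NodeClass
        kind: NodeClass
        name: default
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["g5"]             # G5 instances with NVIDIA A10G GPUs
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]      # consider adding "spot" in production
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule         # only GPU workloads tolerate this taint
  limits:
    nvidia.com/gpu: 4                # assumed cap on total GPUs the pool can provision
EOF
```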
This NodePool is designed for GPU workloads using AWS G5 instances, which feature NVIDIA A10G GPUs. The taint ensures that only workloads specifically intended for GPU usage are scheduled on these nodes, maintaining efficient resource utilization. In a production environment, you might also want to consider using Amazon EC2 Spot Instances to optimize costs.
Enter the following command to validate successful creation of the NodePool:
These are the expected results:
The gpu-node-pool was created and has 0 nodes. To examine the nodes further, enter this command:
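For example (the label columns shown are just a convenient choice, not the only option):

```bash
# Show nodes with their instance type and owning Karpenter NodePool (blank for system nodes)
kubectl get nodes -o wide \
  -L node.kubernetes.io/instance-type \
  -L karpenter.sh/nodepool
```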
This is the expected output:
There are two instances, launched by EKS Auto Mode, with the non-accelerated Bottlerocket Amazon Machine Image (AMI) variant aws-k8s-1.32 and the CPU-only (non-GPU) instance type c6g.
Install NVIDIA NFD and the NIM Operator
NFD is a Kubernetes add-on that identifies available hardware capabilities and system settings. NFD and the NIM Operator are installed using Helm charts, each with its own custom resource definitions (CRDs).
- Before proceeding with the installation, verify whether the related CRDs already exist in your cluster:
If these CRDs aren't present, both commands will return no results.
- Add the Helm repos:
- Install the NFD dependency for the NIM Operator (a combined sketch of the Helm steps in this list appears after the expected output):
- Validate that the pods are up and the CRDs have been created:
Expected output:
Expected output:
- Install the NIM Operator:
You might need to use version v1.0.1 of the NIM Operator instead of v2.0.0, as shown in the preceding code example, because occasionally you might receive a "402 Payment Required" message.
- Validate that the pod is up and the CRDs have been created:
Expected output:
Expected output:
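The following is a minimal sketch of the Helm steps above; the repository URLs point to the public NFD and NVIDIA chart repositories, but the release names, namespaces, and chart versions here are assumptions, so prefer the commands in the reference repository:

```bash
# Add the Helm repos (kubernetes-sigs NFD chart and the NVIDIA chart repo)
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install Node Feature Discovery
helm install node-feature-discovery nfd/node-feature-discovery \
  --namespace node-feature-discovery --create-namespace

# Install the NIM Operator (fall back to --version v1.0.1 if v2.0.0 returns "402 Payment Required")
helm install nim-operator nvidia/k8s-nim-operator \
  --namespace nim-operator --create-namespace --version v2.0.0

# Validate pods and CRDs
kubectl get pods -n node-feature-discovery
kubectl get pods -n nim-operator
kubectl get crds | grep -E 'nodefeature|nim'
```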
Create the nim-service namespace and NVIDIA secrets
In this section, create the nim-service namespace and add two secrets containing your NGC API key.
- Create the namespace and secrets:
- Validate that the secrets were created:
The following is the expected result:
ngc-secret is a Docker registry secret used to authenticate and pull NIM container images from NVIDIA's NGC container registry. ngc-api-secret is a generic secret used by the model puller init container to authenticate and download models from the same registry.
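A sketch of the namespace and the two secrets described above, assuming the NGC_API_KEY environment variable set earlier (the key name inside ngc-api-secret is an assumption):

```bash
kubectl create namespace nim-service

# Docker registry secret for pulling NIM container images from nvcr.io
kubectl create secret docker-registry ngc-secret \
  --namespace nim-service \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"

# Generic secret used by the model puller init container to download models
kubectl create secret generic ngc-api-secret \
  --namespace nim-service \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}"

# Validate
kubectl get secrets -n nim-service
```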
Create NIMCaches
RAG enhances chat applications by enabling AI models to access internal domain-specific knowledge or external knowledge bases, reducing hallucinations and providing more accurate, up-to-date responses. In a RAG system, a knowledge base is created from domain-specific documents. These documents are sliced into smaller pieces of text. The text chunks and their generated embeddings are then uploaded to a vector database. Embeddings are numerical representations (vectors) that capture the meaning of text, where similar text content results in similar vector values. When questions are received from users, they are also sent with their respective embeddings to the database for semantic similarity search. The database returns the closest matching chunks of text, which are used by an LLM to provide a domain-specific answer.
We use Meta's llama-3-2-1b-instruct as the LLM and NVIDIA Retrieval QA E5 (embedqa-e5-v5) as the embedder.
This section covers the deployment of NIMCaches for storing both the LLM and the embedder model. Local storage of these models accelerates pod initialization by eliminating the need for repeated downloads. Our llama-3-2-1b-instruct LLM, with 1B parameters, is a relatively small model and uses 2.5 GB of storage space. Storage requirements and initialization time increase when larger models are used. Although the initial setup of the LLM and embedder caches takes 10–15 minutes, subsequent pod launches are faster because the models are already available in the cluster's local storage.
Enter the following command:
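A minimal sketch of what the two NIMCache resources could look like; the API group and field names follow the NIM Operator custom resources, but the model puller image tags and cache sizes are assumptions, so use the manifests in the reference repository:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.8.3   # assumed image tag
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      storageClass: efs
      volumeAccessMode: ReadWriteMany
      size: 50Gi                                                  # assumed cache size
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nv-embedqa-e5-v5
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.5.0      # assumed image tag
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: true
      storageClass: efs
      volumeAccessMode: ReadWriteMany
      size: 50Gi
EOF
```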
This is the expected output:
NIMCaches will create PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs) to store the models, with STORAGECLASS efs:
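For example:

```bash
# List the PVCs and PVs backing the model caches
kubectl get pvc,pv -n nim-service
```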
The following is the expected output:
Enter the following to validate the NIMCaches:
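For example:

```bash
kubectl get nimcaches -n nim-service
```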
This is the expected output (STATUS will initially be blank, then become InProgress for 10–15 minutes until the model download is complete):
Create NIMServices
NIMServices are custom resources that manage NVIDIA NIM microservices. To deploy the LLM and embedder services, enter the following:
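A minimal sketch of what the two NIMService resources could look like; the field names follow the NIM Operator custom resources, but the image tags, ports, and tolerations are assumptions, so use the manifests in the reference repository:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.8.3"                     # assumed tag
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct   # reuse the cached model
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule             # tolerate the GPU NodePool taint (assumed field)
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-embedqa-e5-v5
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
    tag: "1.5.0"                     # assumed tag
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-embedqa-e5-v5
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```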
The following is the expected output:
Validate the NIMServices:
The following is the expected output:
Our models are stored in an EFS volume, which is mounted to the EC2 instances through a PVC. That translates to faster pod startup times. In fact, notice in the preceding example that the NIMServices are ready in roughly 5 minutes. This time includes the launch of GPU nodes by Karpenter and the container image pull and start.
Compared to the 10–15 minutes required for internet-based model downloads, as experienced during the NIMCaches deployment, loading models from the local cache reduces startup time considerably, improving overall system scaling speed. Should you need even higher-performing storage, you could explore options such as Amazon FSx for Lustre.
Enter the following command to check the nodes again:
The following is the expected output:
Karpenter launched two new GPU instances to support the NIMServices, using the accelerated Bottlerocket AMI variant Bottlerocket (EKS Auto, Nvidia) 2025.4.21 (aws-k8s-1.32-nvidia). The number and type of instances launched might vary depending on Karpenter's algorithm, which takes into account parameters such as instance availability and cost.
Confirm that the NIMService STATUS is Ready before progressing further.
Chat-based assistant client
We now use a Python client that implements the chat-based assistant interface using the Gradio and LangChain libraries. Gradio creates the web interface and chat components, handling the frontend presentation. LangChain connects the various components and implements RAG across the services in our EKS cluster. Meta's llama-3-2-1b-instruct serves as the base language model, and nv-embedqa-e5-v5 creates text embeddings. OpenSearch acts as the vector store, managing these embeddings and enabling similarity search. This setup allows the chat-based assistant to retrieve relevant information and generate contextual responses.
Sequence diagram showing the question-answering workflow with the document upload process
- Enter the following commands to deploy the client, hosted on Amazon Elastic Container Registry (Amazon ECR) as a container image in the public gallery (the application's source files are available in the client folder of the cloned repository); a combined sketch of these client commands follows this list:
- Check the client pod status:
The following is an example output:
- Port-forward the client's service:
- Open a browser window at http://127.0.0.1:7860.
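A sketch of the client steps above; the manifest path, label, and service name come from the reference repository, so the values below are placeholders:

```bash
# Deploy the client (manifest path is a placeholder -- see the client folder of the repository)
kubectl apply -f client/client-deployment.yaml

# Check the client pod status (assumed label)
kubectl get pods -l app=chatbot-client

# Port-forward the client's service to localhost:7860 (assumed service name)
kubectl port-forward svc/chatbot-client 7860:7860
```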
In the following screenshot, we prompted the chat-based assistant about a topic that isn't in its knowledge base yet: "What is Amazon Nova Canvas."
The chat-based assistant can't find information on the topic and can't formulate a proper answer.
- Download the file at https://docs.aws.amazon.com/pdfs/ai/responsible-ai/nova-canvas/nova-canvas.pdf and upload its embeddings to OpenSearch Serverless using the client UI, switching to the Document upload tab at the top left, as shown in the following screenshot.
The expected result is nova-canvas.pdf appearing in the list of uploaded files, as shown in the following screenshot.
- Wait 15–30 seconds for OpenSearch Serverless to process and index the data. Ask the same question, "What is Amazon Nova Canvas," and you'll receive a different answer, as shown in the following screenshot.
Cleanup
To clean up the cluster and the EFS resources created so far, enter the following command:
Wait approximately 30 seconds for the mount targets to be removed, then enter the following command:
To delete the OpenSearch Serverless collection and policies, enter the following command:
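A sketch of the kind of cleanup commands involved, assuming the IDs, names, and environment variables used earlier in this walkthrough; the reference repository's cleanup instructions remain the authoritative version:

```bash
# Remove EFS mount targets, wait, then delete the file system and the cluster
for mt in $(aws efs describe-mount-targets --file-system-id ${EFS_ID} --region ${AWS_REGION} \
  --query 'MountTargets[].MountTargetId' --output text); do
  aws efs delete-mount-target --mount-target-id ${mt} --region ${AWS_REGION}
done
sleep 30
aws efs delete-file-system --file-system-id ${EFS_ID} --region ${AWS_REGION}
eksctl delete cluster --name ${CLUSTER_NAME} --region ${AWS_REGION}

# Delete the OpenSearch Serverless collection, policies, and VPC endpoint
export COLLECTION_ID=$(aws opensearchserverless list-collections --region ${AWS_REGION} \
  --query "collectionSummaries[?name=='chatbot-collection'].id" --output text)
aws opensearchserverless delete-collection --id ${COLLECTION_ID} --region ${AWS_REGION}
aws opensearchserverless delete-security-policy --name chatbot-encryption-policy --type encryption --region ${AWS_REGION}
aws opensearchserverless delete-security-policy --name chatbot-network-policy --type network --region ${AWS_REGION}
aws opensearchserverless delete-access-policy --name chatbot-data-policy --type data --region ${AWS_REGION}
aws opensearchserverless delete-vpc-endpoint --id ${OSS_VPCE_ID} --region ${AWS_REGION}
```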
Conclusion
In this post, we showed how to deploy a RAG-enabled chat-based assistant on Amazon EKS using NVIDIA NIM microservices, integrating an LLM for text generation, an embedding model, and Amazon OpenSearch Serverless for vector storage. Using EKS Auto Mode with GPU-accelerated AMIs, we streamlined our deployment by automating the setup of GPU infrastructure. We specified GPU-based instance types in our Karpenter NodePools, and the system automatically provisioned worker nodes with all the necessary NVIDIA components, including device plugins, the container toolkit, and kernel drivers. The implementation demonstrated the effectiveness of RAG, with the chat-based assistant providing informed responses when accessing relevant information from its knowledge base. This architecture showcases how Amazon EKS can streamline the deployment of AI solutions while maintaining production-grade reliability and scalability.
As a challenge, try enhancing the chat-based assistant application by implementing chat history functionality to preserve context across conversations. This allows the LLM to reference previous exchanges and provide more contextually relevant responses. To learn more about running artificial intelligence and machine learning (AI/ML) workloads on Amazon EKS, check out our EKS best practices guide for running AI/ML workloads, join one of our Get Hands On with Amazon EKS event series, and visit the AI on EKS deployment-ready blueprints.
About the authors
Riccardo Freschi is a Senior Solutions Architect at AWS who focuses on modernization. He helps partners and customers transform their IT landscapes by designing and implementing modern cloud-native architectures on AWS. His focus areas include container-based applications on Kubernetes, cloud-native development, and establishing modernization strategies that drive business value.
Christina Andonov is a Sr. Specialist Solutions Architect at AWS, helping customers run AI workloads on Amazon EKS with open source tools. She's passionate about Kubernetes and known for making complex concepts easy to understand.