PDI Technologies is a global leader in the convenience retail and petroleum wholesale industries. They help businesses around the globe increase efficiency and profitability by securely connecting their data and operations. With 40 years of experience, PDI Technologies assists customers in all aspects of their business, from understanding consumer behavior to simplifying technology ecosystems across the supply chain.
Enterprises face a significant challenge in making their knowledge bases accessible, searchable, and usable by AI systems. Internal teams at PDI Technologies were struggling with information scattered across disparate systems, including websites, Confluence pages, SharePoint sites, and various other data sources. To address this, PDI Technologies built PDI Intelligence Query (PDIQ), an AI assistant that gives employees access to company knowledge through an easy-to-use chat interface. This solution is powered by a custom Retrieval Augmented Generation (RAG) system, built on Amazon Web Services (AWS) using serverless technologies. Building PDIQ required addressing the following key challenges:
- Automatically extracting content from diverse sources with different authentication requirements
- Needing the flexibility to select, apply, and interchange the most suitable large language model (LLM) for different processing requirements
- Processing and indexing content for semantic search and contextual retrieval
- Creating a knowledge foundation that enables accurate, relevant AI responses
- Continuously refreshing information through scheduled crawling
- Supporting enterprise-specific context in AI interactions
In this post, we walk through the PDIQ process flow and architecture, focusing on the implementation details and the business outcomes it has helped PDI achieve.
Solution architecture
In this section, we explore PDIQ’s end-to-end design. We examine the data ingestion pipeline from initial processing through storage to user search capabilities, as well as the zero-trust security framework that protects key user personas throughout their platform interactions. The architecture consists of these elements:
- Scheduler – Amazon EventBridge maintains and executes the crawler schedule.
- Crawlers – AWS Lambda invokes crawlers that are executed as tasks by Amazon Elastic Container Service (Amazon ECS).
- Amazon DynamoDB – Persists crawler configurations and other metadata such as Amazon Simple Storage Service (Amazon S3) image locations and captions.
- Amazon S3 – All source documents are stored in Amazon S3. Amazon S3 events trigger the downstream flow for every object that’s created or deleted.
- Amazon Simple Notification Service (Amazon SNS) – Receives notifications from Amazon S3 events.
- Amazon Simple Queue Service (Amazon SQS) – Subscribed to Amazon SNS to hold the incoming requests in a queue.
- AWS Lambda – Handles the business logic for chunking, summarizing, and generating vector embeddings.
- Amazon Bedrock – Provides API access to the foundation models (FMs) used by PDIQ, including Amazon Nova Lite, Amazon Nova Pro, and Amazon Titan Text Embeddings V2.
- Amazon Aurora PostgreSQL-Compatible Edition – Stores vector embeddings.
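To make the S3 to SNS to SQS to Lambda hand-off more concrete, the following is a minimal Python sketch of how a Lambda function might unwrap the nested notification. It is not PDIQ's code. It assumes the queue is subscribed to the topic without raw message delivery, so each SQS body is an SNS envelope whose `Message` field carries the S3 event as a JSON string. The function name `extract_s3_events` and the `upsert`/`delete` labels are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_events(sqs_event):
    """Yield (action, bucket, key) for each S3 record in an SQS batch.

    Assumption: SQS body -> SNS envelope -> "Message" -> S3 event JSON.
    """
    for sqs_record in sqs_event.get("Records", []):
        sns_envelope = json.loads(sqs_record["body"])
        s3_event = json.loads(sns_envelope["Message"])
        # S3 test events carry no "Records" list, so default to empty.
        for s3_record in s3_event.get("Records", []):
            event_name = s3_record["eventName"]
            if event_name.startswith("ObjectCreated:"):
                action = "upsert"   # hypothetical label: (re)process the document
            elif event_name.startswith("ObjectRemoved:"):
                action = "delete"   # hypothetical label: clean up stored records
            else:
                continue            # ignore event types this flow does not handle
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            yield action, bucket, key
```

Placing a queue between the topic and the function lets the processing step absorb bursts of crawler output at its own pace, which is the role the queue plays in the list above.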
The following diagram shows the solution architecture.

Next, we review how PDIQ implements a zero-trust security model with role-based access control for two key personas:
- Administrators configure knowledge bases and crawlers through Amazon Cognito user groups integrated with enterprise single sign-on. Crawler credentials are encrypted at rest using AWS Key Management Service (AWS KMS) and are only accessible within isolated execution environments.
- End users access knowledge bases based on group permissions validated at the application layer. Users can belong to multiple groups (such as human resources or compliance) and switch contexts to query role-appropriate datasets.
Process flow
In this section, we review the end-to-end process flow. We break it down into sections to dive deeper into each step and explain its functionality.

Crawlers
Crawlers are configured by an administrator to collect data from the variety of sources that PDI relies on. Crawlers hydrate the data into the knowledge base so that this information can be retrieved by end users. PDIQ currently supports the following crawler configurations:
- Web crawler – By using Puppeteer for headless browser automation, the crawler converts HTML web pages to markdown format using turndown. By following the embedded links on the website, the crawler can capture full context and relationships between pages. Additionally, the crawler downloads assets such as PDFs and images while preserving the original reference, and offers users configuration options such as rate limiting.
- Confluence crawler – This crawler uses the Confluence REST API with authenticated access to extract page content, attachments, and embedded images. It preserves page hierarchy and relationships and handles special Confluence elements such as info boxes, notes, and many more.
- Azure DevOps crawler – PDI uses Azure DevOps to manage its code base, track commits, and maintain project documentation in a centralized repository. PDIQ uses the Azure DevOps REST API with OAuth or personal access token (PAT) authentication to extract this information. The Azure DevOps crawler preserves project hierarchy, sprint relationships, and backlog structure, and also maps work item relationships (such as parent/child or linked items), thereby providing a complete view of the dataset.
- SharePoint crawler – This crawler uses the Microsoft Graph API with OAuth authentication to extract document libraries, lists, pages, and file content. The crawler processes MS Office documents (Word, Excel, PowerPoint) into searchable text and maintains document version history and permission metadata.
By building separate crawler configurations, PDIQ makes it easy to extend the platform with additional crawlers on demand. It also gives administrators the flexibility to configure the settings for their respective crawlers (such as frequency, depth, or rate limits).
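As an illustration of what such per-crawler settings could look like, the following is a minimal sketch of a configuration record with basic validation. The class name, field names, defaults, and the `SUPPORTED_SOURCES` set are all hypothetical; the post does not describe PDIQ's actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the four crawler types described above.
SUPPORTED_SOURCES = {"web", "confluence", "azure_devops", "sharepoint"}


@dataclass(frozen=True)
class CrawlerConfig:
    """Illustrative per-crawler settings (frequency, depth, rate limit)."""
    name: str
    source_type: str
    frequency_hours: int = 24          # how often the crawler is scheduled
    max_depth: int = 3                 # how many link levels to follow
    requests_per_second: float = 2.0   # rate limit against the source

    def __post_init__(self):
        if self.source_type not in SUPPORTED_SOURCES:
            raise ValueError(f"unsupported source type: {self.source_type}")
        if self.frequency_hours <= 0:
            raise ValueError("frequency_hours must be positive")
        if self.max_depth < 0:
            raise ValueError("max_depth must not be negative")
        if self.requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")

    def min_delay_seconds(self) -> float:
        """Smallest pause between requests that respects the rate limit."""
        return 1.0 / self.requests_per_second
```

Under this sketch, supporting a new source would mean adding one identifier and one crawler implementation, while the scheduling and rate-limit settings stay the same shape for every crawler.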
The following figure shows the PDIQ UI to configure the knowledge base.

The following figure shows the PDIQ UI to configure your crawler (such as Confluence).

The following figure shows the PDIQ UI to schedule crawlers.

Handling images
Crawled data is stored in Amazon S3 with proper metadata tags. If the source is in HTML format, the task converts the content into markdown (.md) files. For these markdown files, an additional optimization step replaces the images in the document with their Amazon S3 reference location. Key benefits of this approach include:
- PDI can use S3 object keys to uniquely reference each image, thereby optimizing the synchronization process to detect changes in source data
- You can optimize storage by replacing images with captions and avoiding the need to store duplicate images
- It provides the ability to make the content of the images searchable and relatable to the text content in the document
- It can seamlessly inject the original images when rendering a response to a user inquiry
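The replacement step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PDIQ's implementation: it assumes images appear in standard markdown syntax (`![alt](url)`), and it derives a hypothetical object key from a hash of the source URL so that the same image always maps to the same key.

```python
import hashlib
import posixpath
import re
from urllib.parse import urlparse

# Assumption: images use standard markdown syntax, ![alt](url).
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def s3_key_for_image(source_url, prefix="images/"):
    """Derive a stable object key from the image URL (hypothetical scheme)."""
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:16]
    extension = posixpath.splitext(urlparse(source_url).path)[1].lower()
    return f"{prefix}{digest}{extension}"


def replace_images_with_s3(markdown, bucket):
    """Rewrite image links to S3 locations.

    Returns the rewritten markdown and a {source_url: object_key} mapping
    that a crawler could use to upload each distinct image once.
    """
    mapping = {}

    def _swap(match):
        alt_text, url = match.group(1), match.group(2)
        key = mapping.setdefault(url, s3_key_for_image(url))
        return f"![{alt_text}](s3://{bucket}/{key})"

    return IMAGE_PATTERN.sub(_swap, markdown), mapping
```

Because the key is deterministic, an image referenced from several pages resolves to one object, which matches the deduplication benefit listed above. Hashing the image bytes instead of the URL would be a reasonable alternative when detecting changed images matters more than avoiding a download.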
The following is a sample markdown file where images are replaced with the S3 file location:
Document processing
This is the most critical step of the process. The key objective of this step is to generate vector embeddings so that they can be used for similarity matching and effective retrieval based on user inquiries. The process follows several steps, starting with image captioning, then document chunking, summary generation, and embedding generation. To caption the images, PDIQ scans the markdown files to find image tags.
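A minimal sketch of the scan-and-caption step follows. It assumes image tags use standard markdown syntax and that the caption is written on the line before the tag with an `Image caption:` label; both are assumptions made for illustration. The `caption_for` callable is a hypothetical stand-in for the model call that produces a caption for an image location.

```python
import re
from typing import Callable

# Assumption: image tags use standard markdown syntax, ![alt](location).
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def inject_captions(markdown: str, caption_for: Callable[[str], str]) -> str:
    """Insert a caption line before every image tag found in the markdown.

    caption_for is a placeholder for whatever produces the caption,
    for example a call to a multimodal model with the image bytes.
    """
    def _with_caption(match):
        location = match.group(1)
        caption = caption_for(location).strip()
        # Keep the original tag so the image can still be rendered later.
        return f"Image caption: {caption}\n{match.group(0)}"

    return IMAGE_PATTERN.sub(_with_caption, markdown)
```

Keeping the caption as plain text next to the tag means it is embedded together with the surrounding paragraph, so a query about what an image shows can match the chunk that contains it.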
The following is an example of an image caption prompt:
The following is a snippet of a markdown file that contains the image tag, the LLM-generated caption, and the corresponding S3 file location:
Now that the markdown files are injected with image captions, the next step is to break the original document into chunks that fit into the context window of the embeddings model. PDIQ uses the Amazon Titan Text Embeddings V2 model to generate vectors and stores them in Aurora PostgreSQL-Compatible Serverless. Based on internal accuracy testing and chunking best practices from AWS, PDIQ allocates the tokens of each chunk as follows:
- 70% of the tokens for content
- 10% overlap between chunks
- 20% for summary tokens
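To see what this split means in absolute numbers, the following sketch turns the percentages into token budgets. The function name is hypothetical, and the 8,192-token window used in the example is an assumption about the embeddings model's input limit rather than a figure from this post.

```python
def token_budget(context_window, content_pct=70, overlap_pct=10, summary_pct=20):
    """Split a context window into content, overlap, and summary budgets.

    Integer percentages and floor division keep the total at or below
    the window size.
    """
    if content_pct + overlap_pct + summary_pct != 100:
        raise ValueError("percentages must add up to 100")
    return {
        "content": context_window * content_pct // 100,
        "overlap": context_window * overlap_pct // 100,
        "summary": context_window * summary_pct // 100,
    }


# Assumed 8,192-token window.
print(token_budget(8192))
# {'content': 5734, 'overlap': 819, 'summary': 1638}
```

Rounding down leaves one token unused in this example (5,734 + 819 + 1,638 = 8,191), which is preferable to exceeding the model's limit.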
Using the document chunking logic from the previous step, the document is converted into vector embeddings. The process includes:
- Calculate chunk parameters – Determine the size and total number of chunks required for the document based on the 70% calculation.
- Generate document summary – Use Amazon Nova Lite to create a summary of the entire document, constrained by the 20% token allocation. This summary is reused across all chunks to provide consistent context.
- Chunk and prepend summary – Split the document into overlapping chunks (10%), with the summary prepended at the top.
- Generate embeddings – Use Amazon Titan Text Embeddings V2 to generate vector embeddings for each chunk (summary plus content), which are then stored in the vector store.
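The chunk-and-prepend step can be sketched as follows. This is one possible reading of the steps above, not PDIQ's code: whitespace-separated words stand in for model tokens, each chunk carries `content_tokens` of new text plus `overlap_tokens` shared with the next chunk, and the summary is passed in as a ready-made string. In a real pipeline the tokenizer and the summary would come from the models named above.

```python
def chunk_with_summary(text, summary, content_tokens, overlap_tokens):
    """Split text into overlapping chunks, each prefixed with the summary.

    Assumption: whitespace-separated words approximate model tokens.
    """
    if content_tokens <= 0 or overlap_tokens < 0:
        raise ValueError("content_tokens must be > 0 and overlap_tokens >= 0")

    tokens = text.split()
    window = content_tokens + overlap_tokens  # source tokens per chunk
    chunks = []
    start = 0
    while start < len(tokens):
        body = " ".join(tokens[start:start + window])
        chunks.append(f"{summary}\n\n{body}")
        if start + window >= len(tokens):
            break                  # this chunk reached the end of the document
        start += content_tokens    # advance so the next chunk re-reads the overlap
    return chunks
```

For a 25-word document with `content_tokens=7` and `overlap_tokens=1`, this produces four chunks, and the last word of each chunk is repeated as the first word of the next. Each returned string (summary plus content) is what would be sent to the embeddings model.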
By designing a customized approach that places a summary section atop every chunk, PDIQ ensures that when a particular chunk is matched by similarity search, the LLM has access to the summary of the entire document and not only to the chunk that matched. This approach enriches the end user experience, resulting in an increase in the approval rate for accuracy from 60% to 79%.
The following is an example of a summarization prompt:
The following is an example of the summary text, which is available on each chunk:
Chunk 1 has a summary at the top followed by details from the source:
Chunk 2 has a summary at the top, followed by a continuation of details from the source:
PDIQ scans each document chunk and generates vector embeddings. This data is stored in the Aurora PostgreSQL database with key attributes, including a unique knowledge base ID, the corresponding embeddings attribute, the original text (summary plus chunk plus image caption), and a JSON binary object that includes metadata fields for extensibility. To keep the knowledge base in sync, PDI implements the following steps:
- Add – These are net new source objects that should be ingested. PDIQ implements the document processing flow described previously.
- Update – If PDIQ determines the same object is present, it compares the hash key value from the source with the hash value from the JSON object.
- Delete – If PDIQ determines that a specific source document no longer exists, it triggers a delete operation on the S3 bucket (s3:ObjectRemoved:*), which results in a cleanup job that deletes the records corresponding to the key value in the Aurora table.
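The three cases can be expressed as a small decision function. The following sketch is illustrative only: the function names and action labels are hypothetical, SHA-256 is an assumed choice of hash, and treating a hash mismatch as "reprocess the document" is an assumption, because the post only states that the two hashes are compared.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hash of the source content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def sync_action(source_hash: Optional[str], stored_hash: Optional[str]) -> str:
    """Decide what to do with one document.

    source_hash: hash of the document at the source, or None if it is gone.
    stored_hash: hash kept in the metadata JSON, or None if never ingested.
    """
    if source_hash is None:
        # Nothing at the source: clean up if we hold records, else do nothing.
        return "delete" if stored_hash is not None else "skip"
    if stored_hash is None:
        return "add"       # net new object, run the full processing flow
    if source_hash != stored_hash:
        return "update"    # assumed behavior: content changed, reprocess
    return "skip"          # unchanged, no embeddings need to be regenerated
```

Comparing hashes before reprocessing avoids regenerating summaries and embeddings for documents that have not changed between crawls.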
PDI uses Amazon Nova Pro to retrieve the most relevant document and generate a response by following these key steps:
- Using similarity search, PDIQ retrieves the most relevant document chunks, which include the summary, chunk data, image caption, and image link.
- For the matching chunk, PDIQ retrieves the entire document.
- The LLM then replaces the image link with the actual image from Amazon S3.
- The LLM generates a response based on the data retrieved and the preconfigured system prompt.
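The ranking behind the first step can be illustrated with a small in-memory example. In PDIQ the similarity search runs against the vector store, so this sketch only shows the idea of the ranking. Cosine similarity is an assumed metric, and the function names and the `(text, embedding)` row shape are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_embedding, rows, k=3):
    """Return the k best (score, text) pairs, highest score first.

    rows is an iterable of (text, embedding) pairs, where text is the
    stored summary plus chunk plus image caption.
    """
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The texts returned by a search like this, together with the system prompt, form the context that the response model receives.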
The following is a snippet of the system prompt:
Results and next steps
By building this customized RAG solution on AWS, PDI realized the following benefits:
- Flexible configuration options allow data ingestion at consumer-preferred frequencies.
- Scalable design enables future ingestion from additional source systems through easily configurable crawlers.
- Supports crawler configuration using multiple authentication methods, including username and password, secret key-value pairs, and API keys.
- Customizable metadata fields enable advanced filtering and improve query performance.
- Dynamic token management helps PDI intelligently balance tokens between content and summaries, improving user responses.
- Consolidates diverse source data formats into a unified layout for streamlined storage and retrieval.
PDIQ provides key business outcomes that include:
- Improved efficiency and resolution rates – The tool empowers PDI support teams to resolve customer queries significantly faster, often automating routine issues and providing immediate, precise responses. This has led to less time spent by customers waiting on case resolution and to more productive agents.
- High customer satisfaction and loyalty – By delivering accurate, relevant, and personalized answers grounded in live documentation and company knowledge, PDIQ increased customer satisfaction scores (CSAT), net promoter scores (NPS), and overall loyalty. Customers feel heard and supported, strengthening PDI brand relationships.
- Cost reduction – PDIQ handles the bulk of repetitive queries, allowing limited support staff to focus on expert-level cases, which improves productivity and morale. Additionally, PDIQ is built on a serverless architecture, which automatically scales while minimizing operational overhead and cost.
- Business flexibility – A single platform can serve different business units, which can curate the content by configuring their respective data sources.
- Incremental value – Each new content source adds measurable value without system redesign.
PDI continues to enhance the application with several planned improvements in the pipeline, including:
- Build additional crawler configurations for new data sources (for example, GitHub).
- Build an agentic implementation for PDIQ to be integrated into larger, more complex business processes.
- Enhanced document understanding with table extraction and structure preservation.
- Multilingual support for global operations.
- Improved relevance ranking with hybrid retrieval techniques.
- Ability to invoke PDIQ based on events (for example, source commits).
Conclusion
The PDIQ service has transformed how users access and use enterprise knowledge at PDI Technologies. By using Amazon serverless services, PDIQ can automatically scale with demand, reduce operational overhead, and optimize costs. The solution’s unique approach to document processing, including the dynamic token management and the custom image captioning system, represents significant technical innovation in enterprise RAG systems. The architecture successfully balances performance, cost, and scalability while maintaining security and authentication requirements. As PDI Technologies continues to expand PDIQ’s capabilities, they’re excited to see how this architecture can adapt to new sources, formats, and use cases.
About the authors
Samit Kumbhani is an Amazon Web Services (AWS) Senior Solutions Architect in the New York City area with over 18 years of experience. He currently partners with independent software vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.
Jhorlin De Armas is an Architect II at PDI Technologies, where he leads the design of AI-driven platforms on Amazon Web Services (AWS). Since joining PDI in 2024, he has architected a compositional AI service that enables configurable assistants, agents, knowledge bases, and guardrails using Amazon Bedrock, Aurora Serverless, AWS Lambda, and DynamoDB. With over 18 years of experience building enterprise software, Jhorlin specializes in cloud-centered architectures, serverless platforms, and AI/ML solutions.
David Mbonu is a Sr. Solutions Architect at Amazon Web Services (AWS), helping horizontal business application ISV customers build and deploy transformational solutions on AWS. David has over 27 years of experience in enterprise solutions architecture and system engineering across software, FinTech, and public cloud companies. His recent interests include AI/ML, data strategy, observability, resiliency, and security. David and his family reside in Sugar Hill, GA.

