This publish was co-written with Saurabh Gupta and Todd Colby from Pushpay.
Pushpay is a market-leading digital giving and engagement platform designed to assist church buildings and faith-based organizations drive neighborhood engagement, handle donations, and strengthen generosity fundraising processes effectively. Pushpay’s church administration system offers church directors and ministry leaders with insight-driven reporting, donor growth dashboards, and automation of economic workflows.
Utilizing the facility of generative AI, Pushpay developed an progressive agentic AI search characteristic constructed for the distinctive wants of ministries. The method makes use of pure language processing so ministry workers can ask questions in plain English and generate real-time, actionable insights from their neighborhood knowledge. The AI search characteristic addresses a vital problem confronted by ministry leaders: the necessity for fast entry to neighborhood insights with out requiring technical experience. For instance, ministry leaders can enter “present me people who find themselves members in a gaggle, however haven’t given this 12 months” or “present me people who find themselves not engaged in my church,” and use the outcomes to take significant motion to raised assist people of their neighborhood. Most neighborhood leaders are time-constrained and lack technical backgrounds; they will use this answer to acquire significant knowledge about their congregations in seconds utilizing pure language queries.
By empowering ministry workers with quicker entry to neighborhood insights, the AI search characteristic helps Pushpay’s mission to encourage generosity and connection between church buildings and their neighborhood members. Early adoption customers report that this answer has shortened their time to insights from minutes to seconds. To attain this end result, the Pushpay staff constructed the characteristic utilizing agentic AI capabilities on Amazon Internet Companies (AWS) whereas implementing sturdy high quality assurance measures and establishing a speedy iterative suggestions loop for steady enhancements.
On this publish, we stroll you thru Pushpay’s journey in constructing this answer and discover how Pushpay used Amazon Bedrock to create a customized generative AI analysis framework for steady high quality assurance and establishing speedy iteration suggestions loops on AWS.
Resolution overview: AI powered search structure
The answer consists of a number of key elements that work collectively to ship an enhanced search expertise. The next determine reveals the answer structure diagram and the general workflow.

Determine 1: AI Search Resolution Structure
- Person interface layer: The answer begins with Pushpay customers submitting pure language queries via the present Pushpay software interface. Through the use of pure language queries, church ministry workers can get hold of knowledge insights utilizing AI capabilities with out studying new instruments or interfaces.
- AI search agent: On the coronary heart of the system lies the AI search agent, which consists of two key elements:
- System immediate: Incorporates the massive language mannequin (LLM) function definitions, directions, and software descriptions that information the agent’s habits.
- Dynamic immediate constructor (DPC): routinely constructs extra personalized system prompts primarily based on the person particular data, resembling church context, pattern queries, and software filter stock. Additionally they use semantic search to pick out solely related filters amongst a whole bunch of obtainable software filters. The DPC improves response accuracy and person expertise.
- Amazon Bedrock superior characteristic: The answer makes use of the next Amazon Bedrock managed providers:
- Immediate caching: Reduces latency and prices by caching ceaselessly used system immediate.
- LLM processing: Makes use of Claude Sonnet 4.5 to course of prompts and generate JSON output required by the appliance to show the specified question outcomes as insights to customers.
- Analysis system: The analysis system implements a closed-loop enchancment answer the place person interactions are instrumented, captured and evaluated offline. The analysis outcomes feed right into a dashboard for product and engineering groups to investigate and drive iterative enhancements to the AI search agent. Throughout this course of, the information science staff collects a golden dataset and constantly curates this dataset primarily based on the precise person queries coupled with validated responses.
The challenges of preliminary answer with out analysis
To create the AI search characteristic, Pushpay developed the primary iteration of the AI search agent. The answer implements a single agent configured with a rigorously tuned system immediate that features the system function, directions, and the way the person interface works with detailed clarification of every filter device and their sub-settings. The system immediate is cached utilizing Amazon Bedrock immediate caching to cut back token value and latency. The agent makes use of the system immediate to invoke an Amazon Bedrock LLM which generates the JSON doc that Pushpay’s software makes use of to use filters and current question outcomes to customers.
Nevertheless, this primary iteration shortly revealed some limitations. Whereas it demonstrated a 60-70% success fee with fundamental enterprise queries, the staff reached an accuracy plateau. The analysis of the agent was a handbook and tedious course of Tuning the system immediate past this accuracy threshold proved difficult given the various spectrum of person queries and the appliance’s protection of over 100 distinct configurable filters. These offered vital blockers for the staff’s path to manufacturing.

Determine 2: AI Search First Resolution
Enhancing the answer by including a customized generative AI analysis framework
To handle the challenges of measuring and bettering agent accuracy, the staff carried out a generative AI analysis framework built-in into the present structure, proven within the following determine. This framework consists of 4 key elements that work collectively to supply complete efficiency insights and allow data-driven enhancements.

Determine 3: Introducing the GenAI Analysis Framework
- The golden dataset: A curated golden dataset containing over 300 consultant queries, every paired with its corresponding anticipated output, types the inspiration of automated analysis. The product and knowledge science groups rigorously developed and validated this dataset to realize complete protection of real-world use circumstances and edge circumstances. Moreover, there’s a steady curation technique of including consultant precise person queries with validated outcomes.
- The evaluator: The evaluator part processes person enter queries and compares the agent-generated output in opposition to the golden dataset utilizing the LLM as a decide sample This method generates core accuracy metrics whereas capturing detailed logs and efficiency knowledge, resembling latency, for additional evaluation and debugging.
- Area class: Area classes are developed utilizing a mix of generative AI area summarization and human-defined common expressions to successfully categorize person queries. The evaluator determines the area class for every question, enabling nuanced, category-based analysis as a further dimension of analysis metrics.
- Generative AI analysis dashboard: The dashboard serves because the mission management for Pushpay’s product and engineering groups, displaying area category-level metrics to evaluate efficiency and latency and information choices. It shifts the staff from single combination scores to nuanced, domain-based efficiency insights.
The accuracy dashboard: Pinpointing weaknesses by area
As a result of person queries are categorized into area classes, the dashboard incorporates statistical confidence visualization utilizing a 95% Wilson rating interval to show accuracy metrics and question volumes at every area degree. Through the use of classes, the staff can pinpoint the AI agent’s weaknesses by area. Within the following instance , the “exercise” area reveals considerably decrease accuracy than different classes.

Determine 4: Pinpointing Agent Weaknesses by Area
Moreover, a efficiency dashboard, proven within the following determine, visualizes latency indicators on the area class degree, together with latency distributions from p50 to p90 percentiles. Within the following instance, the exercise area reveals notably greater latency than others.

Determine 5: Figuring out Latency Bottlenecks by Area
Strategic rollout via domain-Stage insights
Area-based metrics revealed various efficiency ranges throughout semantic domains, offering essential insights into agent effectiveness. Pushpay used this granular visibility to make strategic characteristic rollout choices. By briefly suppressing underperforming classes—resembling exercise queries—whereas present process optimization, the system achieved 95% total accuracy. Through the use of this method, customers skilled solely the highest-performing options whereas the staff refined others to manufacturing requirements.

Determine 6: Reaching 95% Accuracy with Area-Stage Characteristic Rollout
Strategic prioritization: Specializing in high-impact domains
To prioritize enhancements systematically, Pushpay employed a 2×2 matrix framework plotting subjects in opposition to two dimensions (proven within the following determine): Enterprise precedence (vertical axis) and present efficiency or feasibility (horizontal axis). This visualization positioned subjects with each excessive enterprise worth and robust current efficiency within the top-right quadrant. The staff then targeted on these areas as a result of they required much less heavy lifting to realize additional accuracy enchancment from already-good ranges to an distinctive 95% accuracy for the enterprise targeted subjects.
The implementation adopted an iterative cycle: after every spherical of enhancements, they re-analyze the outcomes to determine the subsequent set of high-potential subjects. This systematic, cyclical method enabled steady optimization whereas sustaining concentrate on business-critical areas.

Determine 7: Strategic Prioritization Framework for Area Class Optimization
Dynamic immediate building
The insights gained from the analysis framework led to an architectural enhancement: the introduction of a dynamic immediate constructor. This part enabled speedy iterative enhancements by permitting fine-grained management over which area classes the agent may deal with. The structured area stock – beforehand embedded within the system immediate – was remodeled right into a dynamic aspect, utilizing semantic search to assemble contextually related prompts for every person question. This method tailors the immediate filter stock primarily based on three key contextual dimensions: question content material, person persona, and tenant-specific necessities. The result’s a extra exact and environment friendly system that generates extremely related responses whereas sustaining the pliability wanted for steady optimization.
Enterprise influence
The generative AI analysis framework turned the cornerstone of Pushpay’s AI characteristic growth, delivering measurable worth throughout three dimensions:
- Person expertise: The AI search characteristic diminished time-to-insight from roughly 120 seconds (skilled customers manually navigating advanced UX) to underneath 4 seconds – a 15-fold acceleration that immediately helps improve ministry leaders’ productiveness and decision-making pace. This characteristic democratized knowledge insights, in order that customers of various technical ranges can entry significant intelligence with out requiring specialised experience.
- Growth velocity: The scientific analysis method remodeled optimization cycles. Moderately than debating immediate modifications, the staff now validates adjustments and measures domain-specific impacts inside minutes, changing extended deliberations with data-driven iteration.
- Manufacturing readiness: Enhancements from 60–70% accuracy to greater than 95% accuracy utilizing high-performance domains offered the quantitative confidence required for customer-facing deployment, whereas the framework’s structure allows steady refinement throughout different area classes.
Key takeaways in your AI agent journey
The next are key takeaways from Pushpay’s expertise that you need to use in your personal AI agent journey.
1/ Construct with manufacturing in thoughts from day one
Constructing agentic AI programs is easy, however scaling them to manufacturing is difficult. Builders ought to undertake a scaling mindset throughout the proof-of-concept part, not after. Implementing sturdy tracing and analysis frameworks early, offers a transparent pathway from experimentation to manufacturing. Through the use of this methodology, groups can determine and deal with accuracy points systematically earlier than they change into blockers.
2/ Benefit from the superior options of Amazon Bedrock
Amazon Bedrock immediate caching considerably reduces token prices and latency by caching ceaselessly used system prompts. For brokers with giant, steady system prompts, this characteristic is crucial for production-grade efficiency.
3/ Suppose past combination metrics
Combination accuracy scores can generally masks vital efficiency variations. By evaluating agent efficiency on the area class degree, Pushpay uncovered weaknesses past what a single accuracy metric can seize. This granular method allows focused optimization and knowledgeable rollout choices, ensuring customers solely expertise high-performing options whereas others are refined.
4/ Knowledge safety and accountable AI
When creating agentic AI programs, think about data safety and LLM safety concerns from the outset, following the AWS Shared Accountability Mannequin, as a result of safety necessities essentially influence the architectural design. Pushpay’s clients are church buildings and faith-based organizations who’re stewards of delicate data—together with pastoral care conversations, monetary giving patterns, household struggles, prayer requests and extra. On this implementation instance, Pushpay set a transparent method to incorporating AI ethically inside its product ecosystem, sustaining strict safety requirements to make sure church knowledge and personally identifiable data (PII) stays inside its safe partnership ecosystem. Knowledge is shared solely with safe and applicable knowledge protections utilized and isn’t used to coach exterior fashions. To study extra about Pushpay’s requirements for incorporating AI inside their merchandise, go to the Pushpay Data Heart for a extra in-depth evaluation of firm requirements.
Conclusion: Your Path to Manufacturing-Prepared AI Brokers
Pushpay’s journey from a 60–70% accuracy prototype to a 95% correct production-ready AI agent demonstrates that constructing dependable agentic AI programs requires extra than simply subtle prompts—it calls for a scientific, data-driven method to analysis and optimization. The important thing breakthrough wasn’t within the AI know-how itself, however in implementing a complete analysis framework constructed on robust observability basis that offered granular visibility into agent efficiency throughout completely different domains. This systematic method enabled speedy iteration, strategic rollout choices, and steady enchancment.
Able to construct your personal production-ready AI agent?
- Discover Amazon Bedrock: Start constructing your agent with Amazon Bedrock
- Implement LLM-as-a-judge: Create your personal analysis system utilizing the patterns described on this LLM-as-a-judge on Amazon Bedrock Mannequin Analysis
- Construct your golden dataset: Begin curating consultant queries and anticipated outputs in your particular use case
Concerning the authors
Roger Wang is a Senior Resolution Architect at AWS. He’s a seasoned architect with over 20 years of expertise within the software program business. He helps New Zealand and world software program and SaaS corporations use cutting-edge know-how at AWS to unravel advanced enterprise challenges. Roger is obsessed with bridging the hole between enterprise drivers and technological capabilities and thrives on facilitating conversations that drive impactful outcomes.
Melanie Li, PhD, is a Senior Generative AI Specialist Options Architect at AWS primarily based in Sydney, Australia, the place her focus is on working with clients to construct options leveraging state-of-the-art AI and machine studying instruments. She has been actively concerned in a number of Generative AI initiatives throughout APJ, harnessing the facility of Giant Language Fashions (LLMs). Previous to becoming a member of AWS, Dr. Li held knowledge science roles within the monetary and retail industries.
Frank Huang, PhD, is a Senior Analytics Specialist Options Architect at AWS primarily based in Auckland, New Zealand. He focuses on serving to clients ship superior analytics and AI/ML options. All through his profession, Frank has labored throughout a wide range of industries resembling monetary providers, Web3, hospitality, media and leisure, and telecommunications. Frank is raring to make use of his deep experience in cloud structure, AIOps, and end-to-end answer supply to assist clients obtain tangible enterprise outcomes with the facility of knowledge and AI.
Saurabh Gupta is an information science and AI skilled at Pushpay primarily based in Auckland, New Zealand, the place he focuses on implementing sensible AI options and statistical modeling. He has in depth expertise in machine studying, knowledge science, and Python for knowledge science purposes, with specialised expertise coaching in database brokers and AI implementation. Previous to his present function, he gained expertise in telecom, retail and monetary providers, creating experience in advertising and marketing analytics and buyer retention applications. He has a Grasp’s in Statistics from College of Auckland and a Grasp’s in Enterprise Administration from the Indian Institute of Administration, Calcutta.
Todd Colby is a Senior Software program Engineer at Pushpay primarily based in Seattle. His experience is concentrated on evolving advanced legacy purposes with AI, and translating person wants into structured, high-accuracy options. He leverages AI to extend supply velocity and produce leading edge metrics and enterprise choice instruments.


