Construct interactive PDF textual content extraction from Amazon S3

Image this: a compliance officer wants a particular clause throughout an audit, an legal professional wants contract phrases whereas a consumer waits on the cellphone, or a finance analyst wants numbers from final quarter’s report earlier than a gathering that begins in 10 minutes. In every case, ready for a scheduled job to complete just isn’t sensible. You want on-demand entry to the textual content inside your PDFs.

On this put up, you’ll construct a server that extracts textual content from PDF recordsdata in Amazon S3 in actual time. This protocol-based method offers programmatic doc entry. You’ll stroll via the structure, arrange the server, and run interactive doc queries. Alongside the best way, you’ll examine this method with Amazon Textract so you’ll be able to determine which software matches your workload.

We constructed this answer after working with a number of groups who shared the identical frustration: their paperwork lived in Amazon S3, however getting textual content out of them on demand meant both writing customized scripts or ready on batch pipelines. This MCP server method sits in between, supplying you with interactive entry with minimal setup. Interactive PDF textual content extraction from Amazon S3 provides you real-time solutions out of your paperwork with out batch pipelines or heavy infrastructure.

This MCP-based choice works properly for text-based PDFs in growth and proof of idea settings. For complicated doc processing like optical character recognition (OCR), type extraction, and format evaluation, Amazon Textract stays the really useful selection.

Who advantages from this method

This answer matches a number of widespread roles. If these eventualities sound like your day-to-day, learn on.

Compliance and authorized groups: Throughout a time-sensitive evaluate, you might want to find a particular clause buried in a 200-page coverage doc or contract. Looking manually takes too lengthy. With this answer, you ask a query in pure language and get the related passage again in seconds.

Monetary providers groups: Throughout an audit session, you want fast entry to the precise wording of an inside danger coverage or regulatory submitting. This answer enables you to pull that info instantly out of your Amazon S3 doc repository with out leaving your terminal.

Government groups: Throughout strategic planning conferences, you’ll be able to question a PDF on the spot when somebody asks a few knowledge level from final quarter’s earnings report. No flipping via printed copies or ready for somebody to look it up after the assembly.

These eventualities share just a few widespread traits: they contain real-time info wants the place batch processing is simply too sluggish, text-based PDF paperwork with customary formatting, price sensitivity in growth and proof of idea environments, and integration necessities with present AWS workflows and tooling.

Amazon Textract is a completely managed AWS AI service purpose-built for doc processing at scale. It handles scanned pages, handwriting, and multi-column layouts. Select Amazon Textract whenever you want OCR for scanned paperwork, superior type and desk extraction, complicated format evaluation, production-scale batch processing with service degree settlement (SLA) necessities, or compliance options and enterprise help.

The MCP-based method addresses a complementary situation: giving an AI assistant interactive, on-demand entry to textual content already encoded inside PDFs. Select this sample when your paperwork are text-based PDFs (no OCR required), your workflow is interactive quite than batch, you’re working in growth or proof of idea environments, and also you need minimal infrastructure between the AI assistant and the supply doc. For every part else, together with any doc processing that advantages from OCR or structured extraction, route the work to Amazon Textract.

How the answer works

With this answer, you join your AI assistant on to your PDF paperwork in Amazon S3 and might get solutions rapidly. Underneath the hood, the answer makes use of the Mannequin Context Protocol (MCP), an open customary that gives a structured strategy to entry exterior knowledge sources. MCP acts as a communication layer between your software and your knowledge. The structure has 4 elements: a command-line interface because the person interface, the MCP layer for communication, a customized MCP server for PDF processing, and Amazon S3 for doc storage, secured by AWS Identification and Entry Administration (AWS IAM).

Price comparability

Select the method that matches your price range and necessities. For about 10,000 text-based PDF pages per thirty days in a proof of idea setting, right here is how the 2 approaches examine:

These two figures are value factors for various characteristic units and shouldn’t be learn as a head-to-head value comparability. Use them to select the precise software for the workload, to not optimize purely on {dollars}. In case your workload includes scanned paperwork, varieties, tables, complicated layouts, or manufacturing SLAs, Amazon Textract is the suitable selection and the extra capabilities are mirrored in its value.

Amazon Textract scope: page-level processing, OCR-ready, type and desk extraction, format understanding, enterprise SLAs

Indicative month-to-month price: Amazon Textract processing roughly $15, Amazon S3 storage $2, AWS Lambda compute $1, and enormous language mannequin (LLM) token processing roughly $5 to $10, for a complete of roughly $23 to $28.

MCP server scope: direct textual content extraction from PDFs whose textual content is already encoded; no managed processing service concerned

Indicative month-to-month price: Amazon S3 storage $2 and knowledge switch $0.50, for a complete of roughly $2.50.

All price figures are illustrative and should change. Discuss with the official AWS pricing pages for present charges.

Structure overview

The next sequence diagram illustrates the end-to-end workflow for extracting textual content from a PDF saved in Amazon S3. The method begins when the AI consumer initiates a request for PDF extraction via the CLI. The system forwards this request to the MCP server, which retrieves the PDF file from Amazon S3 utilizing the offered bucket and object key.

After the MCP server fetches the PDF, it passes the file to a PDF parsing part. The part processes the doc and extracts the textual content material. The MCP server then returns the extracted textual content to the consumer, and the consumer shows it to the person.

Step-by-step implementation

Comply with these steps to arrange and configure the PDF textual content extraction answer. Start by confirming you’ve the required conditions in place.

Stipulations

Earlier than you start, affirm that you’ve the next gadgets prepared. You’ll additionally want fundamental familiarity with Python programming and AWS providers.

An AWS account with Amazon S3 learn permissions.
Python 3.10 or later put in.
AWS Command Line Interface (AWS CLI) configured with legitimate credentials.
Kiro CLI put in.
```
pip set up boto3 PyPDF2 mcp
```

Set up

This part guides you thru putting in the MCP server and its dependencies. The method includes making a Python digital setting, putting in the required packages, and creating the server file. Comply with these steps so as. Run every command in your terminal.

Earlier than you begin, you want:

Python 3.10 or newer put in in your machine.
The Kiro CLI put in and logged in.
AWS credentials arrange in your machine (run aws configure if you happen to haven’t).
An S3 bucket that incorporates at the very least one PDF file.

Step 1 — Create a folder for the mission

Run these two instructions in your terminal:

Step 2 — Navigate to the mission folder

Run this command:

Step 3 — Create a Python digital setting

Run this command:

Step 4 — Activate the digital setting

Run this command:

After this, your terminal immediate will present (venv) at the beginning. Preserve this terminal open. It’s good to keep on this digital setting for the subsequent steps.

Step 5 — Set up the required Python packages

Run this one command:

pip set up mcp boto3 PyPDF2

Anticipate it to complete. It ought to finish with “Efficiently put in…”.

Step 6 — Create the server file

Contained in the ~/s3-pdf-extractor folder, create a brand new file named precisely:

Paste the next code into that file and reserve it:

Step 7 — Take a look at that the server begins

In your terminal (nonetheless contained in the s3-pdf-extractor folder with the venv lively), run:

python s3_pdf_extractor.py

The terminal will seem to “pause” with no output. That’s appropriate. It means the server is operating and ready for requests. Press Ctrl+C to cease it.

When you see an error as a substitute, re-check Steps 2 and three.

from mcp.server import Server
from mcp.varieties import Device, TextContent
import boto3
from PyPDF2 import PdfReader
import tempfile
import os
import logging

# Configure logging for manufacturing use
logging.basicConfig(degree=logging.INFO)
logger = logging.getLogger(__name__)

server = Server("s3-pdf-extractor")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="extract_s3_pdf_text",
            description="Extract text content from a PDF stored in Amazon S3",
            inputSchema={
                "type": "object",
                "properties": {
                    "bucket": {"type": "string", "description": "S3 bucket name"},
                    "key": {"type": "string", "description": "S3 object key"}
                },
                "required": ["bucket", "key"]
            }
        )
    ]

@server.call_tool()
async def call_tool(identify: str, arguments: dict):
    if identify == "extract_s3_pdf_text":
        bucket = arguments["bucket"]
        key = arguments["key"]

        strive:
            # Use present AWS credentials and IAM permissions
            s3_client = boto3.consumer('s3')

            with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
                s3_client.download_file(bucket, key, tmp_file.identify)
                tmp_path = tmp_file.identify

            # Extract textual content utilizing PyPDF2
            reader = PdfReader(tmp_path)
            textual content = ""
            for web page in reader.pages:
                textual content += web page.extract_text() + "n"

            logger.information(f"Efficiently extracted textual content from {bucket}/{key}")
            return [TextContent(type="text", text=text)]

        besides Exception as e:
            logger.error(f"Error processing {bucket}/{key}: {str(e)}")
            increase
        lastly:
            # Guarantee cleanup of short-term recordsdata
            if 'tmp_path' in locals():
                os.unlink(tmp_path)

if __name__ == "__main__":
    server.run()

Step 8 — Find or create the Kiro CLI configuration file

Kiro CLI makes use of a JSON configuration file to know which MCP servers can be found. It’s good to add your server to this file.

The Kiro CLI MCP configuration file is positioned at:

~/.kiro/settings/instruments/mcp.json

If this file doesn’t exist, create it by operating these instructions in your terminal:

mkdir -p ~/.kiro/settings/instruments
nano ~/.kiro/settings/instruments/mcp.json

Step 9 — Add the MCP server configuration

Paste the next JSON into the file. Substitute /path/to/s3_pdf_extractor.py with the precise path from Step 1 (for instance, ~/s3-pdf-extractor/s3_pdf_extractor.py):

{
    "mcpServers": {
        "s3-pdf-extractor": {
            "command": "python",
            "args": ["/path/to/s3_pdf_extractor.py"]
        }
    }
}

To get the total absolute path, run echo ~/s3-pdf-extractor/s3_pdf_extractor.py in your terminal and use that output within the args area.

Step 10 — Save the configuration file

Press Ctrl+O, then press Enter to avoid wasting the file.

Step 11 — Shut the file editor

Press Ctrl+X to exit nano.

Step 12 — Restart Kiro CLI

Restart Kiro CLI to load the brand new configuration. Shut and reopen Kiro CLI, or run:

Step 13 — Confirm the MCP server connection

Confirm the connection by operating a check extraction in Kiro CLI:

extract textual content from s3://your-bucket-name/pattern.pdf

Safety concerns

Safety is built-in from the start, not added as an afterthought. Right here is how the answer handles it:

IAM integration: The answer makes use of your present AWS credentials. You don’t want to create or handle separate API keys.
Least privilege entry: You grant solely Amazon S3 learn permissions, scoped to the particular buckets that include your PDF paperwork. Nothing extra.
Short-term storage: The server deletes downloaded recordsdata routinely after it completes processing. No PDF knowledge lingers on the native file system.
No knowledge persistence: Textual content extraction happens on demand with out storing outcomes.
Audit path: AWS CloudTrail logs Amazon S3 entry requests in your account.

Efficiency and limitations

Right here is what to anticipate by way of efficiency:

The server processes paperwork in actual time. For a typical 50-page text-based PDF, outcomes are usually obtainable in just a few seconds, making it sensible for interactive workflows the place you ask follow-up questions.
Processing time scales linearly with doc dimension. A ten-page doc processes roughly 5 instances quicker than a 50-page one.
Reminiscence utilization is proportional to doc dimension. For many text-based PDFs beneath 100 pages, reminiscence consumption stays properly inside typical growth machine limits.

This method has clear limits. Know them earlier than you commit:

Textual content-based PDFs solely. In case your paperwork are scanned photos or images of paper, the server can’t learn them. Amazon Textract handles these circumstances natively with OCR.
No OCR functionality. The server reads embedded textual content from the PDF file format. It can’t interpret pixels in a picture.
Restricted format understanding. The server performs simple textual content extraction. It doesn’t reconstruct tables, columns, or complicated web page layouts. Amazon Textract handles this natively.
No type processing. In case your PDFs include fillable type fields or structured knowledge, the server doesn’t extract these components. Amazon Textract handles this natively.

Actual-world use circumstances

These capabilities translate instantly into measurable outcomes throughout industries. Whether or not it’s authorized groups retrieving contract clauses mid-call, compliance officers finding coverage language throughout audits, or executives pulling earnings knowledge in actual time, the answer removes the friction of guide doc search. The next examples present how completely different groups put it to work.

Authorized providers agency

A mid-sized authorized agency adopted this answer for contract evaluate. Their attorneys used to spend 15 to twenty minutes looking out via PDF contracts to search out particular indemnification clauses throughout consumer calls. That meant placing the consumer on maintain or promising to name again later. Now they kind a query into Kiro CLI and get the related passage in seconds. The agency reviews that analysis time throughout consumer calls was considerably lowered.

Monetary providers compliance

A regional financial institution deployed the answer for regulatory examinations. Throughout audits, compliance officers have to find particular coverage language rapidly. Beforehand, they bookmarked key sections manually throughout dozens of PDF recordsdata, which was error-prone and laborious to keep up as insurance policies modified. With the MCP server linked to their S3 doc repository, they now pull up the precise paragraph an examiner asks about in actual time.

Company technique staff

An enterprise management staff makes use of the answer throughout quarterly technique conferences. When a board member asks a few particular metric from the earlier quarter’s earnings report, the staff queries the PDF on the spot as a substitute of flipping via printed copies. This retains discussions shifting and grounded in precise knowledge.

Scaling and enhancement choices

This answer is a place to begin. As your wants develop, you’ll be able to prolong it. Begin with caching in case your staff accesses the identical paperwork repeatedly. Take into account batch processing when you might want to deal with lots of of paperwork directly. Add vector search when key phrase matching is now not adequate.

Particularly, you’ll be able to prolong the answer in these methods:

Add caching with Amazon DynamoDB for regularly accessed paperwork.
Implement batch processing with Amazon Easy Queue Service (Amazon SQS) for bulk operations.
Combine vector search with Amazon OpenSearch Service for semantic doc discovery.
Create hybrid workflows that route complicated paperwork to Amazon Textract routinely.
Add monitoring with Amazon CloudWatch to trace utilization patterns and error charges.

Cleanup

If you’re performed testing or wish to take away the answer, observe these steps to keep away from pointless prices.

Cease the MCP ServerPress Ctrl+C within the terminal the place the server is operating.
Take away the MCP ConfigurationOpen your Kiro CLI MCP configuration file (~/.kiro/settings/instruments/mcp.json) and delete the s3-pdf-extractor entry. Save and shut the file.
Delete the mission recordsdataTake away the mission listing and all its contents:
```
rm -rf ~/s3-pdf-extractor
```
Warning: This command completely deletes all recordsdata within the listing with out affirmation. Ensure you have saved any modifications earlier than continuing.
Clear up S3 assets (non-compulsory)When you created check PDFs in Amazon S3 particularly for this walkthrough, delete the check recordsdata or the check bucket utilizing the Amazon S3 console or the AWS CLI:
```
aws s3 rm s3://your-bucket-name/test-file.pdf
```
Solely delete assets you created for testing.
Overview IAM permissions (non-compulsory)Navigate to the IAM console and take away any S3 learn permissions added particularly for this answer. Preserve permissions that different workflows rely upon.
Confirm cleanupVerify the listing now not exists:
Anticipated output: No such file or listing

After cleanup, you’ll now not incur S3 storage and knowledge switch prices for the assets you deleted. For detailed pricing info, see Amazon S3 Pricing. If you wish to redeploy later, repeat the set up steps. All code and configuration examples stay on this doc.

Conclusion

On this put up, you constructed an MCP server that extracts textual content from PDF recordsdata in Amazon S3 in actual time. You walked via the structure, in contrast prices with Amazon Textract, and noticed how 3 completely different groups put this method to work. The sample follows a transparent method: join your AI assistant to your paperwork, hold the infrastructure minimal, and scale up solely when the workload calls for it.

In abstract, the MCP server sample is a targeted, interactive complement to Amazon Textract. Use it when an AI assistant must learn text-based PDFs in actual time. When your wants embody OCR, varieties, tables, or production-scale processing, Amazon Textract is the AWS service designed for that work, and the 2 approaches match cleanly collectively. That is precisely the sample proven within the hybrid workflow choice earlier on this put up.

Subsequent steps:

Consider your use case towards the standards within the “The place this method matches alongside Amazon Textract” part.
Deploy the answer in your growth setting by following the Set up part on this put up. Take a look at with 5 to 10 consultant paperwork to ascertain baseline efficiency.
Discover Amazon Textract for OCR capabilities, or study extra about Kiro CLI integration as your necessities evolve.
When you do that answer or adapt it in your personal use case, we’d love to listen to about it within the feedback.

To study extra, discover the next assets:

Concerning the authors

Construct interactive PDF textual content extraction from Amazon S3

Constructing Browser-Utilizing AI Brokers in Python

Leave a Reply Cancel reply

Popular News

Greatest practices for Amazon SageMaker HyperPod activity governance

How Cursor Really Indexes Your Codebase

Context Engineering — A Complete Fingers-On Tutorial with DSPy

Construct a serverless audio summarization resolution with Amazon Bedrock and Whisper

Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

About Us

Category

Recent Posts