Introducing document-level sync reports: Enhanced data sync visibility in Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer.

Amazon Kendra securely connects to over 40 data sources. When using your data source, you might want better visibility into the document processing lifecycle during data source sync jobs. This could include knowing the status of each document you attempted to crawl and index, as well as being able to troubleshoot why certain documents were not returned with the expected answers. Additionally, you might need access to metadata, timestamps, and access control lists (ACLs) for the indexed documents.

We're pleased to announce a new feature now available in Amazon Kendra that significantly improves visibility into data source sync operations. The latest release introduces a comprehensive document-level report incorporated into the sync history, providing administrators with granular indexing status, metadata, and ACL details for every document processed during a data source sync job. This enhancement to sync job observability enables administrators to quickly investigate and resolve ingestion or access issues encountered while setting up Amazon Kendra indexes. The detailed document reports are persisted in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Kendra index log group, so critical sync job details are available on demand when troubleshooting.

In this post, we discuss the benefits of this new feature and how it offers enhanced data sync visibility in Amazon Kendra.

Lifecycle of a document in a data source sync run job

In this section, we examine the lifecycle of a document within a data source sync in Amazon Kendra. This provides valuable insight into the sync process. The data source sync comprises three key stages: crawling, syncing, and indexing. Crawling involves the connector connecting to the data source and extracting documents that meet the defined sync scope according to the data source configuration. These documents are then synced to the Amazon Kendra index during the syncing stage. Finally, indexing makes the synced documents searchable within the Amazon Kendra environment.

The following diagram shows a flowchart of a sync run job.

Crawling stage

The first stage is the crawling stage, where the connector crawls all documents and their metadata from the data source. During this stage, the connector also compares the checksum of the document against the Amazon Kendra index to determine if a particular document needs to be added, modified, or deleted from the index. This operation corresponds to the CrawlAction field in the sync run history report.

If the document is unmodified, it's marked as UNMODIFIED and skipped in the rest of the stages. If any document fails in the crawling stage, for example due to throttling errors, broken content, or because the document size is too big, that document is marked in the sync run history report with the CrawlStatus as FAILED. If the document was skipped due to any validation errors, its CrawlStatus is marked as SKIPPED. These documents are not sent to the next stage. All successful documents are marked as SUCCESS and are sent forward.

We also capture the ACLs and metadata on each document in this stage to be able to add them to the sync run history report.
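The checksum comparison described in this stage can be sketched in a few lines of Python. This is a minimal illustration and not the connector's actual implementation. The function name and the ADD, MODIFY, and DELETE labels are hypothetical, because the post only names UNMODIFIED explicitly.

```python
from enum import Enum
from typing import Optional


class CrawlAction(Enum):
    # Only UNMODIFIED is named in the post; the other labels are hypothetical.
    ADD = "ADD"
    MODIFY = "MODIFY"
    DELETE = "DELETE"
    UNMODIFIED = "UNMODIFIED"


def decide_crawl_action(source_checksum: Optional[str],
                        indexed_checksum: Optional[str]) -> CrawlAction:
    """Compare a document's checksum at the source with the one in the index.

    None means the document does not exist on that side.
    """
    if source_checksum is None and indexed_checksum is None:
        raise ValueError("document is absent from both the source and the index")
    if indexed_checksum is None:
        return CrawlAction.ADD         # new at the data source
    if source_checksum is None:
        return CrawlAction.DELETE      # removed from the data source
    if source_checksum == indexed_checksum:
        return CrawlAction.UNMODIFIED  # skipped in the remaining stages
    return CrawlAction.MODIFY          # content changed since the last sync


print(decide_crawl_action("abc123", "abc123").value)
```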

Syncing stage

During the syncing stage, the document is sent to Amazon Kendra ingestion service APIs like BatchPutDocument and BatchDeleteDocument. After a document is submitted to these APIs, Amazon Kendra runs validation checks on the submitted documents. If any document fails these checks, its SyncStatus is marked as FAILED. If there is an irrecoverable error for a particular document, it's marked as SKIPPED and the other documents are sent forward.

Indexing stage

In this step, Amazon Kendra parses the document, processes it according to its content type, and persists it in the index. If the document fails to be persisted, its IndexStatus is marked as FAILED; otherwise, it's marked as SUCCESS.

After the statuses of all the stages have been captured, we emit these statuses as an Amazon CloudWatch event to the customer's AWS account.

Key features and benefits of document-level reports

The following are the key features and benefits of the new document-level report in Amazon Kendra indexes:

Enhanced sync run history page – A new Actions column has been added to the sync run history page, providing access to the document-level report for each sync run.

Dedicated log stream – A new log stream named SYNC_RUN_HISTORY_REPORT has been created in the Amazon Kendra CloudWatch log group, containing the document-level report.

Comprehensive document information – The document-level report includes the following information for each document:
Document ID – This is the document ID that is inherited directly from the data source or mapped by the customer in the data source field mappings.
Document title – The title of the document is taken from the data source or mapped by the customer in the data source field mappings.
Consolidated document status (SUCCESS, FAILED, or SKIPPED) – This is the final consolidated status of the document. If the document was successfully processed in all stages, the value is SUCCESS. If the document failed or was skipped in any of the stages, the value of this field is FAILED or SKIPPED, respectively.
Error message (if the document failed) – This field contains the error message with which a document failed. If a document was skipped due to throttling errors or any internal errors, this is shown in the error message field.
Crawl status – This field denotes whether the document was crawled successfully from the data source. This status correlates to the syncing-crawling state in the data source sync.
Sync status – This field denotes whether the document was sent for syncing successfully. This correlates to the syncing-indexing state in the data source sync.
Index status – This field denotes whether the document was successfully persisted in the index.
ACLs – This field contains a list of document-level permissions that were crawled from the data source. The details of each element in the list are:
- Global name – This is the email or user name of the user. This field is mapped across multiple data sources. For example, if a user has three data sources, Confluence, SharePoint, and Gmail, with the local user IDs confluence_user, sharepoint_user, and gmail_user respectively, and their email address user@email.com is the globalName in the ACL for all of them, then Amazon Kendra understands that all of these local user IDs map to the same global name.
- Name – This is the local unique ID of the user, which is assigned by the data source.
- Type – This field indicates the principal type. This can be either USER or GROUP.
- Is Federated – This is a boolean flag that indicates whether the group is of INDEX level (true) or DATASOURCE level (false).
- Access – This field indicates whether the user has access allowed or denied explicitly. Values can be either ALLOWED or DENIED.
- Data source ID – This is the data source ID. For federated groups (INDEX level), this field is null.
Metadata – This field contains the metadata fields (other than ACL) that were pulled from the data source. This list also includes the metadata fields mapped by the customer in the data source field mappings as well as additional metadata fields added by the connector.
Hashed document ID (for troubleshooting assistance) – To safeguard your data privacy, we present a secure, one-way hash of the document identifier. This hashed value enables the Amazon Kendra team to efficiently locate and analyze the specific document within our logs, should you encounter any issue that requires further investigation and resolution.
Timestamp – The timestamp indicates when the document status was logged in CloudWatch.
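To make the consolidated status rule concrete, the following Python sketch combines the three per-stage statuses into one value. It is an illustration under stated assumptions, not the service's implementation: the function name is hypothetical, FAILED is assumed to take precedence over SKIPPED when both appear, and UNMODIFIED documents are left out because the post does not say how they are consolidated.

```python
from typing import Optional


def consolidate_status(crawl: Optional[str],
                       sync: Optional[str],
                       index: Optional[str]) -> str:
    """Combine per-stage statuses into one consolidated document status.

    A stage the document never reached is passed as None.
    """
    stages = [crawl, sync, index]
    if all(status == "SUCCESS" for status in stages):
        return "SUCCESS"  # processed successfully in every stage
    if "FAILED" in stages:
        return "FAILED"   # assumption: a failure outranks a skip
    if "SKIPPED" in stages:
        return "SKIPPED"
    raise ValueError(f"cannot consolidate statuses: {stages}")


print(consolidate_status("SUCCESS", "SUCCESS", "SUCCESS"))
print(consolidate_status("SUCCESS", "SKIPPED", None))
```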

In the following sections, we explore different use cases for the logging feature.

Determine the optimal boosting duration for recent documents using document-level reporting

When it comes to generating accurate answers, you may want to fine-tune the way Amazon Kendra prioritizes its content. For instance, you may prefer to boost recent documents over older ones to make sure the most up-to-date passages are used to generate an answer. To achieve this, you can use the relevance tuning feature in Amazon Kendra to boost documents based on the last update date attribute, with a specified boosting duration. However, determining the optimal boosting period can be challenging when dealing with a large number of frequently changing documents.

You can now use the per-document-level report to obtain the _last_updated_at metadata field information for your documents, which can help you determine the appropriate boosting period. For this, you use the following CloudWatch Logs Insights query to retrieve the _last_updated_at metadata attribute for machine learning documents from the SYNC_RUN_HISTORY_REPORT log stream.

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Metadata like 'Machine Learning'
| parse Metadata '{"key":"_last_updated_at","value":{"dateValue":"*"}}' as @last_updated_at
| sort @last_updated_at desc, @timestamp desc
| dedup DocumentTitle

With the preceding query, you can gain insights into the last updated timestamps of your documents, enabling you to make informed decisions about the optimal boosting period. This approach makes sure your chat responses are generated using the latest and most relevant information, enhancing the overall accuracy and effectiveness of your Amazon Kendra implementation.

The following screenshot shows an example result.
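Once you have the query results, you might summarize document ages offline to inform the boosting period. The following Python sketch is one possible approach, not a prescribed method. It assumes the Metadata field is a JSON array of objects shaped like the pattern in the parse statement (key, value, dateValue) and that dateValue is an ISO 8601 timestamp. The function names and sample records are hypothetical.

```python
import json
from datetime import datetime, timezone
from statistics import median
from typing import List, Optional


def last_updated_at(metadata_json: str) -> Optional[datetime]:
    """Extract _last_updated_at from a Metadata JSON string, if present."""
    for item in json.loads(metadata_json):
        if item.get("key") == "_last_updated_at":
            raw = item.get("value", {}).get("dateValue")
            if raw:
                # Assumption: ISO 8601 text; a trailing "Z" is normalized to an offset.
                parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
                if parsed.tzinfo is None:
                    parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC
                return parsed
    return None


def median_age_days(metadata_records: List[str], now: datetime) -> Optional[float]:
    """Median document age in days across records that carry the attribute."""
    ages = []
    for record in metadata_records:
        updated = last_updated_at(record)
        if updated is not None:
            ages.append((now - updated).total_seconds() / 86400)
    return median(ages) if ages else None


# Hypothetical sample records for illustration only.
records = [
    '[{"key":"_last_updated_at","value":{"dateValue":"2024-09-01T00:00:00Z"}}]',
    '[{"key":"_last_updated_at","value":{"dateValue":"2024-09-11T00:00:00Z"}}]',
]
now = datetime(2024, 9, 21, tzinfo=timezone.utc)
print(median_age_days(records, now))
```

A median (or another percentile) of document age gives a rough sense of how far back a boosting duration would need to reach to cover most recently updated documents.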

Common document indexing observability and troubleshooting methods

In this section, we explore some common admin tasks for observing and troubleshooting document indexing using the new document-level reporting feature.

List all successfully indexed documents from a data source

To retrieve a list of all documents that have been successfully indexed from a specific data source, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort @timestamp desc | dedup DocumentTitle, DocumentId

The following screenshot shows an example result.

List all successfully indexed documents from a data source sync job

To retrieve a list of all documents that have been successfully indexed during a specific sync job, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort DocumentTitle

The following screenshot shows an example result.

List all failed indexed documents from a data source sync job

To retrieve a list of all documents that failed to index during a specific sync job, along with the error messages, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "FAILED"
| sort @timestamp desc

The following screenshot shows an example result.
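Because the success and failure queries differ only in the data source ID, the run ID, and the status value, you might generate them from a small helper. The following Python sketch builds the query text only and does not call any AWS API. The function name is hypothetical, and the IDs are inserted without escaping, so pass only trusted values.

```python
VALID_STATUSES = {"SUCCESS", "FAILED", "SKIPPED"}


def build_status_query(data_source_id: str, run_id: str, status: str) -> str:
    """Return a CloudWatch Logs Insights query for one sync run and status."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    # Log stream prefix as used by the queries in this post.
    stream_prefix = f"SYNC_RUN_HISTORY_REPORT/{data_source_id}/{run_id}"
    return "\n".join([
        "fields DocumentTitle, DocumentId, "
        "ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp",
        f"| filter @logStream like '{stream_prefix}'",
        f'and ConnectorDocumentStatus.Status = "{status}"',
        "| sort @timestamp desc",
    ])


# Placeholder IDs for illustration only.
print(build_status_query("your-data-source-id", "run-id", "FAILED"))
```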

List all documents that contain a user's ACL permission from an Amazon Kendra index

To retrieve a list of documents that have a specific user's ACL permission, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Acl like 'aneesh@mydemoaws.onmicrosoft.com'
| display DocumentTitle, SourceUri

The following screenshot shows an example result.

List the ACL of an indexed document from a data source sync job

To retrieve the ACL information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Acl

The following screenshot shows an example result.
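After retrieving a document's ACL, you might want to check a particular user's entries programmatically. The following Python sketch is illustrative only. Apart from globalName, which the post mentions, the dictionary keys (name, type, access) are assumed names for the ACL element fields described earlier and may not match the report's actual JSON. Treating an explicit DENIED entry as overriding ALLOWED is also an assumption.

```python
from typing import Dict, List


def entries_for_user(acl: List[Dict], global_name: str) -> List[Dict]:
    """Return the USER entries in an ACL that match a global name."""
    return [
        entry for entry in acl
        if entry.get("type") == "USER" and entry.get("globalName") == global_name
    ]


def is_explicitly_allowed(acl: List[Dict], global_name: str) -> bool:
    """True if the user has an ALLOWED entry and no DENIED entry."""
    entries = entries_for_user(acl, global_name)
    if any(entry.get("access") == "DENIED" for entry in entries):
        return False  # assumption: an explicit deny wins
    return any(entry.get("access") == "ALLOWED" for entry in entries)


# Hypothetical ACL for illustration; key names are assumptions.
sample_acl = [
    {"globalName": "user@email.com", "name": "sharepoint_user",
     "type": "USER", "access": "ALLOWED"},
    {"globalName": "other@email.com", "name": "other_user",
     "type": "USER", "access": "DENIED"},
]
print(is_explicitly_allowed(sample_acl, "user@email.com"))
```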

List metadata of an indexed document from a data source sync job

To retrieve the metadata information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Metadata

The following screenshot shows an example result.

Conclusion

The newly launched document-level report in Amazon Kendra provides enhanced visibility and observability into the document processing lifecycle during data source sync jobs. This feature addresses a critical need expressed by customers for better troubleshooting capabilities and access to detailed information about the indexing status, metadata, and ACLs of individual documents.

The document-level report is stored in a log stream named SYNC_RUN_HISTORY_REPORT within the Amazon Kendra index CloudWatch log group. This report contains comprehensive information for each document, including the document ID, title, overall document sync status, and error messages (if any), along with the ACLs and metadata retrieved from the data sources. The data source sync run history page now includes an Actions column, providing access to the document-level report for each sync run. This feature significantly improves the ability to troubleshoot issues related to document ingestion, access control, and metadata relevance, and provides better visibility into the documents synced with an Amazon Kendra index.

To get started with Amazon Kendra, explore the Getting started guide. To learn more about data source connectors and best practices, see Creating a data source connector.

About the Authors

Aneesh Mohan is a Senior Solutions Architect at Amazon Web Services (AWS), with over 20 years of experience in architecting and delivering high-impact solutions for mission-critical workloads. His expertise spans the financial services industry, AI/ML, security, and data technologies. Driven by a deep passion for technology, Aneesh is dedicated to partnering with customers to design and implement well-architected, innovative solutions that address their unique business needs.

Ashwin Shukla is a Software Development Engineer II on the Amazon Q for Business and Amazon Kendra engineering team, with 6 years of experience in developing enterprise software. In this role, he works on designing and developing foundational features for Amazon Q for Business.
