In my previous article [1], I introduced quite a lot of concepts around constructing structured graphs, primarily focused on descriptive or unsupervised exploration of knowledge through graph construction. However, when we use graph features to enhance our models, the temporal nature of the data must be taken into account. If we want to avoid undesired effects, we need to be careful not to leak future information into our training process. This means our graph (and the features derived from it) must be built in a time-aware, incremental manner.
Data leakage is such a pervasive problem that a 2023 study by Sayash Kapoor and Arvind Narayanan [2] found that, up to that moment, it had affected 294 research papers across 17 scientific fields. They classify the types of data leakage, ranging from textbook errors to open research problems.
The problem is that during prototyping, results often seem very promising when they really are not. Most of the time, people don't realize this until models are deployed in production, wasting the time and resources of an entire team. Then, performance usually falls short of expectations without anyone understanding why. This issue can become the Achilles' heel that undermines entire business AI initiatives.
…
ML-based leakage
Data leakage occurs when the training data contains information about the output that won't be available during inference. This causes overly optimistic evaluation metrics during development, creating misleading expectations. However, when deployed in real-time systems with the correct data flow, the model's predictions become untrustworthy because it learned from information that is no longer available.
Ethically, we should strive to produce results that truly reflect the capabilities of our models, rather than sensational or misleading findings. When a model moves from prototyping to production, it should not fail to generalize properly; if it does, its practical value is undermined. Models that fail to generalize well can exhibit significant problems during inference or deployment, compromising their usefulness.
This is especially dangerous in sensitive contexts like fraud detection, which often involve imbalanced data scenarios (with fewer fraud cases than non-fraud). In these situations, the harm caused by data leakage is more pronounced because the model might overfit to leaked information related to the minority class, producing seemingly good results for the minority label, which is the hardest to predict. This can lead to missed fraud detections, resulting in serious practical consequences.
Data leakage examples can be categorized into textbook errors and open research problems [2] as follows:
Textbook Errors:
- Imputing missing values using the entire dataset instead of only the training set, causing information about the test data to leak into training.
- Duplicated or very similar instances appearing in both training and test sets, such as images of the same object taken from slightly different angles.
- Lack of clear separation between training and test datasets, or no test set at all, leading to models having access to test information before evaluation.
- Using proxies of outcome variables that indirectly reveal the target variable.
- Random data splitting in scenarios where multiple related records belong to a single entity, such as multiple claim status events from the same customer.
- Synthetic data augmentation performed over the whole dataset, instead of only on the training set.
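To make the first textbook error concrete, here is a minimal, hypothetical sketch using scikit-learn's SimpleImputer; the tiny array and the train/test cut are invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: first 3 rows are the "training" split, last 2 the "test" split
X = np.array([[1.0], [2.0], [np.nan], [100.0], [200.0]])
X_train, X_test = X[:3], X[3:]

# Leaky: the imputer sees the test rows, so their statistics
# contaminate the value filled into the training set
leaky = SimpleImputer(strategy="mean").fit(X)

# Leak-free: statistics come from the training split only
safe = SimpleImputer(strategy="mean").fit(X_train)

print(leaky.statistics_[0])  # 75.75 (pulled up by the test rows)
print(safe.statistics_[0])   # 1.5  (training rows only)
```

The leaky imputer fills the training gap with a value it could only know by peeking at the test set, exactly the pattern described above.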
Open problems for research:
- Temporal leakage occurs when future data unintentionally influences training. In such cases, strict separation is challenging because timestamps can be noisy or incomplete.
- Updating database records without lineage or an audit trail (for example, changing a fraud status without storing its history) can cause models to train on future or altered data unintentionally.
- Complex real-world data integration and pipeline issues that introduce leakage through misconfiguration or lack of controls.
These cases are part of a broader taxonomy reported in machine learning research, highlighting data leakage as a critical and often underinvestigated risk for reliable modeling [3]. Such issues arise even with simple tabular data, and they can remain hidden when working with many features if each one is not individually checked.
Now, let's consider what happens when we include nodes and edges in the equation…
…
Graph-based leakage
In the case of graph-based models, leakage can be sneakier than in traditional tabular settings. When features are derived from connected components or topological structures, using future nodes or edges can silently alter the graph's structure. For example:
- Methodologies such as graph neural networks (GNNs) learn context not only from individual nodes but also from their neighbours, which can inadvertently introduce leakage if sensitive or future information is propagated across the graph structure during training.
- Overwriting or updating the graph structure without preserving past events means the model loses valuable context needed for proper temporal analysis; it may again access information at the wrong time, or lose traceability about possible leakage or problems with the data from which the graphs originate.
- Computing graph aggregations like degree, triangles, or PageRank on the entire graph without accounting for the temporal dimension (time-agnostic aggregation) uses all edges: past, present, and future. This causes data leakage because features include information from future edges that wouldn't be available at prediction time.
Graph temporal leakage occurs when features, edges, or node relationships from future time points are included during training in a way that violates the chronological order of events. This results in edges or training features that incorporate data from time steps that should be unknown.
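A minimal NetworkX sketch of the time-agnostic aggregation problem; the node names, timestamps, and cutoff are invented for illustration:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("claim_1", "claim_2", timestamp=1)
G.add_edge("claim_1", "claim_3", timestamp=5)  # future edge w.r.t. cutoff t=2

t = 2
# Time-agnostic (leaky): counts every edge, including future ones
leaky_degree = G.out_degree("claim_1")

# Time-aware: rebuild the graph from edges visible at time t only
H = nx.DiGraph()
H.add_edges_from((u, v, d) for u, v, d in G.edges(data=True)
                 if d["timestamp"] <= t)
safe_degree = H.out_degree("claim_1")

print(leaky_degree, safe_degree)  # 2 1
```

The degree feature computed on the full graph already "knows" about an edge that does not exist yet at prediction time.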
…
How can this be fixed?
We can build a single graph that captures the entire history by assigning timestamps or time intervals to edges. To analyze the graph up to a specific point in time (t), we "look back in time" by filtering the graph to include only the events that occurred at or before that cutoff. This approach is ideal for preventing data leakage because it ensures that only past and present information is used for modeling. Additionally, it offers flexibility in defining different time windows for safe and proper temporal analysis.
In this article, we build a temporal graph of insurance claims where the nodes represent individual claims, and temporal links are created when two claims share an entity (e.g., phone number, license plate, repair shop, etc.) to ensure the correct event order. Graph-based features are then computed to feed fraud prediction models, carefully avoiding the use of future information (no peeking).
The idea is simple: if two claims share a common entity and one occurs before the other, we connect them at the moment this connection becomes visible (Figure 1). As we explained in the previous section, the way we model the data is crucial, not only to capture what we are actually looking for, but also to enable the use of advanced methods such as Graph Neural Networks (GNNs).

In our graph model, we save the timestamp when an entity is first seen, capturing the moment it appears in the data. However, in many real-world scenarios, it is also useful to consider a time interval spanning the entity's first and last appearances (for example, generated with another variable like plate or email). This interval can provide richer temporal context, reflecting the lifespan or active period of nodes and edges, which is valuable for dynamic temporal graph analyses and advanced model training.
Code
The code is available in this repository: Link to the repository
To run the experiments, set up a Python ≥3.11 environment with the required libraries (e.g., torch, torch-geometric, networkx, etc.). It is recommended to use a virtual environment (via venv or conda) to keep dependencies isolated.
Code Pipeline
The diagram in Figure 2 shows the end-to-end workflow for fraud detection with GraphSAGE. Step 1 loads the (simulated) raw claims data. Step 2 builds a time-stamped directed graph (entity→claim and older-claim→newer-claim). Step 3 performs temporal slicing to create train, validation, and test sets, then indexes nodes, builds features, and finally trains and validates the model.

Data

Step 1: Simulated Fraud Dataset
We first simulate a dataset of insurance claims. Each row in the dataset represents a claim and includes variables such as:
- Entities: insurer_license_plate, insurer_phone_number, insurer_email, insurer_address, repair_shop, bank_account, claim_location, third_party_license_plate
- Core information: claim_id, claim_date, type_of_claim, insurer_id, insurer_name
- Target: fraud (binary variable indicating whether the claim is fraudulent or not)
These entity attributes act as potential links between claims, allowing us to infer connections through shared values (e.g., two claims using the same repair shop or phone number). By modeling these implicit relationships as edges in a graph, we can build powerful topological representations that capture suspicious behavioral patterns and enable downstream tasks such as feature engineering or graph-based learning.
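As a hypothetical sketch of how such shared-value links could be inferred with pandas (column names follow the simulated schema above; the three rows are invented):

```python
import pandas as pd

claims = pd.DataFrame({
    "claim_id": ["c1", "c2", "c3"],
    "claim_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "repair_shop": ["SHOP_1", "SHOP_1", "SHOP_2"],
    "insurer_phone_number": ["111", "222", "222"],
})

entity_cols = ["repair_shop", "insurer_phone_number"]
links = []
for col in entity_cols:
    for value, grp in claims.groupby(col):
        ids = grp.sort_values("claim_date")["claim_id"].tolist()
        # connect consecutive claims sharing this entity, older -> newer
        links += [(a, b, col, value) for a, b in zip(ids, ids[1:])]

print(links)
# [('c1', 'c2', 'repair_shop', 'SHOP_1'),
#  ('c2', 'c3', 'insurer_phone_number', '222')]
```

Sorting by claim_date before pairing is what keeps every inferred link pointing from past to future.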


Step 2: Graph Modeling
We use the NetworkX library to build our graph model. For small-scale examples, NetworkX is sufficient and effective. For more advanced graph processing, tools like Memgraph or Neo4j could be used. To model with NetworkX, we create nodes and edges representing entities and their relationships, enabling network analysis and visualization within Python.
So, we have:
- one node per claim, with node key equal to the claim_id and attributes node_type and claim_date
- one node per entity value (phone, plate, bank account, shop, etc.), with node key "{column_name}:{value}" and attributes node_type (e.g., "insurer_phone_number", "bank_account", "repair_shop") and label (just the raw value without the prefix)

The graph includes these two types of edges:
- claim_id(t-1) → claim_id(t): when two claims share an entity (with edge_type='claim-claim')
- entity_value → claim_id: direct link to the shared entity (with edge_type='entity-claim')

These edges are annotated with:
- edge_type: to distinguish the relation (claim → claim vs entity → claim)
- entity_type: the column from which the value comes (like bank_account)
- shared_value: the actual value (like a phone number or license plate)
- timestamp: when the edge was added (based on the current claim's date)
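A simplified NetworkX sketch of this construction; the add_claim helper and its tiny inputs are hypothetical, but the node keys, edge types, and edge attributes follow the scheme just described:

```python
import networkx as nx

G = nx.DiGraph()

def add_claim(G, claim_id, claim_date, entities):
    """Add a claim node, its entity nodes, and time-stamped edges."""
    G.add_node(claim_id, node_type="claim", claim_date=claim_date)
    for col, value in entities.items():
        key = f"{col}:{value}"
        G.add_node(key, node_type=col, label=value)
        # older claims already attached to this entity -> new claim
        for _, older, d in list(G.out_edges(key, data=True)):
            if d["edge_type"] == "entity-claim" and older != claim_id:
                G.add_edge(older, claim_id, edge_type="claim-claim",
                           entity_type=col, shared_value=value,
                           timestamp=claim_date)
        # entity -> claim edge, visible at the claim's date
        G.add_edge(key, claim_id, edge_type="entity-claim",
                   entity_type=col, shared_value=value, timestamp=claim_date)

add_claim(G, "c1", 1, {"repair_shop": "SHOP_1"})
add_claim(G, "c2", 2, {"repair_shop": "SHOP_1"})

print(G.edges["c1", "c2"]["edge_type"])  # claim-claim
```

Because claim-claim edges are created only when the newer claim arrives, each edge's timestamp marks the moment the connection became visible.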
To interpret our simulation, we implemented a script that generates explanations for why a claim is flagged as fraud. In Figure 4, claim 20000695 is considered risky mainly because it is associated with repair shop SHOP_856, which acts as an active hub with multiple claims linked around similar dates, a pattern often seen in fraud "bursts." Additionally, this claim shares a license plate and address with several other claims, creating dense connections to other suspicious cases.

This code saves the graph as a pickle file: temporal_graph_with_edge_attrs.gpickle.
Step 3: Graph Preparation & Training
Representation learning transforms complex, high-dimensional data (like text, images, or sensor readings) into simplified, structured formats (often called embeddings) that capture meaningful patterns and relationships. These learned representations improve model performance, interpretability, and the ability to transfer learning across different tasks.
We train a neural network to map each input to a vector in ℝᵈ that encodes what matters. In our pipeline, GraphSAGE performs representation learning on the claim graph: it aggregates information from a node's neighbours (shared phones, shops, plates, etc.) and combines that with the node's own attributes to produce a node embedding. These embeddings are then fed to a small classifier head to predict fraud.
3.1 Temporal slicing
From the single full graph created in Step 2, we extract three time-sliced subgraphs for train, validation, and test. For each split we choose a cutoff date and keep only (1) claim nodes with claim_date ≤ cutoff, and (2) edges whose timestamp ≤ cutoff. This produces a time-consistent subgraph for that split: no information from the future leaks into the past, matching how the model would run in production with only historical data available.
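A minimal sketch of this slicing step, assuming node and edge attributes named claim_date and timestamp as described above (the toy graph contents are invented):

```python
import networkx as nx

def time_slice(G, cutoff):
    """Copy of G with only claims and edges visible at `cutoff`."""
    keep = [n for n, d in G.nodes(data=True)
            if d.get("node_type") != "claim" or d["claim_date"] <= cutoff]
    H = G.subgraph(keep).copy()
    H.remove_edges_from([(u, v) for u, v, d in H.edges(data=True)
                         if d["timestamp"] > cutoff])
    return H

G = nx.DiGraph()
G.add_node("c1", node_type="claim", claim_date=1)
G.add_node("c2", node_type="claim", claim_date=9)  # future claim
G.add_node("shop:S1", node_type="repair_shop")
G.add_edge("shop:S1", "c1", timestamp=1)
G.add_edge("shop:S1", "c2", timestamp=9)

train = time_slice(G, cutoff=5)
print(sorted(train.nodes()))  # ['c1', 'shop:S1']
```

Calling time_slice three times with the train, validation, and test cutoffs yields the three leak-free splits.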
3.2 Node indexing
Give every node in the sliced graph an integer index 0…N-1. This is just an ID mapping (like tokenization). We'll use these indices to align features, labels, and edges in tensors.
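The mapping itself is a one-liner; a sketch with illustrative node keys:

```python
import networkx as nx

H = nx.DiGraph()
H.add_nodes_from(["c1", "c2", "repair_shop:SHOP_1"])

# node key -> integer index 0..N-1 (NetworkX preserves insertion order)
idx = {n: i for i, n in enumerate(H.nodes())}
rev = {i: n for n, i in idx.items()}  # inverse map, to interpret predictions

print(idx)  # {'c1': 0, 'c2': 1, 'repair_shop:SHOP_1': 2}
```

Keeping the inverse map around makes it easy to translate tensor rows back into claim IDs at inference time.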
3.3 Build node features
Create one feature row per node:
- Type one-hot (claim, phone, email, …).
- Degree stats computed within the sliced graph: normalized in-degree, out-degree, and undirected degree.
- Prior fraud from older neighbors (claims only): the fraction of older connected claims (direct claim→claim predecessors) that are labeled fraud, considering only neighbors that existed before the current claim's time.

We also set the label y (1/0) for claims and 0 for entities, and mark claims in claim_mask so loss/metrics are computed only on claims.
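A condensed sketch of this feature matrix on a toy graph (the real pipeline has more node types and also an undirected-degree column, omitted here for brevity):

```python
import numpy as np
import networkx as nx

H = nx.DiGraph()
H.add_node("c1", node_type="claim", claim_date=1, fraud=1)
H.add_node("c2", node_type="claim", claim_date=2, fraud=0)
H.add_node("shop:S1", node_type="repair_shop")
H.add_edge("shop:S1", "c1", timestamp=1)
H.add_edge("shop:S1", "c2", timestamp=2)
H.add_edge("c1", "c2", timestamp=2)

types = sorted({d["node_type"] for _, d in H.nodes(data=True)})
N = H.number_of_nodes()
rows, y, claim_mask = [], [], []
for n, d in H.nodes(data=True):
    onehot = [1.0 if d["node_type"] == t else 0.0 for t in types]
    degrees = [H.in_degree(n) / N, H.out_degree(n) / N]
    # fraction of strictly older claim->claim predecessors labeled fraud
    older = [p for p in H.predecessors(n)
             if H.nodes[p].get("node_type") == "claim"
             and H.nodes[p]["claim_date"] < d.get("claim_date", -1)]
    prior = float(np.mean([H.nodes[p]["fraud"] for p in older])) if older else 0.0
    rows.append(onehot + degrees + [prior])
    y.append(d.get("fraud", 0))
    claim_mask.append(d["node_type"] == "claim")

X = np.array(rows, dtype=np.float32)
print(X[1, -1])  # 1.0: c2's only older claim neighbour (c1) is fraud
```

Note that the prior-fraud column only looks backwards in time, consistent with the leak-free design.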
3.4 Build PyG Data
Translate edges (u→v) into a 2×E integer tensor edge_index using the node indices, and add self-loops so each node also keeps its own features at every layer. Pack everything into a PyG Data(x, edge_index, y, claim_mask) object. Edges are directed, so message passing respects time (earlier→later).
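A sketch of the tensor construction in plain torch, with an invented toy index map; in the pipeline the resulting tensor, together with x, y, and claim_mask, is packed into a torch_geometric Data object:

```python
import torch

idx = {"c1": 0, "c2": 1, "shop:S1": 2}          # from the indexing step
edges = [("shop:S1", "c1"), ("shop:S1", "c2"), ("c1", "c2")]  # u -> v

src = [idx[u] for u, _ in edges]
dst = [idx[v] for _, v in edges]
# self-loops so each node also keeps its own features at every layer
src += list(range(len(idx)))
dst += list(range(len(idx)))

edge_index = torch.tensor([src, dst], dtype=torch.long)  # shape (2, E)
print(edge_index.shape)  # torch.Size([2, 6])
```

With 3 original edges plus 3 self-loops, E = 6, and every directed edge still points from older to newer.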
3.5 GraphSAGE
We implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer: two GraphSAGE convolution layers (mean aggregation), ReLU, dropout, then a linear head to predict fraud vs non-fraud. We train full-batch (no neighbor sampling). The loss is weighted to handle class imbalance and is computed only on claim nodes via claim_mask. After each epoch we evaluate on the validation split and choose the decision threshold that maximizes F1; we keep the best model by val-F1 (early stopping).

3.6 Inference results
Evaluate the best model on the test split using the validation-chosen threshold. Report accuracy, precision, recall, F1, and the confusion matrix. Produce a lift table/plot (how concentrated fraud is by score decile), and export a t-SNE plot of claim embeddings to visualize structure.

The lift chart evaluates how well the model ranks fraud: bars show lift by score decile and the line shows cumulative fraud capture. In the top 10–20% of claims (deciles 1–2), the fraud rate is about 2–3× the average, suggesting that reviewing the top 20–30% of claims would capture a large share of fraud. The t-SNE plot shows several clusters where fraud concentrates, indicating the model learns meaningful relational patterns, while overlap with non-fraud points highlights remaining ambiguity and opportunities for feature or model tuning.
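A sketch of how such a lift table can be computed; the helper name and the synthetic scores/labels are invented for illustration:

```python
import numpy as np
import pandas as pd

def lift_table(scores, labels, n_bins=10):
    """Fraud rate and lift per score decile (decile 1 = highest scores)."""
    df = pd.DataFrame({"score": scores, "fraud": labels})
    ranks = df["score"].rank(method="first", ascending=False)
    df["decile"] = pd.qcut(ranks, n_bins, labels=range(1, n_bins + 1))
    base_rate = df["fraud"].mean()
    out = df.groupby("decile", observed=True)["fraud"].mean().to_frame("fraud_rate")
    out["lift"] = out["fraud_rate"] / base_rate
    return out

# synthetic check: all fraud sits in the 10 highest scores
scores = np.arange(100, 0, -1)
labels = np.array([1] * 10 + [0] * 90)
table = lift_table(scores, labels)
print(round(table.loc[1, "lift"], 2))  # 10.0
```

On real scores, a lift of 2–3 in decile 1 reads as "the top 10% of claims contain 2–3× the average fraud rate".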
…
Conclusion
Using a graph that only connects older claims to newer claims (past to future) without "leaking" future fraud information, the model successfully concentrates fraud cases in the top scoring groups, achieving about 2–3 times better detection in the top 10–20%. This setup is reliable enough to deploy.
As a test, it is possible to try a version where the graph is two-way or undirected (connections in both directions) and compare the spurious improvement against the one-way version. If the two-way version gets significantly better results, it is likely due to "temporal leakage," meaning future information is wrongly influencing the model. This is a way to demonstrate why two-way connections should not be used in real use cases.
To avoid making the article too long, we will cover the experiments with and without leakage in a separate article. Here, we focus on developing a model that meets production readiness.
There is still room to improve with richer features, calibration, and small model tweaks, but our focus here is to explain a leak-safe temporal graph methodology that addresses data leakage.
References
[1] Gomes-Gonçalves, E. (2025, January 23). Applications and Opportunities of Graphs in Insurance. Medium. Retrieved September 11, 2025, from https://medium.com/@erikapatg/applications-and-opportunities-of-graphs-in-insurance-0078564271ab
[2] Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804.
[3] Guignard, F., Ginsbourger, D., Levy Häner, L., & Herrera, J. M. (2024). Some combinatorics of data leakage induced by clusters. Stochastic Environmental Research and Risk Assessment, 38(7), 2815–2828.
[4] Huang, S., et al. (2024). UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs. arXiv preprint arXiv:2407.12269. https://arxiv.org/abs/2407.12269
[5] Labonne, M. (2022). GraphSAGE: Scaling up Graph Neural Networks. Towards Data Science. Retrieved from https://towardsdatascience.com/introduction-to-graphsage-in-python-a9e7f9ecf9d7/
[6] An Introduction to GraphSAGE. (2025). Weights & Biases. Retrieved from https://wandb.ai/graph-neural-networks/GraphSAGE/reviews/An-Introduction-to-GraphSAGE–Vmlldzo1MTEwNzQ1