In my previous article [1], I introduced quite a lot of concepts around constructing structured graphs, primarily focused on descriptive or unsupervised exploration of knowledge through graph construction. However, when we use graph features to enhance our models, the temporal nature of the data must be taken into account. If we want to avoid undesired effects, we need to be careful not to leak future information into our training process. This means our graph (and the features derived from it) must be built in a time-aware, incremental manner.
Data leakage is such a pervasive problem that a 2023 study by Sayash Kapoor and Arvind Narayanan [2] found that, up to that moment, it had affected 294 research papers across 17 scientific fields. They classify the types of data leakage, ranging from textbook errors to open research problems.
The problem is that during prototyping, results often seem very promising when they really are not. Most of the time, people don't realize this until models are deployed in production, wasting the time and resources of an entire team. Then, performance usually falls short of expectations without anyone understanding why. This issue can become the Achilles' heel that undermines entire business AI initiatives.
…
ML-based leakage
Data leakage occurs when the training data contains information about the output that won't be available during inference. This causes overly optimistic evaluation metrics during development, creating misleading expectations. However, when deployed in real-time systems with the correct data flow, the model's predictions become untrustworthy because it learned from information that is no longer available.
Ethically, we should strive to produce results that truly reflect the capabilities of our models, rather than sensational or misleading findings. When a model moves from prototyping to production, it should not fail to generalize properly; if it does, its practical value is undermined. Models that fail to generalize well can exhibit significant problems during inference or deployment, compromising their usefulness.
This is especially dangerous in sensitive contexts like fraud detection, which often involve imbalanced data scenarios (with fewer fraud cases than non-fraud). In these situations, the harm caused by data leakage is more pronounced because the model might overfit to leaked information related to the minority class, producing seemingly good results for the minority label, which is the hardest to predict. This can lead to missed fraud detections, resulting in serious practical consequences.
Data leakage examples can be categorized into textbook errors and open research problems [2] as follows:
Textbook Errors:
- Imputing missing values using the entire dataset instead of only the training set, causing information about the test data to leak into training.
- Duplicated or very similar instances appearing in both training and test sets, such as images of the same object taken from slightly different angles.
- Lack of clear separation between training and test datasets, or no test set at all, leading to models having access to test information before evaluation.
- Using proxies of outcome variables that indirectly reveal the target variable.
- Random data splitting in scenarios where multiple related records belong to a single entity, such as multiple claim status events from the same customer.
- Synthetic data augmentation performed over the whole dataset, instead of only on the training set.
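To make the first textbook error concrete, here is a minimal, hypothetical sketch using scikit-learn's SimpleImputer; the tiny array and the train/test cut are invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: first 3 rows are the "training" split, last 2 the "test" split
X = np.array([[1.0], [2.0], [np.nan], [100.0], [200.0]])
X_train, X_test = X[:3], X[3:]

# Leaky: the imputer sees the test rows, so their statistics
# contaminate the value filled into the training set
leaky = SimpleImputer(strategy="mean").fit(X)

# Leak-free: statistics come from the training split only
safe = SimpleImputer(strategy="mean").fit(X_train)

print(leaky.statistics_[0])  # 75.75 (pulled up by the test rows)
print(safe.statistics_[0])   # 1.5  (training rows only)
```

The leaky imputer fills the training gap with a value it could only know by peeking at the test set, exactly the pattern described above.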
Open problems for research:
- Temporal leakage occurs when future data unintentionally influences training. In such cases, strict separation is challenging because timestamps can be noisy or incomplete.
- Updating database records without lineage or an audit trail (for example, changing a fraud status without storing its history) can cause models to train on future or altered data unintentionally.
- Complex real-world data integration and pipeline issues that introduce leakage through misconfiguration or lack of controls.
These cases are part of a broader taxonomy reported in machine learning research, highlighting data leakage as a critical and often underinvestigated risk for reliable modeling [3]. Such issues arise even with simple tabular data, and they can remain hidden when working with many features if each one is not individually checked.
Now, let's consider what happens when we include nodes and edges in the equation…
…
Graph-based leakage
In the case of graph-based models, leakage can be sneakier than in traditional tabular settings. When features are derived from connected components or topological structures, using future nodes or edges can silently alter the graph's structure. For example:
- Methodologies such as graph neural networks (GNNs) learn context not only from individual nodes but also from their neighbours, which can inadvertently introduce leakage if sensitive or future information is propagated across the graph structure during training.
- Overwriting or updating the graph structure without preserving past events means the model loses valuable context needed for proper temporal analysis; it may again access information at the wrong time, or lose traceability about possible leakage or problems with the data from which the graphs originate.
- Computing graph aggregations like degree, triangles, or PageRank on the entire graph without accounting for the temporal dimension (time-agnostic aggregation) uses all edges: past, present, and future. This causes data leakage because features include information from future edges that wouldn't be available at prediction time.
Graph temporal leakage occurs when features, edges, or node relationships from future time points are included during training in a way that violates the chronological order of events. This results in edges or training features that incorporate data from time steps that should be unknown.
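A minimal NetworkX sketch of the time-agnostic aggregation problem; the node names, timestamps, and cutoff are invented for illustration:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("claim_1", "claim_2", timestamp=1)
G.add_edge("claim_1", "claim_3", timestamp=5)  # future edge w.r.t. cutoff t=2

t = 2
# Time-agnostic (leaky): counts every edge, including future ones
leaky_degree = G.out_degree("claim_1")

# Time-aware: rebuild the graph from edges visible at time t only
H = nx.DiGraph()
H.add_edges_from((u, v, d) for u, v, d in G.edges(data=True)
                 if d["timestamp"] <= t)
safe_degree = H.out_degree("claim_1")

print(leaky_degree, safe_degree)  # 2 1
```

The degree feature computed on the full graph already "knows" about an edge that does not exist yet at prediction time.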
…
How can this be fixed?
We can build a single graph that captures the entire history by assigning timestamps or time intervals to edges. To analyze the graph up to a specific point in time (t), we "look back in time" by filtering the graph to include only the events that occurred at or before that cutoff. This approach is ideal for preventing data leakage because it ensures that only past and present information is used for modeling. Additionally, it offers flexibility in defining different time windows for safe and proper temporal analysis.
In this article, we build a temporal graph of insurance claims where the nodes represent individual claims, and temporal links are created when two claims share an entity (e.g., phone number, license plate, repair shop, etc.) to ensure the correct event order. Graph-based features are then computed to feed fraud prediction models, carefully avoiding the use of future information (no peeking).
The idea is simple: if two claims share a common entity and one occurs before the other, we connect them at the moment this connection becomes visible (Figure 1). As we explained in the previous section, the way we model the data is crucial, not only to capture what we are actually looking for, but also to enable the use of advanced methods such as Graph Neural Networks (GNNs).

In our graph model, we save the timestamp when an entity is first seen, capturing the moment it appears in the data. However, in many real-world scenarios, it is also useful to consider a time interval spanning the entity's first and last appearances (for example, generated with another variable like plate or email). This interval can provide richer temporal context, reflecting the lifespan or active period of nodes and edges, which is valuable for dynamic temporal graph analyses and advanced model training.
Code
The code is available in this repository: Link to the repository
To run the experiments, set up a Python ≥3.11 environment with the required libraries (e.g., torch, torch-geometric, networkx, etc.). It is recommended to use a virtual environment (via venv or conda) to keep dependencies isolated.
Code Pipeline
The diagram in Figure 2 shows the end-to-end workflow for fraud detection with GraphSAGE. Step 1 loads the (simulated) raw claims data. Step 2 builds a time-stamped directed graph (entity→claim and older-claim→newer-claim). Step 3 performs temporal slicing to create train, validation, and test sets, then indexes nodes, builds features, and finally trains and validates the model.

Data

Step 1: Simulated Fraud Dataset
We first simulate a dataset of insurance claims. Each row in the dataset represents a claim and includes variables such as:
- Entities: insurer_license_plate, insurer_phone_number, insurer_email, insurer_address, repair_shop, bank_account, claim_location, third_party_license_plate
- Core information: claim_id, claim_date, type_of_claim, insurer_id, insurer_name
- Target: fraud (binary variable indicating whether the claim is fraudulent or not)
These entity attributes act as potential links between claims, allowing us to infer connections through shared values (e.g., two claims using the same repair shop or phone number). By modeling these implicit relationships as edges in a graph, we can build powerful topological representations that capture suspicious behavioral patterns and enable downstream tasks such as feature engineering or graph-based learning.
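As a hypothetical sketch of how such shared-value links could be inferred with pandas (column names follow the simulated schema above; the three rows are invented):

```python
import pandas as pd

claims = pd.DataFrame({
    "claim_id": ["c1", "c2", "c3"],
    "claim_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "repair_shop": ["SHOP_1", "SHOP_1", "SHOP_2"],
    "insurer_phone_number": ["111", "222", "222"],
})

entity_cols = ["repair_shop", "insurer_phone_number"]
links = []
for col in entity_cols:
    for value, grp in claims.groupby(col):
        ids = grp.sort_values("claim_date")["claim_id"].tolist()
        # connect consecutive claims sharing this entity, older -> newer
        links += [(a, b, col, value) for a, b in zip(ids, ids[1:])]

print(links)
# [('c1', 'c2', 'repair_shop', 'SHOP_1'),
#  ('c2', 'c3', 'insurer_phone_number', '222')]
```

Sorting by claim_date before pairing is what keeps every inferred link pointing from past to future.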


Step 2: Graph Modeling
We use the NetworkX library to build our graph model. For small-scale examples, NetworkX is sufficient and effective. For more advanced graph processing, tools like Memgraph or Neo4j could be used. To model with NetworkX, we create nodes and edges representing entities and their relationships, enabling network analysis and visualization within Python.
So, we have:
- one node per claim, with node key equal to the claim_id and attributes node_type and claim_date
- one node per entity value (phone, plate, bank account, shop, etc.), with node key "{column_name}:{value}" and attributes node_type (e.g., "insurer_phone_number", "bank_account", "repair_shop") and label (just the raw value without the prefix)

The graph includes these two types of edges:
- claim_id(t-1) → claim_id(t): when two claims share an entity (with edge_type='claim-claim')
- entity_value → claim_id: direct link to the shared entity (with edge_type='entity-claim')

These edges are annotated with:
- edge_type: to distinguish the relation (claim → claim vs entity → claim)
- entity_type: the column from which the value comes (like bank_account)
- shared_value: the actual value (like a phone number or license plate)
- timestamp: when the edge was added (based on the current claim's date)
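A simplified NetworkX sketch of this construction; the add_claim helper and its tiny inputs are hypothetical, but the node keys, edge types, and edge attributes follow the scheme just described:

```python
import networkx as nx

G = nx.DiGraph()

def add_claim(G, claim_id, claim_date, entities):
    """Add a claim node, its entity nodes, and time-stamped edges."""
    G.add_node(claim_id, node_type="claim", claim_date=claim_date)
    for col, value in entities.items():
        key = f"{col}:{value}"
        G.add_node(key, node_type=col, label=value)
        # older claims already attached to this entity -> new claim
        for _, older, d in list(G.out_edges(key, data=True)):
            if d["edge_type"] == "entity-claim" and older != claim_id:
                G.add_edge(older, claim_id, edge_type="claim-claim",
                           entity_type=col, shared_value=value,
                           timestamp=claim_date)
        # entity -> claim edge, visible at the claim's date
        G.add_edge(key, claim_id, edge_type="entity-claim",
                   entity_type=col, shared_value=value, timestamp=claim_date)

add_claim(G, "c1", 1, {"repair_shop": "SHOP_1"})
add_claim(G, "c2", 2, {"repair_shop": "SHOP_1"})

print(G.edges["c1", "c2"]["edge_type"])  # claim-claim
```

Because claim-claim edges are created only when the newer claim arrives, each edge's timestamp marks the moment the connection became visible.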
To interpret our simulation, we implemented a script that generates explanations for why a claim is flagged as fraud. In Figure 4, claim 20000695 is considered risky mainly because it is associated with repair shop SHOP_856, which acts as an active hub with multiple claims linked around similar dates, a pattern often seen in fraud "bursts." Additionally, this claim shares a license plate and address with several other claims, creating dense connections to other suspicious cases.

This code saves the graph as a pickle file: temporal_graph_with_edge_attrs.gpickle.
Step 3: Graph Preparation & Training
Representation learning transforms complex, high-dimensional data (like text, images, or sensor readings) into simplified, structured formats (often called embeddings) that capture meaningful patterns and relationships. These learned representations improve model performance, interpretability, and the ability to transfer learning across different tasks.
We train a neural network to map each input to a vector in ℝᵈ that encodes what matters. In our pipeline, GraphSAGE performs representation learning on the claim graph: it aggregates information from a node's neighbours (shared phones, shops, plates, etc.) and combines that with the node's own attributes to produce a node embedding. These embeddings are then fed to a small classifier head to predict fraud.
3.1 Temporal slicing
From the single full graph created in Step 2, we extract three time-sliced subgraphs for train, validation, and test. For each split we choose a cutoff date and keep only (1) claim nodes with claim_date ≤ cutoff, and (2) edges whose timestamp ≤ cutoff. This produces a time-consistent subgraph for that split: no information from the future leaks into the past, matching how the model would run in production with only historical data available.
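A minimal sketch of this slicing step, assuming node and edge attributes named claim_date and timestamp as described above (the toy graph contents are invented):

```python
import networkx as nx

def time_slice(G, cutoff):
    """Copy of G with only claims and edges visible at `cutoff`."""
    keep = [n for n, d in G.nodes(data=True)
            if d.get("node_type") != "claim" or d["claim_date"] <= cutoff]
    H = G.subgraph(keep).copy()
    H.remove_edges_from([(u, v) for u, v, d in H.edges(data=True)
                         if d["timestamp"] > cutoff])
    return H

G = nx.DiGraph()
G.add_node("c1", node_type="claim", claim_date=1)
G.add_node("c2", node_type="claim", claim_date=9)  # future claim
G.add_node("shop:S1", node_type="repair_shop")
G.add_edge("shop:S1", "c1", timestamp=1)
G.add_edge("shop:S1", "c2", timestamp=9)

train = time_slice(G, cutoff=5)
print(sorted(train.nodes()))  # ['c1', 'shop:S1']
```

Calling time_slice three times with the train, validation, and test cutoffs yields the three leak-free splits.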
3.2 Node indexing
Give every node in the sliced graph an integer index 0…N-1. This is just an ID mapping (like tokenization). We'll use these indices to align features, labels, and edges in tensors.
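The mapping itself is a one-liner; a sketch with illustrative node keys:

```python
import networkx as nx

H = nx.DiGraph()
H.add_nodes_from(["c1", "c2", "repair_shop:SHOP_1"])

# node key -> integer index 0..N-1 (NetworkX preserves insertion order)
idx = {n: i for i, n in enumerate(H.nodes())}
rev = {i: n for n, i in idx.items()}  # inverse map, to interpret predictions

print(idx)  # {'c1': 0, 'c2': 1, 'repair_shop:SHOP_1': 2}
```

Keeping the inverse map around makes it easy to translate tensor rows back into claim IDs at inference time.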
3.3 Build node features
Create one feature row per node:
- Type one-hot (claim, phone, email, …).
- Degree stats computed within the sliced graph: normalized in-degree, out-degree, and undirected degree.
- Prior fraud from older neighbors (claims only): the fraction of older connected claims (direct claim→claim predecessors) that are labeled fraud, considering only neighbors that existed before the current claim's time.

We also set the label y (1/0) for claims and 0 for entities, and mark claims in claim_mask so loss/metrics are computed only on claims.
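A condensed sketch of this feature matrix on a toy graph (the real pipeline has more node types and also an undirected-degree column, omitted here for brevity):

```python
import numpy as np
import networkx as nx

H = nx.DiGraph()
H.add_node("c1", node_type="claim", claim_date=1, fraud=1)
H.add_node("c2", node_type="claim", claim_date=2, fraud=0)
H.add_node("shop:S1", node_type="repair_shop")
H.add_edge("shop:S1", "c1", timestamp=1)
H.add_edge("shop:S1", "c2", timestamp=2)
H.add_edge("c1", "c2", timestamp=2)

types = sorted({d["node_type"] for _, d in H.nodes(data=True)})
N = H.number_of_nodes()
rows, y, claim_mask = [], [], []
for n, d in H.nodes(data=True):
    onehot = [1.0 if d["node_type"] == t else 0.0 for t in types]
    degrees = [H.in_degree(n) / N, H.out_degree(n) / N]
    # fraction of strictly older claim->claim predecessors labeled fraud
    older = [p for p in H.predecessors(n)
             if H.nodes[p].get("node_type") == "claim"
             and H.nodes[p]["claim_date"] < d.get("claim_date", -1)]
    prior = float(np.mean([H.nodes[p]["fraud"] for p in older])) if older else 0.0
    rows.append(onehot + degrees + [prior])
    y.append(d.get("fraud", 0))
    claim_mask.append(d["node_type"] == "claim")

X = np.array(rows, dtype=np.float32)
print(X[1, -1])  # 1.0: c2's only older claim neighbour (c1) is fraud
```

Note that the prior-fraud column only looks backwards in time, consistent with the leak-free design.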
3.4 Build PyG Data
Translate edges (u→v) into a 2×E integer tensor edge_index using the node indices, and add self-loops so each node also keeps its own features at every layer. Pack everything into a PyG Data(x, edge_index, y, claim_mask) object. Edges are directed, so message passing respects time (earlier→later).
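A sketch of the tensor construction in plain torch, with an invented toy index map; in the pipeline the resulting tensor, together with x, y, and claim_mask, is packed into a torch_geometric Data object:

```python
import torch

idx = {"c1": 0, "c2": 1, "shop:S1": 2}          # from the indexing step
edges = [("shop:S1", "c1"), ("shop:S1", "c2"), ("c1", "c2")]  # u -> v

src = [idx[u] for u, _ in edges]
dst = [idx[v] for _, v in edges]
# self-loops so each node also keeps its own features at every layer
src += list(range(len(idx)))
dst += list(range(len(idx)))

edge_index = torch.tensor([src, dst], dtype=torch.long)  # shape (2, E)
print(edge_index.shape)  # torch.Size([2, 6])
```

With 3 original edges plus 3 self-loops, E = 6, and every directed edge still points from older to newer.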
3.5 GraphSAGE
We implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer: two GraphSAGE convolution layers (mean aggregation), ReLU, dropout, then a linear head to predict fraud vs non-fraud. We train full-batch (no neighbor sampling). The loss is weighted to handle class imbalance and is computed only on claim nodes via claim_mask. After each epoch we evaluate on the validation split and choose the decision threshold that maximizes F1; we keep the best model by val-F1 (early stopping).

3.6 Inference results
Evaluate the best model on the test split using the validation-chosen threshold. Report accuracy, precision, recall, F1, and the confusion matrix. Produce a lift table/plot (how concentrated fraud is by score decile), and export a t-SNE plot of claim embeddings to visualize structure.

The lift chart evaluates how well the model ranks fraud: bars show lift by score decile and the line shows cumulative fraud capture. In the top 10–20% of claims (deciles 1–2), the fraud rate is about 2–3× the average, suggesting that reviewing the top 20–30% of claims would capture a large share of fraud. The t-SNE plot shows several clusters where fraud concentrates, indicating the model learns meaningful relational patterns, while overlap with non-fraud points highlights remaining ambiguity and opportunities for feature or model tuning.
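A sketch of how such a lift table can be computed; the helper name and the synthetic scores/labels are invented for illustration:

```python
import numpy as np
import pandas as pd

def lift_table(scores, labels, n_bins=10):
    """Fraud rate and lift per score decile (decile 1 = highest scores)."""
    df = pd.DataFrame({"score": scores, "fraud": labels})
    ranks = df["score"].rank(method="first", ascending=False)
    df["decile"] = pd.qcut(ranks, n_bins, labels=range(1, n_bins + 1))
    base_rate = df["fraud"].mean()
    out = df.groupby("decile", observed=True)["fraud"].mean().to_frame("fraud_rate")
    out["lift"] = out["fraud_rate"] / base_rate
    return out

# synthetic check: all fraud sits in the 10 highest scores
scores = np.arange(100, 0, -1)
labels = np.array([1] * 10 + [0] * 90)
table = lift_table(scores, labels)
print(round(table.loc[1, "lift"], 2))  # 10.0
```

On real scores, a lift of 2–3 in decile 1 reads as "the top 10% of claims contain 2–3× the average fraud rate".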
…
Conclusion
Using a graph that only connects older claims to newer claims (past to future) without "leaking" future fraud information, the model successfully concentrates fraud cases in the top scoring groups, achieving about 2–3 times better detection in the top 10–20%. This setup is reliable enough to deploy.
As a test, it is possible to try a version where the graph is two-way or undirected (connections in both directions) and compare the spurious improvement against the one-way version. If the two-way version gets significantly better results, it is likely due to "temporal leakage," meaning future information is wrongly influencing the model. This is a way to demonstrate why two-way connections should not be used in real use cases.
To avoid making the article too long, we will cover the experiments with and without leakage in a separate article. Here, we focus on developing a model that meets production readiness.
There is still room to improve with richer features, calibration, and small model tweaks, but our focus here is to explain a leak-safe temporal graph methodology that addresses data leakage.
References
[1] Gomes-Gonçalves, E. (2025, January 23). Applications and Opportunities of Graphs in Insurance. Medium. Retrieved September 11, 2025, from https://medium.com/@erikapatg/applications-and-opportunities-of-graphs-in-insurance-0078564271ab
[2] Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804.
[3] Guignard, F., Ginsbourger, D., Levy Häner, L., & Herrera, J. M. (2024). Some combinatorics of data leakage induced by clusters. Stochastic Environmental Research and Risk Assessment, 38(7), 2815–2828.
[4] Huang, S., et al. (2024). UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs. arXiv preprint arXiv:2407.12269. https://arxiv.org/abs/2407.12269
[5] Labonne, M. (2022). GraphSAGE: Scaling up Graph Neural Networks. Towards Data Science. Retrieved from https://towardsdatascience.com/introduction-to-graphsage-in-python-a9e7f9ecf9d7/
[6] An Introduction to GraphSAGE. (2025). Weights & Biases. Retrieved from https://wandb.ai/graph-neural-networks/GraphSAGE/reviews/An-Introduction-to-GraphSAGE–Vmlldzo1MTEwNzQ1