With the FIFA World Cup set to kick off on Thursday, June 11, 2026, with the opening match at Mexico City Stadium, I think it will be fun to build the best ML model we can to predict match outcomes. To do this, I've brought together several databases (49,000 matches) with data on Elo ratings, match results, and cup locations. From the FIFA World Cup to the Baltic Cup, with matches from 1872 to 2026, we will take a probabilistic approach to the sport.
We'll compare the performance of several ML models, including:
- multinomial regression
- multinomial ridge / elastic-net model
- LightGBM
We will also work to understand the strengths and weaknesses of our models to create a well-calibrated model that correctly predicts home wins 86% of the time. By weighing model performance, calibration, and complexity, we will find the best model for our data.
Soccer by the Numbers

Lots of people say soccer is sleep-inducing. As a soccer fan, I disagree, but to be fair, this isn't without reason. The majority of matches end with fewer than 5 goals, and anything above 20 is an anomaly, if not impossible. In contrast, it's not uncommon for one player to score more than 50 points in an NBA game. But despite the pace, bars from pubs in England to botecos in Rio stay full.
What critics don't understand is that low scoring can make a game more interesting, as it makes it harder for teams to gain a substantial lead, keeping fans on edge until the end. Unfortunately, this also means matches end in a draw close to 22% of the time, which can also be infuriating. Yet the sport remains as popular as ever.

The fact that so many matches end in a draw actually becomes a modeling problem later, but before we get to that, let's go over how we put this data together.
Stitching the data together
Oftentimes the best way to improve a model is to simply get more data. We will be working with international_results.csv, international_team_ratings.csv, and international_goalscorers.csv.
We want to match international_results.csv to international_team_ratings.csv so we can use Elo ratings. This could be simple, but as you might have guessed, the team names don't match up perfectly, so we need to turn to text processing unless we want to check 336 teams individually. We also need to be extremely careful about when the Elo rating was updated. We could take the Elo on the same day the match occurs, but that would be a source of data leakage, as Elo ratings are updated only after the match. Using it as a feature is tempting but problematic.
We must take the latest Elo rating from before the match, and as an additional engineered feature we keep track of the time since the latest Elo update, positing that more recent ratings are more informative than older ones. The code for joining these tables and the entire project is available in the Appendix.
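The leakage rule above can be sketched in base R. This is a minimal illustration, not the project's join code: the toy `ratings` table, its column names (`team`, `rating_date`, `rating`), and the helper `latest_rating_before()` are all assumptions made for the example.

```r
# Hypothetical toy ratings table; column names are assumptions, not the project's schema.
ratings <- data.frame(
  team = c("spain", "spain", "spain"),
  rating_date = as.Date(c("2024-01-01", "2024-03-10", "2024-03-20")),
  rating = c(2000, 2015, 2030)
)

# Return the latest rating dated strictly before the match, plus its age in days.
# Same-day ratings are excluded because they may already include the match result.
latest_rating_before <- function(team_name, match_date, ratings) {
  prior <- ratings[ratings$team == team_name & ratings$rating_date < match_date, ]
  if (nrow(prior) == 0) {
    return(data.frame(rating_pre_match = NA_real_, rating_age_days = NA_real_))
  }
  latest <- prior[which.max(as.numeric(prior$rating_date)), ]
  data.frame(
    rating_pre_match = latest$rating,
    rating_age_days = as.numeric(match_date - latest$rating_date)
  )
}

# A match played on 2024-03-20 must not see the rating published that day.
res <- latest_rating_before("spain", as.Date("2024-03-20"), ratings)
none <- latest_rating_before("spain", as.Date("2023-12-31"), ratings)
```

Here `res` picks up the 2024-03-10 rating (2015) with an age of 10 days, and a match earlier than any rating yields `NA`, which a model pipeline would need to handle explicitly.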

international_results.csv
| Field type | Examples |
|---|---|
| Match identification | source_match_id, date, season, competition |
| Teams | home_team, away_team |
| Final result | home_score, away_score, match_result, result_class |
| Context | neutral, tournament, city, country |
international_team_ratings.csv
| Feature | Meaning |
|---|---|
| home_rating_pre_match | Home team Elo before kickoff |
| away_rating_pre_match | Away team Elo before kickoff |
| rating_diff | Home Elo minus away Elo |
| rating_age_days_home | How stale the home team rating is |
| rating_age_days_away | How stale the away team rating is |
international_goalscorers.csv
| Feature idea | Meaning |
|---|---|
| Unique scorers in recent matches | Whether a team depends on one scorer or many |
| Goals by top scorer | Concentration of scoring |
| Recent scoring form | Attacking output before this match |

Because we are doing a time-series prediction, we need to ensure our split respects the time order. We'll evaluate our model on all games from 2018 onward, which is roughly 8,000 matches.
| Effective split | Approximate date logic |
|---|---|
| model train | earlier part of pre-2018 data |
| validation | latest ~20% of the pre-2018 training pool |
| test | 2018 onward |
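A chronological split along these lines might look like the sketch below. The toy `matches` table and the exact cutoff date of 2018-01-01 are assumptions for illustration; the project's own split code may differ in detail.

```r
# Toy match dates; the real table has roughly 49,000 rows.
matches <- data.frame(
  date = as.Date(c("2010-06-11", "2012-06-08", "2014-06-12", "2015-01-09",
                   "2016-06-10", "2017-11-14", "2018-06-14", "2022-11-20"))
)
matches <- matches[order(matches$date), , drop = FALSE]

# Everything from the (assumed) cutoff onward is held out as the test set.
test_start <- as.Date("2018-01-01")
is_test <- matches$date >= test_start

# The latest ~20% of the pre-2018 pool becomes validation; the rest is training.
pool <- matches[!is_test, , drop = FALSE]
n_valid <- ceiling(0.2 * nrow(pool))
valid_idx <- seq(nrow(pool) - n_valid + 1, nrow(pool))

train <- pool[-valid_idx, , drop = FALSE]
valid <- pool[valid_idx, , drop = FALSE]
test <- matches[is_test, , drop = FALSE]
```

Because the rows are sorted before slicing, every training match precedes every validation match, and every validation match precedes every test match, so no future information reaches the model during fitting.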
Engineered Features

We want to move from basic match-level predictors towards richer pre-match features that capture team strength, attacking and defensive quality, home/away effects, matchup balance, goalkeeper strength, and historical performance trends.
1. Draw-modeling features
The most glaring failure of our baseline multinomial logistic regression model was its weak performance at classifying draws. While the model could calculate the probability of a draw because we defined the target variable as match_result ∈ {H, D, A} (Home win, Draw, Away win), Draw was simply never the most likely outcome. We can see this by the missing column for Draws in the confusion matrix.

This poor draw performance is not specific to one model family. When we isolate high-confidence errors (cases where the model's predicted class was wrong and its maximum predicted probability was at least 0.60), the same pattern appears across models: they are systematically overconfident in home wins. Many matches that actually ended in draws were assigned a confident home-win prediction, suggesting that the models capture team-strength direction better than match-level uncertainty or draw likelihood.

To address this 'blindness' to the draw option, we can engineer features such as abs_rating_diff, home_draw_rate_last_5, form_draw_rate_mean_last_5, and binary context features like neutral, flag_is_world_cup, and flag_is_friendly, indicating whether the match is on neutral ground, at the World Cup, or a friendly.
| Feature group | Meaning | Examples |
|---|---|---|
| Elo closeness | Measures how evenly matched the teams are. Smaller rating gaps are especially relevant for draw probability. | abs_rating_diff |
| Recent draw tendency | Measures how often each team's prior matches ended in draws. | home_draw_rate_last_5, away_draw_rate_last_10 |
| Combined draw tendency | Captures whether both teams have recently been draw-prone. | form_draw_rate_mean_last_5, form_draw_rate_mean_last_10 |
| Match context | Tournament and venue indicators that may affect draw frequency. | neutral, flag_is_world_cup, flag_is_friendly |

With these features, our model can now better discriminate between Home/Away wins and draws, as evidenced by a 3.3% increase in true-positive draw predictions. This is still low, given that ~20% of matches end in draws. So our features help, but not by much. This suggests that it may be worth building a model dedicated to draw modeling, with the target variable match_result ∈ {D, ¬D}, but for now we need to engineer more features.
Here ¬D means "not D": the target variable is 1 if the match ends in a draw and 0 if it does not.
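As a small illustration of the Elo-closeness feature and the binary draw target described above, here is a base R sketch on made-up rows. The column `is_draw` is a name chosen for this example; the other columns follow the feature names used in this article.

```r
# Made-up matches with pre-match Elo ratings and observed results.
matches <- data.frame(
  home_rating_pre_match = c(1850, 1500, 1720),
  away_rating_pre_match = c(1600, 1510, 1980),
  match_result = c("H", "D", "A")
)

# Signed gap keeps direction (who is favored); absolute gap keeps only closeness.
matches$rating_diff <- matches$home_rating_pre_match - matches$away_rating_pre_match
matches$abs_rating_diff <- abs(matches$rating_diff)

# Binary target for a dedicated draw model: 1 = draw (D), 0 = not a draw.
matches$is_draw <- as.integer(matches$match_result == "D")
```

The second row, with a gap of only 10 Elo points, is the kind of evenly matched fixture where `abs_rating_diff` should push the draw probability up.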

2. Elo features
The average team has an Elo slightly above 1500; this is near Saudi Arabia, Iceland, and Haiti for the 2026 World Cup. When we graph the rating-difference distributions for Home wins, Draws, and Away wins, we can see that as the difference decreases, Draws become increasingly likely. Our distributions are also slightly shifted to the left, indicating a small home advantage, as expected.


We would be leaving log-loss points on the table if we relied on pre-match Elo as our only feature. To get the most from the data, we also use the rating difference and the age of each rating:
| Feature | Meaning |
|---|---|
| home_rating_pre_match | Home team Elo rating before kickoff. |
| away_rating_pre_match | Away team Elo rating before kickoff. |
| rating_diff | Home team Elo minus away team Elo before kickoff. Positive values favor the home team. |
| rating_age_days_home | Days since the home team's Elo rating was last updated. |
| rating_age_days_away | Days since the away team's Elo rating was last updated. |

3. Rolling past-performance features
A critic might argue that using both rolling past performance and Elo is not a good idea, since they both model team strength, which would add redundant or highly correlated features to the model.
Rolling past performance does capture team strength, but it is specifically there to help model team momentum. Winning streaks are a very real thing in sports. In fact, the current top pick by supercomputers is Spain. One reason they are predicted first is their historic 31-match unbeaten streak coming into the 2026 World Cup.
| Feature group | Meaning | Examples |
|---|---|---|
| Recent points per match | Average points earned over each team's previous 5 or 10 matches. | home_points_per_match_last_5, away_points_per_match_last_10 |
| Recent goal difference | Average goals scored minus goals conceded over prior matches. | home_goal_diff_per_match_last_5, away_goal_diff_per_match_last_10 |
| Recent draw rate | Share of prior matches that ended in a draw. | home_draw_rate_last_5, away_draw_rate_last_10 |
| Home-away form differences | Difference between the home and away teams on the same rolling metric. | form_points_diff_last_5, form_goal_diff_diff_last_10 |
| Prior match counts | Number of previous matches available before the fixture. | home_prior_matches, away_prior_matches |
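The rolling features in the table only stay leakage-safe if each window excludes the current match. The sketch below shows one way to do that in base R; `lagged_rolling_mean()` is a hypothetical helper written for this example, and the 3/1/0 points scheme is the standard football convention rather than something confirmed from the project code.

```r
# Mean of up to `k` previous values, excluding the current one.
# Returns NA for a team's first match, where no history exists.
lagged_rolling_mean <- function(x, k) {
  vapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)
    window <- x[max(1, i - k):(i - 1)]
    mean(window)
  }, numeric(1))
}

# One team's results in date order: 3 points for a win, 1 for a draw, 0 for a loss.
points <- c(3, 1, 0, 3, 3, 1, 3)

points_per_match_last_5 <- lagged_rolling_mean(points, 5)
draw_rate_last_5 <- lagged_rolling_mean(as.numeric(points == 1), 5)
```

For the seventh match, the window covers matches two through six, giving 1.6 points per match and a draw rate of 0.4. Early matches use shorter windows, which is one reason a prior-match count is useful as a companion feature.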
4. Attack and defense form features
While our model tried to capture attacking and defending team strength through points, this is where it falls short of supercomputer approaches. Modern approaches often also incorporate player data, which is invaluable in computing a team's strengths. Because we are working solely with game-level data, our attacking and defensive features are computed from previous match results: recent scoring rates, recent conceding rates, scoring-rate difference, and conceding-rate difference.
| Feature group | Meaning | Examples |
|---|---|---|
| Recent scoring rate | Average goals scored per match over the previous 5 or 10 matches. | home_goals_for_per_match_last_5, away_goals_for_per_match_last_10 |
| Recent conceding rate | Average goals conceded per match over the previous 5 or 10 matches. | home_goals_against_per_match_last_5, away_goals_against_per_match_last_10 |
| Scoring-rate difference | Home team's recent scoring rate minus away team's recent scoring rate. | form_goals_for_diff_last_5, form_goals_for_diff_last_10 |
| Conceding-rate difference | Home team's recent conceding rate minus away team's recent conceding rate. Lower values favor the home team defensively. | form_goals_against_diff_last_5, form_goals_against_diff_last_10 |

Grid Search
Because large search grids can overfit in cross-validation, and grid search scales multiplicatively, parameters are searched on a logarithmic scale (1e-5, 1e-4, 1e-3, 1e-2). The exception is parameters like alpha, which must lie between zero and one.
glmnet_alpha controls the elastic-net blend between ridge and lasso regression, where zero is pure ridge and one is pure lasso.
multinomial_decay penalizes large coefficients; larger values penalize them more. That can reduce overfitting, but excessive decay can lead to underfitting.
Grid search cost = number of configurations tested × time to train one model
| Model family | Grid/configurations shown | What was tuned |
|---|---|---|
| Baselines | majority_baseline, frequency_baseline, rating_diff_multinom | Mostly not tuned; comparison baselines |
| glmnet | alpha = 0, .25, .5, .75, 1 | Elastic-net mixing parameter |
| multinom | decay = 0, 1e-5, 1e-4, 1e-3, 1e-2 | L2 weight decay / coefficient shrinkage |
| LightGBM | less_regular, deeper, more_regular, current_final, l2_regularized, shallower, l1_l2_regularized, compact_robust, faster_small, slower_small | Named bundles of tree-depth, learning-rate, boosting-round, and regularization settings |
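To make the multiplicative cost concrete, here is a rough sketch in base R. The alpha and decay grids are the ones listed in the table; the joint LightGBM grid and the 30-second fit time are invented for illustration and are not the project's settings (the project used ten named bundles rather than a full Cartesian grid).

```r
# Grids from the table above.
glmnet_alpha <- c(0, 0.25, 0.5, 0.75, 1)
multinom_decay <- c(0, 10^seq(-5, -2))   # 0, 1e-5, 1e-4, 1e-3, 1e-2

# Hypothetical joint grid: every combination of three parameters is one configuration.
lgb_grid <- expand.grid(
  learning_rate = c(0.01, 0.05, 0.1),
  num_leaves = c(15, 31, 63),
  max_depth = c(4, 6)
)
n_configs <- nrow(lgb_grid)              # 3 * 3 * 2 = 18

# Cost = number of configurations x time to train one model (fit time is assumed).
seconds_per_fit <- 30
total_minutes <- n_configs * seconds_per_fit / 60
```

Adding one more parameter with three candidate values would triple the configuration count, which is a practical argument for hand-picked bundles over exhaustive grids.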
LightGBM was the most complex model family in the comparison. Unlike the baseline models, which used few or no tuning parameters, LightGBM required choices about tree complexity, learning rate, boosting rounds, and regularization. This made it more flexible, but also increased the risk of overfitting if the parameters were not tuned carefully. We also need to take care not to use a model that is more complicated than our data requires, as we could lose out on interpretability.
The GBM parameters were tuned by evaluating a compact grid of LightGBM configurations. These configurations varied tree complexity, learning rate, number of boosting rounds, and regularization strength, keeping the configuration with the lowest log loss. Below is a list of the LightGBM parameters.
| Parameter | Meaning |
|---|---|
| learning_rate | How much each new tree is allowed to change the model. Lower values learn more slowly but can generalize better. |
| num_iterations / nrounds | Number of boosting rounds, meaning how many trees are added. More trees can improve performance but can also overfit. |
| num_leaves | Controls how complex each tree can be. More leaves allow more detailed patterns but increase overfitting risk. |
| max_depth | Maximum depth of each tree. Deeper trees capture more complex interactions. Shallower trees are simpler and safer. |
| min_data_in_leaf | Minimum number of observations required in a leaf. Higher values make the model less sensitive to small noisy patterns. |
| lambda_l1 | L1 regularization. Pushes some effects toward zero, making the model simpler. |
| lambda_l2 | L2 regularization. Shrinks large effects and reduces overconfidence. |
| feature_fraction | Fraction of features used for each tree. Using fewer features can reduce overfitting. |
| bagging_fraction | Fraction of rows used for each tree. Using fewer rows can also reduce overfitting. |
| bagging_freq | How often row subsampling is applied. If set to 0, bagging is disabled. |


Final Model
The officially selected model was LightGBM with the safe_plus_form_compact feature set, using 20 pre-match features drawn from Elo ratings, match context, and lagged team summaries. It was selected based on the lowest validation-set multiclass log loss, with the test set reserved for final reporting.
The selected LightGBM model achieved a validation log loss of 0.893 and a test log loss of 0.873. Its validation result was the best in the model comparison, but the margin over regression was small: multinomial regression trailed by only about 0.002 log-loss points on validation. On the held-out test set, multinomial regression slightly outperformed LightGBM on both log loss and macro F1.
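For readers unfamiliar with the metric, multiclass log loss is the average negative log of the probability the model assigned to the outcome that actually happened. The base R sketch below is a generic implementation on made-up probabilities, not the project's evaluation code; `multiclass_log_loss()` is a helper defined here for illustration.

```r
# probs: matrix with one row per match and columns named H, D, A (rows sum to 1).
# actual: character vector of observed results.
multiclass_log_loss <- function(probs, actual, eps = 1e-15) {
  # Pick, for each row, the probability assigned to the observed class.
  idx <- cbind(seq_along(actual), match(actual, colnames(probs)))
  # Clip to avoid log(0) when a model assigns zero probability.
  p_true <- pmin(pmax(probs[idx], eps), 1 - eps)
  -mean(log(p_true))
}

classes <- c("H", "D", "A")
probs <- matrix(
  c(0.60, 0.25, 0.15,
    0.40, 0.30, 0.30,
    0.20, 0.25, 0.55),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, classes)
)
actual <- c("H", "D", "A")

loss <- multiclass_log_loss(probs, actual)

# Reference point: always predicting 1/3 for each class gives log(3), about 1.099.
uniform <- matrix(1 / 3, nrow = 3, ncol = 3, dimnames = list(NULL, classes))
uniform_loss <- multiclass_log_loss(uniform, actual)
```

The uniform-guess value of log(3) is a useful yardstick when reading the 0.893 and 0.873 figures above: lower is better, and a model that knows nothing scores about 1.099.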

This means the result should be interpreted cautiously. LightGBM is the officially selected predictive model, but the evidence does not show that gradient boosting clearly dominates simpler regression models for the given data. Regression models remain highly important because they are easier to interpret and perform nearly as well as, and on some test metrics slightly better than, other methods.

Feature engineering produced similarly modest gains. Compact lagged features improved validation log loss relative to baseline, but the test improvement was tiny. Goalscorer features did not meaningfully improve log loss in the model comparison.

The clearest limitation was draw prediction. The selected model almost never predicted a draw as the top class: on the test set, it correctly predicted only 2 draws out of 1,784 actual draws, for a draw recall of 0.11%. This suggests that the model's probability estimates may still contain useful information, but argmax classification remains strongly biased toward home and away wins, making a separate model for draw modeling a reasonable next step. Elo and compact pre-match form provide a useful signal stack, but the gains over strong baselines are incremental.
The model is much better at predicting home wins than away wins on the test set:
- It correctly identifies about 87% of actual home wins
- It correctly identifies about 63% of actual away wins
The model is also capable of outputting a probability distribution over Home, Draw, and Away wins, which is often more useful than just a single hard prediction.
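The recall figures above come from turning probabilities into hard predictions with argmax and then comparing against the actual results. The sketch below shows that computation in base R on four made-up matches; only the last line uses numbers reported in this article.

```r
classes <- c("H", "D", "A")

# Made-up predicted probabilities; note that draw is never the largest entry.
probs <- matrix(
  c(0.55, 0.27, 0.18,
    0.38, 0.30, 0.32,
    0.41, 0.31, 0.28,
    0.20, 0.26, 0.54),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, classes)
)
actual <- factor(c("H", "D", "D", "A"), levels = classes)

# Argmax: pick the class with the highest predicted probability in each row.
predicted <- factor(classes[max.col(probs, ties.method = "first")], levels = classes)

# Rows are actual classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)

# Recall per class: correct predictions divided by actual occurrences.
recall <- diag(confusion) / rowSums(confusion)

# The reported test-set draw recall: 2 correct out of 1,784 actual draws.
reported_draw_recall_pct <- round(100 * 2 / 1784, 2)
```

In the toy data both actual draws are predicted as home wins, so draw recall is zero even though the model gave each of them a draw probability around 0.30.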
Calibration

The baseline-plus models are broadly well calibrated on the test set across confidence bins. Predicted confidence tracks observed accuracy: when the models are moderately confident, they are correct at roughly the corresponding rate, and when confidence rises, observed accuracy rises with it. The deviations from the ideal calibration line are modest, suggesting that the models' probability estimates are generally usable rather than just a rank-ordering of outcomes.
The plot below measures calibration of the top predicted class (the model's confidence in whichever outcome it chose), not calibration for home wins, draws, and away wins separately. A model can therefore look well calibrated overall while still misestimating one class, especially draws. The aggregate calibration plot supports the claim that the models' confidence scores are broadly trustworthy, but it does not, by itself, show that the draw probabilities are well calibrated.
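A top-class calibration table of the kind behind such a plot can be built by binning the model's maximum probability and comparing mean confidence with observed accuracy in each bin. The base R sketch below uses ten invented predictions and three arbitrary bins; the project's binning choices are not shown in this article.

```r
# Invented data: the model's top-class probability and whether that pick was right.
top_prob <- c(0.42, 0.45, 0.48, 0.55, 0.58, 0.62, 0.66, 0.71, 0.78, 0.85)
correct  <- c(0,    1,    0,    1,    0,    1,    1,    0,    1,    1)

# Assign each prediction to a confidence bin (bin edges are an arbitrary choice).
bins <- cut(top_prob, breaks = c(0.4, 0.6, 0.8, 1.0), include.lowest = TRUE)

# For a well-calibrated model, the two middle columns should be close in every bin.
calibration <- data.frame(
  bin = levels(bins),
  mean_confidence = as.vector(tapply(top_prob, bins, mean)),
  observed_accuracy = as.vector(tapply(correct, bins, mean)),
  n = as.vector(table(bins))
)
```

Repeating the same binning with a single class's probability and a "did this class occur" indicator gives the class-specific view, which is what exposes draw behavior separately.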

The class-specific calibration plots show where that aggregate picture holds and where it becomes more complicated. Home-win and away-win probabilities track the ideal calibration line closely across most bins: as the model assigns higher probability to either outcome, the observed frequency rises at roughly the same rate. In practical terms, the model's home and away probabilities behave like meaningful probabilities, not just scores.

Draws are different. The model's draw probabilities are reasonably calibrated within its range, but that range is narrow. It rarely assigns draw probabilities much above the low-to-middle range, even when the match is relatively balanced.
This is the central distinction: the model does not ignore draws; it usually treats them as a risk factor rather than a likely outcome. Draw probabilities may still be useful for measuring draw risk, but draws seldom become the model's top prediction, which helps explain the persistent weakness in draw recall.

Rating Difference Analysis
The rating-difference analysis shows why draws are structurally difficult for the model. Observed draw rates are highest when the teams are closely matched and decline as the absolute Elo rating gap widens. All three model families learn this broad pattern: their predicted draw probabilities also fall as matches become more lopsided.
The failure is not one of direction but of scale. In the most evenly matched fixtures, the observed draw rate is roughly one-third, while the models assign draw probabilities closer to one-quarter. They correctly identify balanced matches as more draw-prone, but they do not raise the draw probability enough. As a result, the model can recognize draw risk without often choosing a draw as the most likely outcome. This reconciles the apparent contradiction between reasonable draw calibration and weak draw recall: the probabilities move in the right direction, but usually not far enough to win the argmax decision, which picks the class with the highest predicted probability.
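A short derivation makes the argmax hurdle explicit. It uses only the fact that the three class probabilities sum to one; the symbols $p_H$, $p_D$, and $p_A$ are notation introduced here for the predicted home-win, draw, and away-win probabilities.

$$
p_H + p_D + p_A = 1 \;\Rightarrow\; \max(p_H, p_A) \ge \frac{p_H + p_A}{2} = \frac{1 - p_D}{2}
$$

$$
p_D \ge \max(p_H, p_A) \;\Rightarrow\; p_D \ge \frac{1 - p_D}{2} \;\Rightarrow\; p_D \ge \frac{1}{3}
$$

So a draw can only be the top class when its probability reaches at least one-third. At a predicted draw probability of about 0.25, the larger of the other two probabilities is at least 0.375 and always wins. Even a perfectly calibrated one-third would only tie, and only when the home and away probabilities are also exactly one-third.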

Feature Importance
As you might expect, the most important feature for our model is the rating difference, followed by whether the match was on neutral ground, a distant second. By checking the feature importance, we can see which of our engineered features provided meaningful signal.


Conclusion
I think this is a good time to discuss dataset size and model choice. Generally, the larger and more complex the dataset, the more reason we have to choose a more complicated model. As we saw in this example, the gains from switching from regression to LightGBM were very small; this is a good sign that trying an even more complex model on this data will not yield better predictions. Soccer forecasting is less about finding a magic algorithm and more about building leakage-safe features, comparing interpretable baselines, and asking whether the model's confidence is deserved.
For now, one thing is clear: we're going to need more data if we want a better prediction, particularly player-level data; knowing whether Neymar is sitting out is crucial. The granularity of the data is also important if we want to update our forecast as the game progresses.
Appendix
The code for the whole project can be found on my GitHub.
The data source has a Creative Commons CC0-1.0 license.
    # Normalizes team names so the results and ratings tables can be joined.
    make_team_clean <- function(team_name) {
      team_name |>
        stringr::str_squish() |>                       # trims and collapses whitespace
        stringi::stri_trans_general("Latin-ASCII") |>  # converts accented Latin characters to plain ASCII characters
        stringr::str_to_lower() |>                     # lowercases the name
        stringr::str_replace_all("[^a-z0-9]+", "_")    # replaces anything that isn't a lowercase letter or number with an underscore
    }


