Beyond the Straight Line: Choosing Between OLS, Interaction Terms, and Tweedie Regression

Imagine you are asked to build a machine learning model to predict how much money customers will spend on an e-commerce platform over the next year. The job is data science business as usual: load the data, clean it, understand it, and model the outcome.

The big question that comes to mind: which algorithm do you pull out of your toolkit?

For regression models, I think we tend to default to the classics like OLS or, at other times, jump straight to a complex ensemble method like XGBoost. But through experience, I have seen that a generalized linear framework is often exactly what you need for inference, speed, and interpretability.

The real challenge lies in choosing the right flavor.

Should we stick with a standard Ordinary Least Squares (OLS) regression?
Do we need to introduce interaction terms?
Or is our data weird enough to warrant a Tweedie regression?

Choosing the wrong one can lead to models that output impossible predictions, like a customer spending a negative amount of money.

Let's break down these three approaches so you can confidently pick the right tool for your specific data landscape.

Dataset

The dataset used for this exercise is the French Motor Third-Party Liability Claims dataset, from the R package CASdatasets, under license GPL >2.

Christophe Dutang and Arthur Charpentier (2026). CASdatasets: Insurance datasets, R package version 1.2-1, DOI 10.57745/P0KHAG.

# Basics
import pandas as pd
import numpy as np

# Dataviz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import plotly.express as px

# Stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats

# Preprocessing & Modeling
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_absolute_error
from feature_engine.encoding import OneHotEncoder

# Silence library warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')

To get the dataset, there is a code snippet you can use in this sklearn tutorial [4].

# Load Data
df = load_mtpl2(n_samples=250_000)

View of the dataset. Image by the author.

After exploring the data a little, there are a couple of useful transformations we can make: keeping only claim amounts under $500,000 and keeping the number of claims under 4. Values beyond those are extreme outliers and will definitely distort our predictions.

# Remove extreme outliers
df = df.query('ClaimAmount < 500000')
df = df.query('ClaimNb < 4')

1. Regular OLS: The Reliable Highway

Let's start with a baseline model. The simplest Ordinary Least Squares (OLS) regression is the bedrock of statistical modeling. If data science had a "default" settings button, it would be OLS.

But the downside is that it assumes the data will behave in a certain way.

For example, it will assume:

The target changes at a constant, predictable rate as your independent variables change.
If you increase the number of claims by 1, your Claim Amount goes up by a fixed amount, regardless of whether this is your first ever claim or not.

Mathematically, OLS looks for the line that minimizes the sum of the squared differences between your actual data points and that line.
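As a minimal sketch of that idea, the least-squares line can be computed directly with NumPy. The data here is synthetic and the variable names are assumptions for illustration; none of it comes from the claims dataset.

```python
import numpy as np

# Synthetic data (illustrative only): y = 3 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the coefficients minimizing sum((y - X @ beta) ** 2)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def sse(b):
    """Sum of squared errors for a candidate coefficient vector."""
    return float(np.sum((y - X @ b) ** 2))

print("intercept, slope:", beta.round(2))
print("SSE at the OLS solution:", round(sse(beta), 2))
print("SSE after nudging the slope:", round(sse(beta + np.array([0.0, 0.1])), 2))
```

Any nudge away from the fitted coefficients increases the sum of squared errors, which is what "least squares" means in practice.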

When does it work?

OLS is extremely powerful when your data meets a few core assumptions:

Linearity: The relationship between your features and your target is reasonably straight.
Homoscedasticity: The variance of your residuals (errors) is constant. In plain English, your model isn't wildly more inaccurate for large values than it is for small values.
Normal Distribution: The errors are normally distributed.

If you are predicting something continuous and stable (e.g., the height of a tree based on its age, or the temperature based on elevation), regular OLS is often your best bet. It is clean, fast, and easily understood.

For this problem, though, we will see it is not the best bet, given that our data is inflated with zero claim amounts.

# Define the OLS model formula
formula = "ClaimAmount ~ Exposure + VehPower + VehAge + DrivAge + BonusMalus + Density + C(Area) + C(VehBrand) + C(VehGas) + C(Region)"

# Fit the OLS model
ols_model = smf.ols(formula=formula, data=df).fit()

# Display the model summary
print(ols_model.summary())

# Store in-sample predictions for the model comparison later on
df['OLS_pred'] = ols_model.predict(df)

After running our baseline model, let's take a look at the results.

Looking at the OLS model's residuals, the histogram and Q-Q plot show that they are clearly not normally distributed, which is a common issue when dealing with datasets that have many zero values, like insurance claim amounts.

Histogram and Q-Q plot of the residuals: the model struggles with higher values. Image by the author.
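For readers who want numbers to go with the plots, here is a rough sketch on synthetic zero-inflated data. It does not use the claims dataset, and every name and parameter in it is hypothetical. It compares sample quantiles of standardized OLS residuals with the quantiles a normal distribution would give, which is the same comparison a Q-Q plot draws.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 5_000
x = rng.uniform(0, 1, size=n)

# Zero-inflated target: about 5% of rows carry a skewed positive amount
has_claim = rng.random(n) < 0.05
y = np.where(has_claim, rng.gamma(shape=2.0, scale=1_000.0, size=n), 0.0)

# Fit OLS and standardize the residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
z = (resid - resid.mean()) / resid.std()

# Compare sample quantiles with standard-normal quantiles
probs = [0.01, 0.50, 0.99]
sample_q = np.quantile(z, probs)
normal_q = np.array([NormalDist().inv_cdf(p) for p in probs])

for p, s, t in zip(probs, sample_q, normal_q):
    print(f"p={p:.2f}  sample={s:6.2f}  normal={t:6.2f}")

# Sample skewness: close to 0 for a symmetric distribution
skew = float(np.mean(z ** 3))
print("residual skewness:", round(skew, 2))
```

A large positive skewness and an upper quantile far beyond the normal one are the numeric counterpart of the right tail bending away from the diagonal in the Q-Q plot.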

2. OLS with Interaction Terms: The "It Depends" Factor

But what happens when the real world gets complicated? Let's say you are predicting house prices based on two features: the number of bedrooms and whether the house has a swimming pool.

A regular OLS model looks at these independently. It says, "An extra bedroom adds $50,000, and a pool adds $30,000." But is that always true? Does a pool add the same value to a 1-bedroom condo as it does to a 5-bedroom luxury estate?

Probably not. The value of the pool depends on the size of the house. This is where interaction terms come into play.

An interaction term tells your model that the effect of one variable changes depending on the level of another variable. In your data, you represent this by literally multiplying the two features together and adding the product as a new feature.
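In equation form, with two features x1 and x2, the model becomes y = b0 + b1*x1 + b2*x2 + b3*(x1*x2) + error. The sketch below revisits the house-price story with made-up numbers (the coefficients and data are assumptions chosen for illustration) to show the product column being added to the design matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000
bedrooms = rng.integers(1, 6, size=n).astype(float)   # 1 to 5 bedrooms
pool = rng.integers(0, 2, size=n).astype(float)       # 0 = no pool, 1 = pool

# Hypothetical truth: a pool is worth 10k plus an extra 15k per bedroom
price = (100_000 + 50_000 * bedrooms + 10_000 * pool
         + 15_000 * bedrooms * pool + rng.normal(0, 5_000, size=n))

# The interaction term is literally the product of the two features
X = np.column_stack([np.ones(n), bedrooms, pool, bedrooms * pool])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

b0, b_bed, b_pool, b_inter = beta
for k in (1, 5):
    print(f"estimated value of a pool with {k} bedroom(s): {b_pool + b_inter * k:,.0f}")
```

In a statsmodels formula, the same product can be written as `bedrooms:pool`, or as `bedrooms * pool` to include both main effects plus their product.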

Another way of looking at it is through an analogy with cooking. Interaction terms are like baking.

Flour on its own is dry.
Water on its own is wet.
But when you mix them together, you don't just get "dry-wet"; you get dough, a completely new property.

When should you use it?

You should upgrade your OLS model with interaction terms when you suspect synergy or antagonism between features.

Synergy: Two features together are more powerful than the sum of their parts (e.g., Marketing Spend × Holiday Season).
Antagonism: One feature dampens the effect of another (e.g., Dosage of Drug A × Dosage of Drug B).

If your regular OLS model is underperforming and you have strong domain knowledge suggesting that your variables talk to each other, don't just throw more data at it. Try adding an interaction term.

Well, in this case, let's see how it performs.

# Define the OLS model formula with interaction terms and without 'Region'
formula_no_region = "ClaimAmount ~ Exposure * VehBrand + Exposure * BonusMalus + DrivAge + BonusMalus + Density + C(Area) + VehPower + C(VehGas)"

# Fit the OLS model with two interaction terms and without 'Region'
ols_no_region_model = smf.ols(formula=formula_no_region, data=df).fit()

# Display the model summary
print(ols_no_region_model.summary())

# Store in-sample predictions for the model comparison later on
df['OLS_Int_Term_pred'] = ols_no_region_model.predict(df)

OLS with interaction terms results. Image by the author.

Despite introducing interaction terms, the OLS model's performance on the ClaimAmount data barely changes. This is due to the high concentration of zero claims and the heavily skewed distribution of the non-zero claims.

Ordinary Least Squares models assume normally distributed errors, an assumption that is clearly violated in this zero-inflated dataset. Consequently, adding interaction terms can capture more complex relationships, but it cannot model this data's true distribution, leading to only marginal improvements.

3. Tweedie Regression: Embracing the Messy Zeroes

What if your data isn't a neat, continuous bell curve? What if your data looks like a massive wall of zeroes followed by a long, trailing tail of positive numbers?

As we can see in our example, in any given year, the vast majority of insurance policyholders file exactly $0 in claims. They don't get into accidents. But for the small percentage who do, the claims aren't just $5 or $10. They can be thousands or tens of thousands of dollars.

If you try to fit a regular OLS line to this data, the model breaks down. Why? Because OLS assumes a normal distribution. To accommodate all those zeroes and a few massive outliers, OLS might start predicting that some low-risk customers will have negative claims.
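To make the negative-prediction problem concrete, here is a small sketch on synthetic data. The `risk` feature and every number in it are hypothetical, chosen only to produce a zero-inflated target whose claim probability rises sharply with risk.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
risk = rng.uniform(0, 1, size=n)          # hypothetical risk score in [0, 1]

# Claim probability rises sharply with risk; amounts are skewed and positive
has_claim = rng.random(n) < risk ** 3
amount = np.where(has_claim, rng.gamma(shape=2.0, scale=500.0, size=n), 0.0)

# Straight-line fit through zero-inflated data
X = np.column_stack([np.ones(n), risk])
beta, *_ = np.linalg.lstsq(X, amount, rcond=None)
pred = X @ beta

print("minimum actual amount:   ", amount.min())
print("minimum predicted amount:", round(float(pred.min()), 2))
print("share of negative predictions:", round(float((pred < 0).mean()), 3))
```

No observed amount is below zero, yet the straight line has to dip under zero for the low-risk rows in order to reach the large values at the high-risk end.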

This is the reason we will try the Tweedie distribution.

The Tweedie distribution is a special family of probability distributions that can handle a mixture of a discrete mass at zero and a continuous, right-skewed distribution for positive values. It acts as a hybrid between a Poisson distribution (which counts events) and a Gamma distribution (which measures continuous positive amounts).
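For power values between 1 and 2, the Tweedie distribution can be read as a compound Poisson-Gamma: a Poisson number of claims per policy, each with a Gamma-distributed size, summed up. The simulation below is a sketch of that reading; the claim rate and Gamma parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_policies = 100_000
claim_rate = 0.1                 # hypothetical expected claims per policy
shape, scale = 2.0, 1_000.0      # hypothetical Gamma severity parameters

# Number of claims per policy, then the sum of that many Gamma draws.
# A sum of k independent Gamma(shape, scale) draws is Gamma(k * shape, scale).
n_claims = rng.poisson(claim_rate, size=n_policies)
total = np.zeros(n_policies)
positive = n_claims > 0
total[positive] = rng.gamma(shape * n_claims[positive], scale)

share_zero = float((total == 0).mean())
print("share of exact zeros:", round(share_zero, 3))
print("theoretical P(N = 0):", round(float(np.exp(-claim_rate)), 3))
print("mean total claim:", round(float(total.mean()), 1))
```

The exact zeros come from policies with no claims at all, and the long right tail comes from the Gamma-distributed claim sizes.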

Think of Tweedie regression like measuring rainfall in a desert. On most days, the rainfall is exactly zero. But when it does rain, it pours, and you get a continuous, highly variable amount of water. You cannot model desert rainfall using a simple, symmetrical bell curve.

When should you use it?

Tweedie regression is a go-to choice for data characterized by:

Strictly non-negative values (you cannot have less than $0 in sales or claims).
A massive spike at exactly zero (zero-inflation).
A highly skewed distribution for the non-zero values.
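A quick way to check these three traits before modeling is to summarize the target directly. The helper below is a hypothetical convenience function written for this article, not part of any library.

```python
import numpy as np

def target_profile(y):
    """Summarize the three traits listed above for a numeric target."""
    y = np.asarray(y, dtype=float)
    positive = y[y > 0]
    # Sample skewness of the positive part (0 would mean symmetric)
    if positive.size > 2 and positive.std() > 0:
        z = (positive - positive.mean()) / positive.std()
        skew = float(np.mean(z ** 3))
    else:
        skew = float("nan")
    return {
        "min": float(y.min()),
        "share_zero": float((y == 0).mean()),
        "skew_positive": skew,
    }

# Hypothetical claim amounts: mostly zeros, a few large values
example = [0, 0, 0, 0, 0, 0, 0, 120, 450, 9_800]
print(target_profile(example))
```

On the dataframe used in this article, the call would be `target_profile(df['ClaimAmount'])`. A minimum of zero, a large share of zeros, and a clearly positive skew all point toward a Tweedie-style model.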

If you are working in insurance tech, credit risk, or modeling customer lifetime value where many users never convert, Tweedie may well be your friend.

# OHE categorical variables
ohe = OneHotEncoder(drop_last=True)
df_ohe = ohe.fit_transform(df.drop(columns=['OLS_pred', 'OLS_Int_Term_pred']))

# X and y
X = df_ohe.drop(columns=['ClaimAmount'])
y = df_ohe['ClaimAmount']

# More granular search for tweedie_powers around 1.75-1.8
tweedie_powers = np.arange(1.7, 1.8, 0.01)  # From 1.7 to 1.8 with step 0.01
alpha_values = [0.1, 0.5, 1.0]  # Test different regularization strengths

results = []

print("Searching for best Tweedie power and alpha...")
for power in tweedie_powers:
    for alpha in alpha_values:
        sklearn_tweedie_model = TweedieRegressor(
            power=power,
            link='auto',
            solver='newton-cholesky',
            max_iter=500,
            alpha=alpha  # Add regularization
        )
        try:
            # Fit the model, weighting each row by Exposure
            # (sample_weight is an observation weight, not an offset)
            sklearn_tweedie_model.fit(X, y, sample_weight=df['Exposure'])
            mae = mean_absolute_error(y, sklearn_tweedie_model.predict(X))
            results.append((power, alpha, mae))
        except Exception:
            # Catch potential convergence issues for some parameter combinations
            results.append((power, alpha, float('inf')))  # Infinite MAE marks a failed fit

# Pick the combination with the lowest MAE
best_result = min(results, key=lambda x: x[2])
best_power, best_alpha, best_mae = best_result

print(f"Best Tweedie Power: {best_power:.2f}, Best Alpha: {best_alpha:.1f}, Best MAE: {best_mae:.2f}")

# Now, fit the best model
sklearn_tweedie_model = TweedieRegressor(
    power=best_power,
    link='auto',
    solver='newton-cholesky',
    max_iter=500,
    alpha=best_alpha
)
sklearn_tweedie_model.fit(X, y, sample_weight=df['Exposure'])

print("Sklearn Tweedie Regressor fitted successfully with optimized parameters!")
print(results)

Searching for best Tweedie power and alpha...
Best Tweedie Power: 1.76, Best Alpha: 1.0, Best MAE: 110.31
Sklearn Tweedie Regressor fitted successfully with optimized parameters!
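One detail worth spelling out: `sample_weight` weights each row's contribution to the loss, and it is not an offset. A common pattern for exposure-based data, used in scikit-learn's own Tweedie example on insurance claims, is to model the pure premium (claim amount divided by exposure) and pass exposure as the weight. The sketch below shows that pattern on synthetic data. It is a variation on the fit above, not the code that produced the reported results, and the data-generating numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.default_rng(0)
n = 5_000
exposure = rng.uniform(0.1, 1.0, size=n)      # fraction of the year insured
risk = rng.uniform(0, 1, size=n)              # hypothetical risk feature

# Claims arrive at a rate proportional to exposure; sizes are Gamma-distributed
n_claims = rng.poisson(0.2 * np.exp(risk) * exposure)
claim_amount = np.zeros(n)
pos = n_claims > 0
claim_amount[pos] = rng.gamma(2.0 * n_claims[pos], 500.0)

# Target is the amount per unit of exposure; exposure is the weight
pure_premium = claim_amount / exposure
X_demo = risk.reshape(-1, 1)

model = TweedieRegressor(power=1.5, alpha=0.0, link="log", max_iter=1_000)
model.fit(X_demo, pure_premium, sample_weight=exposure)

# Expected claim amount per policy = predicted pure premium * exposure
expected_amount = model.predict(X_demo) * exposure
print("total observed: ", round(float(claim_amount.sum())))
print("total predicted:", round(float(expected_amount.sum())))
```

With a log link every prediction is strictly positive, so the model cannot produce a negative claim amount.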

Let's predict with this model, and then check what the Q-Q plot looks like.

# Make predictions using the optimized model
df['Tweedie_pred'] = sklearn_tweedie_model.predict(X)

This is a comparison of the actuals vs. predictions of this last model.

Tweedie Model predictions vs. actuals. Image by the author.

Notice how the errors are still very present, especially for ClaimNb == 1. The Tweedie model's residuals still deviate from normality, but that is expected: this model assumes a Tweedie distribution, not a normal one. The real improvement can be seen in the MAE metric, which is about 35% lower.

Comparison of the Models

cols = ['ClaimNb', 'ClaimAmount', 'OLS_pred', 'OLS_Int_Term_pred', 'Tweedie_pred']
df[cols].sample(10).round()

Comparing the models. Image by the author.

We can see that the predictions for zero claims are closer to zero in the Tweedie Regression. That is much better than the OLS estimates, which stay close to the average across all ClaimNb values. The Tweedie regressor accounts for the zero inflation: it won't predict exactly zero, but it gives a smaller value.

from sklearn.metrics import mean_absolute_error

# Calculate MAE for OLS
mae_ols = mean_absolute_error(df['ClaimAmount'], df['OLS_pred'])
print(f"MAE for OLS Model: {mae_ols:.2f}")

# Calculate MAE for OLS with Interaction Terms
mae_ols_int = mean_absolute_error(df['ClaimAmount'], df['OLS_Int_Term_pred'])
print(f"MAE for OLS Model w/ Interaction Terms: {mae_ols_int:.2f}")

# Calculate MAE for the sklearn Tweedie model
mae_tweedie = mean_absolute_error(df['ClaimAmount'], df['Tweedie_pred'])
print(f"MAE for Tweedie Model: {mae_tweedie:.2f}")

MAE for OLS Model: 174.17
MAE for OLS Model w/ Interaction Terms: 172.24
MAE for Tweedie Model: 111.97

The Decision Matrix: Choosing Your Path

When you are staring at your dataset, trying to decide which path to take, use this table to guide your decision:

Model	Target Variable Type	Key Assumption / Strength	Best Used For…
Regular OLS	Continuous, symmetric, bell-shaped	Relationships are linear and independent.	Stable physical measurements, basic forecasting.
OLS with Interactions	Continuous, symmetric	The effect of feature A changes based on feature B.	Marketing attribution, real estate pricing, economics.
Tweedie Regression	Highly skewed, non-negative, mostly zeroes	Handles a massive spike at zero followed by a long tail.	Insurance claims, customer lifetime value, rare event costs.

Bonus Model

But I'm still not happy with this model. Honestly, the results are not that good in practice. The Tweedie model does drastically reduce the error, but it does so by pulling both the very high values and the very low ones (the zeroes) toward the average.

One option left is to create a two-step model that handles the zero inflation explicitly:

Run a classifier that tells us whether there was a claim amount or not. It returns the probability of having a claim amount.
With the result of the first model, we can use the Tweedie distribution to model only those observations with a claim amount. Otherwise, the model will just return zero.
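The arithmetic behind the two steps is E[amount] = P(claim) x E[amount | claim]. A small sketch with made-up numbers shows how the pieces combine, including a hard cutoff that sends low probabilities to zero. The arrays below are hypothetical, not model output.

```python
import numpy as np

# Hypothetical outputs of the two models for five policies
prob_has_claim = np.array([0.02, 0.10, 0.45, 0.60, 0.95])
severity_if_claim = np.array([900.0, 1_200.0, 1_500.0, 1_100.0, 2_000.0])

# Soft combination: the expected claim amount for every policy
expected = prob_has_claim * severity_if_claim

# Hard cutoff: probabilities under the threshold are treated as "no claim"
threshold = 0.5
gated_prob = np.where(prob_has_claim < threshold, 0.0, prob_has_claim)
gated = gated_prob * severity_if_claim

print("expected value:", expected)
print("with cutoff:   ", gated)
```

The cutoff is what lets the combined model output exact zeros, which a single Tweedie regression cannot do.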

Let's code that.

We begin by preparing our dataset with One Hot Encoding to get rid of the categories.

import lightgbm as lgb
from sklearn.metrics import classification_report

# Keep only the original columns, dropping earlier predictions from df
cols = ['ClaimNb', 'Exposure', 'Area', 'VehPower', 'VehAge', 'DrivAge',
       'BonusMalus', 'VehBrand', 'VehGas', 'Density', 'Region', 'ClaimAmount']
df2 = df[cols].copy()

# 1. Create the binary outcome variable for having a claim
df2['has_claim'] = (df2['ClaimAmount'] > 0).astype(int)

# Prepare the data for the "zero claim" classification
# Note: X keeps ClaimNb, which is only known once claims have been filed
df_ohe = ohe.fit_transform(df2)
X = df_ohe.drop(columns=['ClaimAmount', 'has_claim'])
y = df_ohe['has_claim']

Next comes a longer block of code that does the following:

Splits the data into train and validation sets for LightGBM.
Calculates weights to handle class imbalance.
Sets the tuned LightGBM model parameters.
Trains the LightGBM model with early stopping.

from sklearn.model_selection import train_test_split

print("Tuning LightGBM for P(Claim > 0) with updated parameters...")

# Split the data into training and validation sets for early stopping
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

train_data_split = lgb.Dataset(X_train, label=y_train)
val_data_split = lgb.Dataset(X_val, label=y_val)

# Calculate scale_pos_weight for better handling of class imbalance
scale_pos_weight_value = (y_train == 0).sum() / (y_train == 1).sum()

# Define updated parameters for better control and to mitigate the warning
params_tuned = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,      # Reduced learning rate
    'num_leaves': 31,           # Leaves per tree (31 is the LightGBM default)
    'max_depth': 7,             # Cap the tree depth at 7
    'min_child_samples': 15,    # Slightly reduced minimum data in leaf
    'min_gain_to_split': 0.01,  # Minimum gain required to make a split
    'force_col_wise': True,     # Keep for dataset compatibility
    'scale_pos_weight': scale_pos_weight_value,  # Explicitly handle imbalance
    'verbose': -1               # Suppress verbose output during training
}

# Re-train the LightGBM model with tuned parameters
# More boosting rounds compensate for the reduced learning rate
model_tuned = lgb.train(
    params_tuned,
    train_data_split,
    num_boost_round=300,          # Increased number of boosting rounds
    valid_sets=[val_data_split],  # Validation set used for early stopping
    callbacks=[lgb.early_stopping(20, verbose=False)]  # Stop after 20 rounds without improvement
)
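The `scale_pos_weight` value used above is simply the ratio of negative to positive rows, so that the rare class carries about as much total weight as the common one. A tiny illustration with made-up labels:

```python
import numpy as np

# Hypothetical labels: 95 policies without a claim, 5 with one
y_example = np.array([0] * 95 + [1] * 5)

n_neg = int((y_example == 0).sum())
n_pos = int((y_example == 1).sum())
scale_pos_weight = n_neg / n_pos

print("negatives:", n_neg, "positives:", n_pos)
print("scale_pos_weight:", scale_pos_weight)
# Each positive row now counts as much as scale_pos_weight negative rows,
# so both classes contribute the same total weight: 5 * 19 == 95
```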

Predicts claim occurrence probabilities and transforms them into a binary 0 or 1 for the classification report.
Evaluates the model using the classification report.
Stores the predicted probabilities in the dataframe.
Sets probabilities below 0.5 to zero.

# Predictions and evaluation with the tuned model
y_pred = (model_tuned.predict(X) > 0.5).astype(int)
print("\nClassification Report for Tuned LightGBM Model:")
print(classification_report(y, y_pred))

# Store the probabilities for the next step of the Zero-Inflated model
df2['prob_has_claim'] = model_tuned.predict(X)

# If prob_has_claim is under 0.5, make it 0
df2.loc[df2['prob_has_claim'] < 0.5, 'prob_has_claim'] = 0

Now, part 2 of the model keeps only the claims with an associated dollar amount and uses the Tweedie Regression to estimate it.

# Part 2: Model the severity (ClaimAmount > 0) using Tweedie Regression
# Filter data for positive claims only
df_positive = df_ohe[df_ohe['has_claim'] == 1].copy()
X_positive = df_positive.drop(columns=['ClaimAmount', 'has_claim'])
y_positive = df_positive['ClaimAmount']

# Weight each row by Exposure (an observation weight, not an offset)
print("\nFitting Tweedie Model E[ClaimAmount | Claim > 0]...")
sklearn_tweedie_model = TweedieRegressor(
    power=1.85,
    link='auto',
    solver='newton-cholesky',
    max_iter=500,
    alpha=1
)
sklearn_tweedie_model.fit(X_positive, y_positive, sample_weight=df_positive['Exposure'])

Finally, the prediction of this model is the probability estimated by the first model multiplied by the amount estimated by the Tweedie Regressor.

# Predict conditional severity for all observations (even those with 0 claims)
# This assumes the severity model learned from positive claims can generalize.
df['pred_severity_if_claim'] = sklearn_tweedie_model.predict(df_ohe.drop(columns=['ClaimAmount', 'has_claim']))

# Calculate the final Zero-Inflated Model prediction
df['ZeroInflated_pred'] = df2['prob_has_claim'] * df['pred_severity_if_claim']

# Display a sample of predictions and the MAE
print("\nZero-Inflated Model Predictions (first 5 rows):")
print(df[['ClaimAmount', 'ZeroInflated_pred']].sample(5))

mae_zero_inflated = mean_absolute_error(df['ClaimAmount'], df['ZeroInflated_pred'])
print(f"\nMAE for Zero-Inflated Model: {mae_zero_inflated:.2f}")

This is the result.

Fitting Tweedie Model E[ClaimAmount | Claim > 0]...

Zero-Inflated Model Predictions (first 5 rows):
         ClaimAmount  ZeroInflated_pred
IDpol                                  
1179635          0.0           0.000000
108738           0.0           0.000000
2066923          0.0           0.000000
1007186          0.0           0.000000
1117109       1128.0         729.428288

MAE for Zero-Inflated Model: 87.79

Great, we got another 21% improvement in MAE. That is awesome!

Zero-Inflated model results, actuals vs. predictions. Image by the author.
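For transparency, the percentage improvements quoted in this article can be reproduced from the reported MAE values with a few lines of arithmetic. The helper name below is made up for this sketch.

```python
# MAE values reported in this article
mae = {
    "OLS": 174.17,
    "OLS + interactions": 172.24,
    "Tweedie": 111.97,
    "Two-step": 87.79,
}

def relative_drop(before, after):
    """Fractional reduction in error when moving from `before` to `after`."""
    return (before - after) / before

names = list(mae)
for prev, curr in zip(names, names[1:]):
    drop = relative_drop(mae[prev], mae[curr])
    print(f"{prev} -> {curr}: {drop:.1%} lower MAE")

print(f"OLS -> Tweedie: {relative_drop(mae['OLS'], mae['Tweedie']):.1%} lower MAE")
```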

Conclusion

Data science isn't about finding a single "perfect" algorithm that solves every problem. Instead, it is about matching the mathematical assumptions of your model to the real-world behavior of your data.

Start simple. Look at the distribution of your target variable. If it looks like a bell curve, start with Regular OLS. If you know your features influence each other, introduce Interaction Terms. But if you are staring at a mountain of zeroes and a long tail of large numbers, skip the headaches of OLS entirely and leverage the power of a Tweedie Regression.

Taking a few minutes to analyze your target's distribution before modeling can save you days of tuning the wrong architecture later.