The aim of this article is to clarify that, in predictive settings, imputations should always be estimated on the training set and the resulting parameters or models saved. These should then be applied unchanged to the test, out-of-time, or application data, in order to avoid data leakage and ensure an unbiased assessment of generalization performance.
I want to thank everyone who took the time to read and engage with my article. Your support and feedback are greatly appreciated.
In practice, most real-world datasets contain missing values, making missing data one of the most common challenges in statistical modeling. If it is not handled properly, it can lead to biased coefficient estimates, reduced statistical power, and ultimately incorrect conclusions (Van Buuren, 2018). In predictive modeling, ignoring missing data by performing complete-case analysis or by excluding predictor variables with missing values can limit the applicability of the model and result in biased or suboptimal performance.
The Three Missing-Data Mechanisms
To address this issue, statisticians classify missing data into three mechanisms that describe how and why values go missing. MCAR (Missing Completely at Random) refers to cases where the missingness occurs entirely at random and is independent of both observed and unobserved variables. MAR (Missing at Random) means that the probability of missingness depends on the observed variables but not on the missing value itself. MNAR (Missing Not at Random) describes the most complex case, in which the probability of missingness depends on the unobserved value itself.
Classical Approaches to Handling Missing Data and Their Limits
Under the MAR assumption, it is possible to use the information contained in the observed variables to predict the missing values. Classical approaches based on this idea include regression-based imputation, k-nearest neighbors (kNN) imputation, and multiple imputation by chained equations (MICE). These methods are considered multivariate because they explicitly condition the imputation on the observed variables. However, they have a significant limitation: they do not handle mixed datasets (continuous + categorical) well and have difficulty capturing nonlinear relationships and complex interactions.
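To make these classical approaches concrete, here is a minimal sketch, using assumed column names and toy data, of how a kNN imputer and a MICE-style iterative imputer can be fitted on a training set and then reused unchanged on a test set with scikit-learn.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # required before IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer
# Toy numeric data with missing values (columns and values are illustrative only)
rng = np.random.default_rng(0)
train = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
test = pd.DataFrame({"x1": rng.normal(size=80), "x2": rng.normal(size=80)})
train.loc[rng.choice(200, 40, replace=False), "x1"] = np.nan
test.loc[rng.choice(80, 16, replace=False), "x1"] = np.nan
# kNN imputation: learn from the training set, then apply to the test set
knn = KNNImputer(n_neighbors=5).fit(train)
test_knn = pd.DataFrame(knn.transform(test), columns=test.columns)
# MICE-style chained-equations imputation, same fit-on-train / transform-on-test pattern
mice = IterativeImputer(max_iter=10, random_state=0).fit(train)
test_mice = pd.DataFrame(mice.transform(test), columns=test.columns)
Both imputers store what they learned from the training data, which is exactly the property we will need later for prediction settings.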
The Rise of MissForest, Implemented in R
It is to overcome these limitations that MissForest (Stekhoven & Bühlmann, 2012) has established itself as a benchmark method. Based on random forests, MissForest can capture nonlinear relationships and complex interactions between variables, often outperforming traditional imputation techniques. However, when working on a project that required a generalizable modeling process, with a proper train/test split and out-of-time validation, we encountered a significant limitation. The R implementation of the missForest package does not store the imputation model parameters once fitted.
A Critical Limitation of MissForest in Prediction Settings
This creates a practical problem: it is impossible to train the imputation model on the training set and then apply the exact same parameters to the test set. This limitation introduces a risk of data leakage during model evaluation, or a degradation in the quality and consistency of imputations.
Existing Solutions and Their Risks
While searching for an alternative solution that would allow consistent imputation in a predictive modeling setting, we asked ourselves a simple but critical question:
How can we impute the test data in a way that remains fully consistent with the imputations learned on the training data?
Exploring this question led us to a discussion on CrossValidated, where another user was facing the exact same issue and asked:
“How to use missForest in R for test data imputation?”
Two main solutions have been suggested to overcome this limitation. The first consists of merging the training and test data before running the imputation. This approach often improves the quality of the imputations because the algorithm has more data to learn from, but it introduces data leakage, since the test set influences the imputation model. The second approach imputes the test set separately from the training set, which prevents data leakage but forces the algorithm to build an entirely new imputation model using only the test data, which is often much smaller. This can lead to less stable imputations and a potential drop in predictive performance.
Even the well-known tutorial by Liam Morgan arrives at a similar workaround. His proposed solution involves imputing the training set, fitting a predictive model, then combining the training and test data for a final imputation step:
# 1) Impute the training set
imp_train_X <- missForest(train_X)$ximp
# 2) Build the predictive model
rf <- randomForest(x = imp_train_X, y = train$creditability)
# 3) Combine train and test, then re-impute
train_test_X <- rbind(test_X, imp_train_X)
imp_test_X <- missForest(train_test_X)$ximp[1:nrow(test_X), ]
Although this approach may improve imputation quality, it suffers from the same weakness as Method 1: the test data indirectly participate in the learning process, which can inflate model performance metrics and create an overly optimistic estimate of generalization.
These examples highlight a fundamental dilemma:
- How do we impute missing values without biasing model evaluation?
- How do we ensure that the imputations applied to the test set are consistent with those learned on the training set?
Research Questions and Motivation
These questions motivated our exploration of a more robust solution that preserves generalization, avoids data leakage, and produces stable imputations suitable for predictive modeling pipelines.
This article is organized into four main sections:
- Section 1 introduces the process of identifying and characterizing missing values, including how to detect, quantify, and describe them.
- Section 2 discusses the MCAR (Missing Completely at Random) mechanism and presents methods for handling missing data under this assumption.
- Section 3 focuses on the MAR (Missing at Random) mechanism, outlining appropriate imputation techniques and addressing the critical question: Why does the MissForest implementation in R fail in prediction settings?
- Section 4 examines the MNAR (Missing Not at Random) mechanism and explores strategies for dealing with missing data when the mechanism depends on the unobserved values themselves.
1. Identification and Characterization of Missing Values
This step is essential and should be carried out in close collaboration with all stakeholders: model developers, domain experts, and future users of the model. The goal is to identify all missing values and mark them.
In Python, and particularly when using libraries such as Pandas, NumPy, and Scikit-Learn, missing values are represented as NaN. Values marked as NaN are ignored by many operations such as sum() and count(). You can mark missing values using the replace() function on the relevant subset of columns in a Pandas DataFrame.
Once the missing values have been marked, the next step is to evaluate their distribution for each variable. The isnull() function can be used to identify all NaN values as True, and combined with sum() to count the number of missing values per column.
Understanding the distribution of missing values is crucial. With this information, stakeholders can assess whether the patterns of missingness are reasonable. It also allows you to define acceptable thresholds of missingness depending on the nature of each variable. For instance, you might decide that up to 10% missing values is acceptable for continuous variables, while the threshold for categorical variables should remain at 0%.
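As a minimal illustration of this step, the snippet below marks a hypothetical sentinel code (-999) as NaN with replace(), counts missing values with isnull().sum(), and compares missing-value rates against illustrative thresholds; the column names, sentinel value, and thresholds are assumptions made for the example.
import numpy as np
import pandas as pd
# Toy data where -999 encodes a missing measurement (assumption for the example)
df_check = pd.DataFrame({
    "income": [2500, -999, 3100, 4000, -999, 2800],
    "region": ["N", "S", None, "E", "W", "S"],
})
# Mark the sentinel code as a true missing value
df_check["income"] = df_check["income"].replace(-999, np.nan)
# Count and summarize missingness per column
print(df_check.isnull().sum())
missing_rates = df_check.isnull().mean()
# Illustrative thresholds: 10% for the continuous column, 0% for the categorical one
thresholds = {"income": 0.10, "region": 0.0}
for col, limit in thresholds.items():
    status = "acceptable" if missing_rates[col] <= limit else "above threshold"
    print(f"{col}: {missing_rates[col]:.1%} missing ({status})")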
After selecting the relevant variables for modeling, including those containing missing values when they are important for prediction, it is essential to split the dataset into three samples:
- Training set to estimate parameters and train the models,
- Test set to evaluate model performance on unseen data,
- Out-of-Time (OOT) set to validate the temporal robustness of the model.
This split should be performed so as to preserve the statistical representativeness of each subsample, for example by using stratified sampling if the target variable is imbalanced.
The analysis of missing values should then be carried out only on the training set:
- Identify their mechanism (MCAR, MAR, MNAR) using statistical tests,
- Select the appropriate imputation method,
- Train the imputation models on the training set.
The imputation parameters and models obtained in this step must then be applied as-is to the test set and to the Out-of-Time set. This is essential to avoid data leakage and to ensure a correct evaluation of the model's generalization performance.
In the next section, we will examine the MCAR mechanism in detail and present the imputation methods that are best suited to this type of missing data.
2. Understanding MCAR and Choosing the Right Imputation Methods
In simple terms, MCAR (Missing Completely at Random) describes a situation where the fact that a value is missing is entirely unrelated to either the value itself or any other variables in the dataset. In mathematical terms, this means that the probability of a data point being missing depends neither on the variable's value nor on the values of any other variables: the missingness is completely random.
Before formally defining the MCAR mechanism, let us introduce the notation that will be used in this section and throughout the article:
- Consider an independent and identically distributed sample of n observations:
yi = (yi1, ..., yip)T, i = 1, 2, ..., n
where p is the number of variables with missing values and n is the sample size.
- Y ∈ R^(n×p) represents the variables that may contain missing values. This is the set on which we wish to perform imputation.
- We denote the observed and missing entries of Y as Yo and Ym, respectively.
- X ∈ R^(n×q) represents the fully observed variables, meaning they contain no missing values.
- To indicate which components of yi are observed or missing, we define the indicator vector:
ri = (ri1, ..., rip)T, i = 1, 2, ..., n
with rik = 1 if yik is observed, and 0 otherwise.
- Stacking these vectors yields the complete matrix of presence/absence indicators:
R = (r1, ..., rn)T
The MCAR assumption is then defined as:
Pr(R | Ym, Yo, X) = Pr(R). (1)
This means that the missingness indicators are completely independent of both the missing data, Ym, and the observed data, Yo. Note that here R is also independent of the covariates X. Before presenting methods for handling missing values under the MCAR assumption, we will first introduce a few simple ways to assess whether the MCAR assumption is likely to hold.
2.1 Assessing the MCAR Assumption
In this section, we will simulate a dataset with 10,000 observations and four variables under the MCAR assumption:
- One continuous variable containing 20% missing values and one categorical variable with two levels (0 and 1) containing 10% missing values.
- One continuous variable and one categorical variable that are fully observed, with no missing values.
- Finally, a binary target variable named target, taking values 0 and 1.
import numpy as np
import pandas as pd

# --- Reproducibility ---
np.random.seed(42)

# --- Parameters ---
n = 10000

# --- Utility Functions ---
def generate_continuous(mean, std, size, missing_rate=0.0):
    """Generate a continuous variable with optional MCAR missingness."""
    values = np.random.normal(loc=mean, scale=std, size=size)
    if missing_rate > 0:
        mask = np.random.rand(size) < missing_rate
        values[mask] = np.nan
    return values

def generate_categorical(levels, probs, size, missing_rate=0.0):
    """Generate a categorical variable with optional MCAR missingness."""
    values = np.random.choice(levels, size=size, p=probs).astype(float)
    if missing_rate > 0:
        mask = np.random.rand(size) < missing_rate
        values[mask] = np.nan
    return values

# --- Variable Generation ---
variables = {
    "cont_mcar": generate_continuous(mean=100, std=20, size=n, missing_rate=0.20),
    "cat_mcar": generate_categorical(levels=[0, 1], probs=[0.7, 0.3], size=n, missing_rate=0.10),
    "cont_full": generate_continuous(mean=50, std=10, size=n),
    "cat_full": generate_categorical(levels=[0, 1], probs=[0.6, 0.4], size=n),
    "target": np.random.choice([0, 1], size=n, p=[0.5, 0.5])
}

# --- Build DataFrame ---
df = pd.DataFrame(variables)

# --- Display Summary ---
print(df.head())
print("\nMissing value counts:")
print(df.isnull().sum())
Before performing any analysis, it is essential to split the dataset into two parts: a training set and a test set.
2.1.1 Preparing Train and Test Data for Assessing MCAR
It is essential to split the dataset into training and test sets while ensuring representativeness. This guarantees that both the model and the imputation methods are learned solely on the training set and then evaluated on the test set. Doing so prevents data leakage and provides an unbiased estimate of the model's ability to generalize to unseen data.
from sklearn.model_selection import train_test_split
import pandas as pd

def stratified_split(df, strat_vars, test_size=0.3, random_state=None):
    """
    Split a DataFrame into train and test sets with stratification
    based on one or several variables.

    Parameters
    ----------
    df : pandas.DataFrame
        The input dataset.
    strat_vars : list or str
        Column name(s) used for stratification.
    test_size : float, default=0.3
        Proportion of the dataset to include in the test split.
    random_state : int, optional
        Random seed for reproducibility.

    Returns
    -------
    train_df : pandas.DataFrame
        Training set.
    test_df : pandas.DataFrame
        Test set.
    """
    # Ensure strat_vars is a list
    if isinstance(strat_vars, str):
        strat_vars = [strat_vars]
    # Create a combined stratification key
    strat_key = df[strat_vars].astype(str).fillna("MISSING").agg("_".join, axis=1)
    # Perform the stratified split
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        stratify=strat_key,
        random_state=random_state
    )
    return train_df, test_df

# --- Application ---
# Stratify on cat_mcar, cat_full, and target
train_df, test_df = stratified_split(df, strat_vars=["cat_mcar", "cat_full", "target"], test_size=0.3, random_state=42)
print(f"Train size: {train_df.shape[0]} ({len(train_df)/len(df):.1%})")
print(f"Test size: {test_df.shape[0]} ({len(test_df)/len(df):.1%})")
2.1.2 Assessing the MCAR Assumption for Continuous Variables with Missing Values
The first step is to create a binary indicator R (where 1 indicates an observed value and 0 indicates a missing value) and compare the distributions of the other variables across the two groups (observed vs. missing).
Let us illustrate this process using the variable cont_mcar as an example. We will compare the distribution of cont_full between observations where cont_mcar is missing and where it is observed, using both a boxplot and a Kolmogorov–Smirnov test. We will then perform a similar analysis for the categorical variable cat_full, comparing proportions across the two groups with a bar plot and a chi-squared test.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp

# --- Step 1: Train/Test Split with Stratification ---
train_df, test_df = stratified_split(
    df,
    strat_vars=["cat_mcar", "cat_full", "target"],
    test_size=0.3,
    random_state=42
)

# --- Step 2: Create the R indicator on the training set ---
train_df = train_df.copy()
train_df["R_cont_mcar"] = np.where(train_df["cont_mcar"].isnull(), 0, 1)

# --- Step 3: Prepare the data for comparison ---
df_obs = pd.DataFrame({
    "cont_full": train_df.loc[train_df["R_cont_mcar"] == 1, "cont_full"],
    "Group": "Observed (R=1)"
})
df_miss = pd.DataFrame({
    "cont_full": train_df.loc[train_df["R_cont_mcar"] == 0, "cont_full"],
    "Group": "Missing (R=0)"
})
df_all = pd.concat([df_obs, df_miss])

# --- Step 4: KS test before plotting ---
stat, p_value = ks_2samp(
    train_df.loc[train_df["R_cont_mcar"] == 1, "cont_full"],
    train_df.loc[train_df["R_cont_mcar"] == 0, "cont_full"]
)

# --- Step 5: Visualization with the KS result ---
plt.figure(figsize=(8, 6))
sns.boxplot(
    x="Group",
    y="cont_full",
    data=df_all,
    palette="Set2",
    width=0.6,
    fliersize=3
)

# Add red diamonds for the group means
means = df_all.groupby("Group")["cont_full"].mean()
for i, m in enumerate(means):
    plt.scatter(i, m, color="red", marker="D", s=50, zorder=3, label="Mean" if i == 0 else "")

# Title and KS test result
plt.title("Distribution of cont_full by Missingness of cont_mcar (Train Set)",
          fontsize=14, weight="bold")

# Add the KS result as a text box
textstr = f"KS Statistic = {stat:.3f}\nP-value = {p_value:.3f}"
plt.gca().text(
    0.05, 0.95, textstr,
    transform=plt.gca().transAxes,
    fontsize=10,
    verticalalignment='top',
    bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8)
)
plt.ylabel("cont_full", fontsize=12)
plt.xlabel("")
sns.despine()
plt.legend()
plt.show()

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

# --- Step 1: Build the contingency table on the TRAIN set ---
contingency_table = pd.crosstab(train_df["R_cont_mcar"], train_df["cat_full"])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# --- Step 2: Compute proportions within each group (row-wise) ---
props = contingency_table.div(contingency_table.sum(axis=1), axis=0)

# Reshape for plotting: Group (R) on the x-axis, Category as hue
df_props = props.reset_index().melt(
    id_vars="R_cont_mcar",
    var_name="Category",
    value_name="Proportion"
)

# Map R values to clear labels
df_props["Group"] = df_props["R_cont_mcar"].map({1: "Observed (R=1)", 0: "Missing (R=0)"})

# --- Plot: Group on the x-axis, bars show proportions of each category ---
sns.set_theme(style="whitegrid")
plt.figure(figsize=(8, 6))
sns.barplot(
    x="Group", y="Proportion", hue="Category",
    data=df_props, palette="Set2"
)

# Title and chi-squared result
plt.title("Proportion of cat_full by Observed/Missing Status of cont_mcar (Train Set)",
          fontsize=14, weight="bold")

# Add the chi-squared result as a text box
textstr = f"Chi² = {chi2:.3f}, p = {p_value:.3f}"
plt.gca().text(
    0.05, 0.95, textstr,
    transform=plt.gca().transAxes,
    fontsize=10,
    verticalalignment='top',
    bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8)
)
plt.xlabel("Observed / Missing Group (R)")
plt.ylabel("Proportion")
plt.legend(title="cat_full Category")
sns.despine()
plt.show()

The two figures above show that, under the MCAR assumption, the distributions of the other variables (here cont_full and cat_full) remain unchanged regardless of the value of R (1 = observed, 0 = missing). These results are further supported by the Kolmogorov–Smirnov and chi-squared tests, which confirm the absence of significant differences between the observed and missing groups.
For categorical variables with missing values, the same analyses can be performed as described above. While these univariate checks can be time-consuming, they are useful when the number of variables is small, as they provide a quick and intuitive first look at the missing-data mechanism. For larger datasets, however, multivariate methods should be considered.
2.1.3 Multivariate Assessment of the MCAR Assumption
To the best of my knowledge, only one multivariate statistical test is widely used to assess the MCAR assumption at the dataset level: Little's chi-squared test of MCAR, implemented for example in the mcartest command (Li, 2013). This test compares the distributions of observed variables across the different missing-data patterns and computes a global test statistic that follows a chi-squared distribution.
However, its main limitation is that it is not well suited to categorical variables, since it relies on the strong assumption that the variables are normally distributed. We now turn to the methods for imputing missing values under the MCAR assumption.
2.2 Methods for Handling Missing Data under MCAR
Under the MCAR assumption, the missingness indicators R are independent of Yo, Ym, and X. Since the data are missing completely at random, dropping incomplete observations does not introduce bias. However, this approach becomes inefficient when the proportion of missing values is high.
In such cases, simple imputation methods, which replace missing values with the mean, median, or most frequent category, are often preferred. They are easy to implement, require little computational effort, and can be maintained over time without adding complexity for modelers. While these methods do not introduce bias, they tend to underestimate variance and may distort relationships between variables.
By contrast, advanced methods such as regression-based imputation, kNN, or multiple imputation can improve statistical efficiency and help preserve information when the proportion of missing data is substantial. Their main drawback lies in their algorithmic complexity, higher computational cost, and the greater effort required to maintain them in production settings.
To impute missing values under the MCAR assumption for prediction purposes, proceed as follows:
- Learn imputation values from the training set only, using the mean for continuous variables and the most frequent category for categorical variables.
- Apply these values to replace missing data in both the training and the test sets.
- Evaluate the model on the test set, ensuring that no information from the test set was used during the imputation process.
import pandas as pd

def compute_impute_values(df, cont_vars, cat_vars):
    """
    Compute imputation values (mean for continuous, mode for categorical)
    from the training set only.
    """
    impute_values = {}
    for col in cont_vars:
        impute_values[col] = df[col].mean()
    for col in cat_vars:
        impute_values[col] = df[col].mode().iloc[0]
    return impute_values

def apply_imputation(train_df, test_df, impute_values, vars_to_impute):
    """
    Apply the learned imputation values to both train and test sets.
    """
    train_df[vars_to_impute] = train_df[vars_to_impute].fillna(value=impute_values)
    test_df[vars_to_impute] = test_df[vars_to_impute].fillna(value=impute_values)
    return train_df, test_df

# --- Example usage ---
train_df, test_df = stratified_split(
    df,
    strat_vars=["cat_mcar", "cat_full", "target"],
    test_size=0.3,
    random_state=42
)

# Variables to impute
cont_vars = ["cont_mcar"]
cat_vars = ["cat_mcar"]
vars_to_impute = cont_vars + cat_vars

# 1. Learn imputation values on TRAIN
impute_values = compute_impute_values(train_df, cont_vars, cat_vars)
print("Imputation values learned from train:", impute_values)

# 2. Apply them consistently to TRAIN and TEST
train_df, test_df = apply_imputation(train_df, test_df, impute_values, vars_to_impute)

# 3. Check
print("Remaining missing values in train:\n", train_df[vars_to_impute].isnull().sum())
print("Remaining missing values in test:\n", test_df[vars_to_impute].isnull().sum())
This section on understanding MCAR and selecting the appropriate imputation method provides a clear foundation for approaching similar techniques under the MAR assumption.
3. Understanding MAR and Choosing the Right Imputation Methods
The MAR assumption is defined as:
Pr(R | Ym, Yo, X) = Pr(R | Yo, X) (2)
In other words, the distribution of the missingness indicators depends only on the observed data. Even in the case where R depends only on the covariates X,
Pr(R | Ym, Yo, X) = Pr(R | X) (3)
this still falls under the MAR assumption.
3.1 Assessing the MAR Assumption for Variables with Missing Values
Under the MAR assumption, the missingness indicators R depend only on the observed variables Yo and X, but not on the missing data Ym.
To indirectly assess the plausibility of this assumption, common statistical tests (Student's t-test, Kolmogorov–Smirnov, chi-squared, etc.) can be applied by comparing the distributions of observed variables between groups with and without missing values.
For a multivariate assessment, one may also use the mcartest procedure (Li, 2013), which extends Little's MCAR test to evaluate assumption (3), namely Pr(R | Ym, Yo, X) = Pr(R | X), under the assumption of multivariate normality of the variables.
If this test is not rejected, the missing-data mechanism can reasonably be considered MAR (assumption 3) given the auxiliary variables X.
We can now turn to the question of how to impute this type of missing data.
3.2 Methods for Handling Missing Data under MAR
Under the MAR assumption, the probability of missingness R depends only on the observed variables Yo and the covariates X. In this setting, variables Yk with missing values can be explained using the other available variables Yo and X, which motivates the use of advanced imputation methods based on supervised learning.
These approaches involve building a predictive model in which the incomplete variable Yk serves as the target, and the other observed variables Yo and X act as predictors. The model is trained on the complete cases ([Yk]o of Y) and then applied to estimate the missing values [Yk]m of Yk.
The most commonly used imputation methods in the literature include:
- k-nearest neighbors (KNNimpute, Troyanskaya et al., 2001), mainly applied to continuous data;
- the saturated multinomial model (Schafer, 1997), designed for categorical data;
- multivariate imputation by chained equations (MICE, Van Buuren & Oudshoorn, 1999), suitable for mixed datasets but dependent on tuning parameters and the specification of a parametric model.
All of these approaches rely on assumptions about the underlying data distribution or on the ability of the chosen model to adequately capture relationships between variables.
More recently, MissForest (Stekhoven & Bühlmann, 2012) has emerged as a nonparametric alternative based on random forests, well suited to mixed data types and robust to both interactions and nonlinear relationships.
The MissForest algorithm relies on random forests (RF) to impute missing values. The authors propose the following procedure:

Source: [2] Stekhoven et al. (2012)
As outlined, the MissForest algorithm cannot be used directly for prediction purposes. For each variable, between steps 6 and 7, the random forest model Ms used to predict ymis(s) from xmis(s) is not saved. Consequently, it is neither possible nor desirable for practitioners to rely on MissForest as a predictive model in production.
The absence of saved models Ms or imputation parameters (here, estimated on the training set) makes it difficult to evaluate generalization performance on new data. Although some have tried to work around this issue by following Liam Morgan's approach, the problem remains unresolved.
Furthermore, this limitation increases algorithmic complexity and computational cost, since the entire algorithm must be rerun from scratch for each new dataset (for instance, when working with separate training and test sets).
What should be done? Should the MissForest algorithm still be used?
If the goal is to develop a model for classification or analysis only on the available dataset, with no intention of applying it to new data, then MissForest is strongly recommended, as it offers high accuracy and robustness.
However, if the goal is to build a predictive model that will be applied to new datasets, MissForest should be avoided for the reasons discussed above. In such cases, it is preferable to use an algorithm that explicitly stores the imputation models or the parameters estimated from the training set.
Fortunately, an adapted version now exists: MissForestPredict, available since 2024 in both R and Python and specifically designed for predictive tasks. For further details, we refer the reader to Albu, Elena, et al. (2024).
Using the MissForestPredict algorithm for prediction consists of applying the standard MissForest procedure to the training data. Unlike the original MissForest, however, this version returns and stores the individual models Ms associated with each variable, which makes it possible to reuse them for imputing missing values in new datasets.

Source: [4] Albu et al. (2024).
The algorithm below illustrates how to apply MissForestPredict to new observations, whether they come from the test set, an out-of-time sample, or an application dataset.

Source: [4] Albu et al. (2024).
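Since pseudocode alone can be abstract, here is a hedged sketch of the same store-and-reuse principle using scikit-learn's IterativeImputer with random-forest base learners; this is not the missForestPredict API, only a stand-in that fits the imputation models on the training set and reapplies them to new data (the column names reuse the simulated dataset from Section 2).
from sklearn.experimental import enable_iterative_imputer  # required before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

features = ["cont_mcar", "cat_mcar", "cont_full", "cat_full"]
# Iterative imputation with random-forest learners (a MissForest-like scheme)
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    max_iter=10,
    random_state=42
)
# Fit the imputation models on the TRAINING set only
rf_imputer.fit(train_df[features])
# Reuse the fitted models, unchanged, on the test or out-of-time data
test_imputed = rf_imputer.transform(test_df[features])
Unlike missForestPredict, this stand-in treats every column as continuous, so categorical variables would need encoding or rounding; it is shown only to illustrate the principle of storing the imputation models learned on the training set.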
We now have all the elements needed to address the issues raised in the introduction. Let us turn to the final mechanism, MNAR, before moving on to the conclusion.
4. Understanding MNAR
Missing Not At Random (MNAR) occurs when the missing-data mechanism depends directly on the unobserved values themselves. In other words, if a variable Y contains missing values, then the indicator variable R (with R = 1 if Y is observed and R = 0 otherwise) depends on the missing component Ym.
There is no universal statistical method for handling this type of mechanism, since the information needed to model the dependency is precisely what is missing. In such cases, the recommended approach is to rely on domain expertise to understand the reasons behind the nonresponse and to define context-specific strategies for analyzing and addressing the missing values.
It is important to emphasize, however, that MAR and MNAR cannot generally be distinguished empirically based on the observed data alone.
Conclusion
The objective of this article was to show how to impute missing values for predictive purposes without biasing the evaluation of model performance. To this end, we presented the main mechanisms that generate missing data (MCAR, MAR, MNAR), the statistical tests used to assess their plausibility, and the imputation methods best suited to each.
Our analysis highlights that, under MCAR, simple imputation methods are generally preferable, as they provide substantial time savings without introducing bias. In practice, however, missing-data mechanisms are most often MAR. In this setting, advanced imputation approaches such as MissForest, based on machine learning models, are particularly appropriate.
Nevertheless, when the goal is to build predictive models, it is essential to use methods that store the imputation parameters or models learned from the training data and then apply them consistently to the test, out-of-time, or application datasets. This is precisely the contribution of MissForestPredict (released in 2024 and available in both R and Python), which addresses the limitation of the original MissForest (2012), a method not originally designed for predictive tasks.
Using MissForest for prediction without adaptation may therefore lead to biased results, unless corrective measures are implemented. It would be highly valuable for practitioners who have deployed MissForest in production to share the strategies they developed to overcome this limitation.
References
[1] Audigier, V., White, I. R., Jolani, S., Debray, T. P., Quartagno, M., Carpenter, J., … & Resche-Rigon, M. (2018). Multiple imputation for multilevel data with continuous and binary variables.
[2] Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118.
[3] Li, C. (2013). Little's test of missing completely at random. The Stata Journal, 13(4), 795–809.
[4] Albu, E., Gao, S., Wynants, L., & Van Calster, B. (2024). missForestPredict–Missing data imputation for prediction settings. arXiv preprint arXiv:2407.03379.
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Disclaimer
I write to learn, so mistakes are the norm, even though I try my best. Please let me know when you spot them. I also appreciate suggestions for new topics!