Information Leakage in Preprocessing, Defined: A Visible Information with Code Examples | by Samy Baladram

Allow us to see how leakage might occur in predicting a easy golf play dataset. That is the dangerous instance and shouldn’t be adopted. Only for demonstration and training functions.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
X, y = df.drop('Play', axis=1), df['Play']
# Preprocess AND apply SMOTE to ALL knowledge first (inflicting leakage)
preprocessor = ColumnTransformer(transformers=[
('temp_transform', Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
]), ['Temperature']),
('humid_transform', Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
]), ['Humidity']),
('outlook_transform', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), 
['Outlook']),
('wind_transform', Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value=False)),
('scaler', StandardScaler())
]), ['Wind'])
])
# Remodel all knowledge and apply SMOTE earlier than splitting (leakage!)
X_transformed = preprocessor.fit_transform(X)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_transformed, y)
# Break up the already remodeled and resampled knowledge
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.5, shuffle=False)
# Practice a classifier
clf = DecisionTreeClassifier(random_state=42)
clf.match(X_train, y_train)
print(f"Testing Accuracy (with leakage): {accuracy_score(y_test, clf.predict(X_test)):.2%}")

The code above is utilizing ColumnTransformer, which is a utility in scikit-learn that enables us to use completely different preprocessing steps to completely different columns in a dataset.

Right here’s a breakdown of the preprocessing technique for every column within the dataset:

Temperature:
– Imply imputation to deal with any lacking values
– Customary scaling to normalize the values (imply=0, std=1)
– Equal-width discretization into 4 bins, which means steady values are categorized into 4 equal-width intervals

Humidity:
– Identical technique as Temperature: Imply imputation → Customary scaling → Equal-width discretization (4 bins)

Outlook(categorical):
– Ordinal encoding: converts categorical values into numerical ones
– Unknown values are dealt with by setting them to -1

Wind (binary):
– Fixed imputation with False for lacking values
– Customary scaling to normalize the 0/1 values

Play (goal):
– Label encoding to transform Sure/No to 1/0
– SMOTE utilized after preprocessing to stability lessons by creating artificial examples of the minority class
– A easy choice tree is used to foretell the goal

The whole pipeline demonstrates knowledge leakage as a result of all transformations see the whole dataset throughout becoming, which might be inappropriate in an actual machine studying state of affairs the place we have to maintain check knowledge utterly separate from the coaching course of.

This strategy may also probably present artificially increased check accuracy as a result of the check knowledge traits have been used within the preprocessing steps!

Right here’s the model with out knowledge leakage:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
X, y = df.drop('Play', axis=1), df['Play']
# Break up first (earlier than any processing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)
# Create pipeline with preprocessing, SMOTE, and classifier
pipeline = Pipeline([
('preprocessor', ColumnTransformer(transformers=[
('temp_transform', Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
]), ['Temperature']),
('humid_transform', Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
]), ['Humidity']),
('outlook_transform', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), 
['Outlook']),
('wind_transform', Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value=False)),
('scaler', StandardScaler())
]), ['Wind'])
])),
('smote', SMOTE(random_state=42)),
('classifier', DecisionTreeClassifier(random_state=42))
])
# Match pipeline on coaching knowledge solely
pipeline.match(X_train, y_train)
print(f"Coaching Accuracy: {accuracy_score(y_train, pipeline.predict(X_train)):.2%}")
print(f"Testing Accuracy: {accuracy_score(y_test, pipeline.predict(X_test)):.2%}")

Key variations from the leakage model

Break up knowledge first, earlier than any processing
All transformations (preprocessing, SMOTE) are contained in the pipeline
Pipeline ensures:
– Preprocessing parameters realized solely from coaching knowledge
– SMOTE applies solely to coaching knowledge
– Check knowledge stays utterly unseen till prediction

This strategy provides extra sensible efficiency estimates because it maintains correct separation between coaching and check knowledge.

Information Leakage in Preprocessing, Defined: A Visible Information with Code Examples | by Samy Baladram | Oct, 2024

Automate doc processing with Amazon Bedrock Immediate Flows (preview)

Empower your generative AI software with a complete customized observability answer

Empower your generative AI software with a complete customized observability answer

Leave a Reply Cancel reply

Popular News

Greatest practices for Amazon SageMaker HyperPod activity governance

How Cursor Really Indexes Your Codebase

Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Growth Assist Program

The Good-Sufficient Fact | In direction of Knowledge Science

About Us

Category

Recent Posts