Univariate amputation

This example demonstrates different ways to amputate a dataset in a univariate manner.

# Author: G. Lemaitre
# License: BSD 3 clause
import sklearn
import seaborn as sns

sklearn.set_config(display="diagram")
sns.set_context("poster")

Let’s create a synthetic dataset with 10,000 samples and 10 features. We will then amputate some of the observations using different strategies.

import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, n_features=10, random_state=42)

feature_names = [f"Features #{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

Missing completely at random (MCAR)

We will show how to amputate the dataset using an MCAR strategy. To do so, we will amputate three randomly selected features.
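
The selection step itself is not shown here. A minimal sketch of one way to do it, assuming NumPy's random Generator and a seed of 42 (the seed and the helper variable `n_features_with_missing_values` are assumptions, so the features drawn may differ from those reported further down):

```python
import numpy as np

# Same naming scheme as the columns of X above.
feature_names = [f"Features #{i}" for i in range(10)]

# Assumed seed; any seed works, it only fixes which features get drawn.
rng = np.random.default_rng(42)
n_features_with_missing_values = 3

# Draw three distinct feature names (replace=False forbids duplicates).
features_with_missing_values = rng.choice(
    feature_names, size=n_features_with_missing_values, replace=False
)
```

Since `rng.choice` returns a NumPy array of strings, the `subset` parameter shows up as an array when the amputer is displayed.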

Now that we have selected the features to amputate, we can create a transformer that amputates the dataset.

from ampute import UnivariateAmputer

amputer = UnivariateAmputer(
    strategy="mcar",
    subset=features_with_missing_values,
    ratio_missingness=[0.2, 0.3, 0.4],
)
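
To make the idea of MCAR with one ratio per feature concrete, here is a minimal sketch written with NumPy and pandas only. It is not how UnivariateAmputer is implemented; it assumes each entry of a selected column is masked independently with the requested probability, and the column names and seed are made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

# Toy data: four columns, only three of which will be amputated.
X_demo = pd.DataFrame(
    rng.normal(size=(10_000, 4)), columns=["a", "b", "c", "d"]
)
ratios = {"a": 0.2, "b": 0.3, "c": 0.4}  # hypothetical per-column ratios

X_demo_missing = X_demo.copy()
for column, ratio in ratios.items():
    # MCAR: the mask is independent of every value in the dataset.
    mask = rng.random(len(X_demo)) < ratio
    X_demo_missing.loc[mask, column] = np.nan

# Observed proportion of missing values per column.
observed = X_demo_missing.isna().mean()
```

With this approach the observed proportions only approximate the requested ratios, and the untouched column `d` stays complete.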

If we want to amputate the full dataset, we can directly use the instance of UnivariateAmputer as a callable.

X_missing = amputer(X)
X_missing.head()

   Features #0  Features #1  Features #2  Features #3  Features #4  Features #5  Features #6  Features #7  Features #8  Features #9
0          NaN     0.357385    -0.503931     0.935066     0.647981    -0.050796    -1.933989     2.081684     0.041266    -0.258298
1     1.283905     1.109459    -0.908953     1.006586     0.492219     1.107295     1.243526    -0.172200     1.150359     0.147744
2    -0.966476    -0.593314     0.458020     1.032323     1.283685    -0.317640     1.499045     0.434477     0.423678     1.251380
3     2.429309    -1.306530    -1.869925     3.092164     2.028800    -0.879635          NaN    -0.101213    -1.624066          NaN
4          NaN     0.078464     0.705181     0.224765     0.618707     1.534946          NaN     2.325055     0.495505     0.538133


We can quickly check if we get the expected amount of missing values for the amputated features.

import matplotlib.pyplot as plt

ax = X_missing[features_with_missing_values].isna().mean().plot.barh()
ax.set_title("Proportion of missing values")
plt.tight_layout()
Figure: Proportion of missing values

Thus we see that we have the expected amount of missing values for the selected features.
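
To judge how close "expected" should be, a rough back-of-the-envelope calculation helps. The sketch below assumes each entry is masked independently, so the observed proportion follows a binomial distribution; this is an assumption about the amputation mechanism, and an amputer that masks an exact number of entries would deviate even less.

```python
import math

n_samples = 10_000
ratios = [0.2, 0.3, 0.4]

# Standard error of an observed proportion under independent masking:
# sqrt(p * (1 - p) / n).
standard_errors = [math.sqrt(p * (1 - p) / n_samples) for p in ratios]

for p, se in zip(ratios, standard_errors):
    print(f"ratio={p:.1f}: standard error ~ {se:.4f}")
```

Under that assumption, deviations of about half a percentage point from the requested ratios are unremarkable with 10,000 samples.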

Now, we can show how to amputate a dataset as part of a scikit-learn Pipeline.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

model = make_pipeline(
    amputer,
    StandardScaler(),
    SimpleImputer(strategy="mean"),
    LogisticRegression(),
)
model
Pipeline diagram with four steps:
UnivariateAmputer (ratio_missingness=[0.2, 0.3, 0.4]; subset: 'Features #9', 'Features #0', 'Features #6')
StandardScaler()
SimpleImputer()
LogisticRegression()
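
Note that StandardScaler comes before SimpleImputer in this pipeline. This ordering works because StandardScaler disregards NaN when fitting and leaves them in place when transforming, so the mean imputation that follows fills the gaps with a value close to 0 on the scaled feature. The toy array below is an illustrative assumption, not data from this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One feature with a single missing entry.
X_demo = np.array([[1.0], [np.nan], [3.0], [5.0]])

# Scaling statistics are computed on the observed values only;
# the NaN is passed through unchanged.
scaled = StandardScaler().fit_transform(X_demo)

# The mean of the scaled, observed values is ~0, so that is the fill value.
imputed = SimpleImputer(strategy="mean").fit_transform(scaled)
```

Imputing after scaling therefore amounts to replacing each missing entry with the (scaled) feature mean.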


Now that we have our pipeline, we can evaluate it as usual with any cross-validation tools provided by scikit-learn.

n_folds = 100
cv = ShuffleSplit(n_splits=n_folds, random_state=42)
results = pd.Series(
    cross_val_score(model, X, y, cv=cv, n_jobs=2),
    index=[f"Fold #{i}" for i in range(n_folds)],
)
ax = results.plot.hist()
ax.set_xlim([0, 1])
ax.set_xlabel("Accuracy")
ax.set_title("Cross-validation scores")
plt.tight_layout()
Figure: Cross-validation scores
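
Each of the 100 scores comes from an independent random split rather than from disjoint folds. As a small sketch of what ShuffleSplit does with its default sizes (the placeholder array and the reduced number of splits are only for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 10_000
placeholder = np.zeros((n_samples, 1))  # only the number of rows matters

# With test_size and train_size left unset, 10% of the samples are held out.
cv_demo = ShuffleSplit(n_splits=3, random_state=42)
split_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in cv_demo.split(placeholder)
]
```

Because every split is drawn independently, the test sets of different splits can overlap, which is worth keeping in mind when reading the histogram of scores.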

Total running time of the script: (0 minutes 4.324 seconds)
