{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Univariate amputation\n\nThis example demonstrates different ways to amputate a dataset in a univariate\nmanner.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: G. Lemaitre\n# License: BSD 3 clause"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import sklearn\nimport seaborn as sns\n\nsklearn.set_config(display=\"diagram\")\nsns.set_context(\"poster\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's create a synthetic dataset composed of 10 features. The idea will be\nto amputate some of the observations with different strategies.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\nfrom sklearn.datasets import make_classification\n\nX, y = make_classification(n_samples=10_000, n_features=10, random_state=42)\n\nfeature_names = [f\"Feature #{i}\" for i in range(X.shape[1])]\nX = pd.DataFrame(X, columns=feature_names)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Missing completely at random (MCAR)\nWe will show how to amputate the dataset using an MCAR strategy. Under MCAR,\nthe probability that a value is missing does not depend on the data, observed\nor not. To illustrate it, we will amputate 3 randomly selected features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\nrng = np.random.default_rng(42)\nn_features_with_missing_values = 3\nfeatures_with_missing_values = rng.choice(\n    feature_names, size=n_features_with_missing_values, replace=False\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we selected the features to be amputated, we can create a\ntransformer that can amputate the dataset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from ampute import UnivariateAmputer\n\namputer = UnivariateAmputer(\n    strategy=\"mcar\",\n    subset=features_with_missing_values,\n    ratio_missingness=[0.2, 0.3, 0.4],\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we want to amputate the full dataset, we can directly use the instance\nof :class:`~ampute.UnivariateAmputer` as a callable.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_missing = amputer(X)\nX_missing.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can quickly check if we get the expected amount of missing values for the\namputated features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\nax = X_missing[features_with_missing_values].isna().mean().plot.barh()\nax.set_title(\"Proportion of missing values\")\nplt.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Thus, we see that the proportion of missing values in each selected feature\nis close to the requested `ratio_missingness`.\n\nNow, we can show how to amputate a dataset as part of a `scikit-learn`\n:class:`~sklearn.pipeline.Pipeline`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.model_selection import ShuffleSplit\n\nmodel = make_pipeline(\n    amputer,\n    StandardScaler(),\n    SimpleImputer(strategy=\"mean\"),\n    LogisticRegression(),\n)\nmodel"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have our pipeline, we can evaluate it as usual with any of the\ncross-validation tools provided by `scikit-learn`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_folds = 100\ncv = ShuffleSplit(n_splits=n_folds, random_state=42)\nresults = pd.Series(\n    cross_val_score(model, X, y, cv=cv, n_jobs=2),\n    index=[f\"Fold #{i}\" for i in range(n_folds)],\n)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ax = results.plot.hist()\nax.set_xlim([0, 1])\nax.set_xlabel(\"Accuracy\")\nax.set_title(\"Cross-validation scores\")\nplt.tight_layout()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}