1 Introduction

As aspiring data scientists, the first questions we must ask when given a task are: “Do we have the right data?” and “Is our data clean?”. There is little we can do if the data simply is not there, but we can do something about the second question. If data is missing, we must come up with a strategy to impute it, that is, to fill in the missing values with some reasonable substitute. In this study, we will demonstrate the utility of such methods, and show when they are effective and when they are not. First, let's set up the environment for this study, importing the necessary libraries and other formatting tools:

from pprint import pprint
from typing import Callable

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator
from sklearn.datasets import load_boston
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

plt.style.use('bmh')

2 Analysis

2.1 Initial Analysis of the Data

First, we will use pandas to perform a cursory analysis of the data, as this is always good practice:

boston = load_boston()
col_names = boston['feature_names']
bos = pd.DataFrame(data=boston['data'], columns=col_names)
bos_target = boston['target']

# check for missing values
bos.isnull().sum()
#> CRIM       0
#> ZN         0
#> INDUS      0
#> CHAS       0
#> NOX        0
#> RM         0
#> AGE        0
#> DIS        0
#> RAD        0
#> TAX        0
#> PTRATIO    0
#> B          0
#> LSTAT      0
#> dtype: int64
_ = bos.hist(bins=50, figsize = (20,15))
plt.show()
**Figure 1**: The columns each seem to have a different distribution. Other than RM, they either have a strong skew or an almost bimodal distribution.

Looks like we have no missing data to start with, so we will need to create some ourselves. The histograms also suggest some outliers, which we can check with the standard 1.5 × IQR fence rule:

for k, v in bos.items():
    # interquartile range for each column
    q1 = v.quantile(0.25)
    q3 = v.quantile(0.75)
    iqr = q3 - q1
    # values at or beyond the 1.5 * IQR fences count as outliers
    v_col = v[(v <= q1 - 1.5 * iqr) | (v >= q3 + 1.5 * iqr)]
    perc = np.shape(v_col)[0] * 100.0 / np.shape(bos)[0]
    print("%s outliers = %8.2f%%" % (k, perc))
#> CRIM outliers =    13.04%
#> ZN outliers =    13.44%
#> INDUS outliers =     0.00%
#> CHAS outliers =   100.00%
#> NOX outliers =     0.00%
#> RM outliers =     5.93%
#> AGE outliers =     0.00%
#> DIS outliers =     0.99%
#> RAD outliers =     0.00%
#> TAX outliers =     0.00%
#> PTRATIO outliers =     2.96%
#> B outliers =    15.22%
#> LSTAT outliers =     1.38%

Note that CHAS is a binary dummy variable, so its IQR is zero and the fence rule flags every single row; the 100% figure is an artifact of the rule, not a data problem. Now that we have inspected our data and soothed our cautious souls, we can proceed with our analysis.

2.2 Building a Baseline Model

In general, it is best to build a simple baseline model before proceeding with any further analysis. For our missing-data study, we will want to know the goodness of fit (\(R^2\)) and the loss (mean squared error) of each model, and the same is true for the baseline. Let us first define a utility function that fits a model and returns its scores:

X, y = load_boston(return_X_y=True)
# default parameters to use
parameter_dict = {'n_jobs':-1}
# dictionary for results
results_dict = {}


def get_scores(features: np.ndarray,
               targets: np.ndarray,
               model_class: BaseEstimator = LinearRegression,
               pars: dict = parameter_dict) -> dict:
    # initialize the model using externally supplied parameters
    model = model_class(**pars)
    model.fit(features, targets)
    preds = model.predict(features)
    # score against the targets we trained on (in-sample scores)
    goodness = r2_score(targets, preds)
    loss = mean_squared_error(targets, preds)
    return {'goodness_of_fit': goodness, 'mse': loss}

pprint(get_scores(X,y))
#> {'goodness_of_fit': 0.7406426641094095, 'mse': 21.894831181729202}

We have a goodness of fit of 0.74 and a loss of 21.89. Note that these are in-sample scores (we fit and evaluate on the same data), which is fine for comparing imputation strategies against one another, but would be optimistic as an estimate of real-world performance.
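
If we wanted an honest generalization estimate instead, a minimal sketch using the train/test utilities we already imported (an illustration only; it plays no part in the comparisons below) would look like:

# illustrative held-out check, assuming an 80/20 split; results not shown
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
heldout_model = LinearRegression(n_jobs=-1).fit(X_tr, y_tr)
print(r2_score(y_te, heldout_model.predict(X_te)))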

2.3 Missing at Random

Next we will compare different missing rates and imputation methods. First, let's define a function which replaces a random sample of a random column's values with NaNs:

def random_miss(data: np.ndarray, prop: int) -> np.ndarray:
    # raise an error unless the proportion is an int strictly between 0 and 100
    if type(prop) is not int or prop <= 0 or prop >= 100:
        raise ValueError('prop must be an int greater than zero and less than 100!')
    # pick one random column, then a random sample of its rows
    nrows = data.shape[0]
    idy = np.random.choice(data.shape[-1], 1)
    idx = np.random.choice(nrows, int(nrows * prop / 100), replace=False)
    out = data.copy()
    # set the sampled entries to NaN
    out[idx, idy] = np.nan
    return out
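
As a quick sanity check (purely illustrative), we can confirm the function produces the expected number of NaNs:

missed = random_miss(X, 10)
# both prints should read 50: 10% of the 506 rows, all in a single column
print(np.isnan(missed).sum())
print(int(X.shape[0] * 10 / 100))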

Now, because this process is random, a single draw could mislead us, so we will define two more functions: one which repeatedly makes a column missing at random and fits an imputer, and one which takes a list of scores and returns the mean and standard deviation of that list:

def _stats_dict(arr: list) -> dict:
    return {'avg': np.mean(arr), 'std': np.std(arr)}

def iterate_stats(features: np.ndarray,
                  targets: np.ndarray,
                  impute_method: str = None,
                  prop: int = 0,
                  iters: int = 10,
                  model_class: BaseEstimator = LinearRegression,
                  pars: dict = parameter_dict) -> dict:
    r2 = []
    mse = []
    for _ in range(iters):
        # prop == 0 would raise the ValueError in random_miss,
        # so we treat it as "no missing data" and score the baseline instead
        if prop == 0:
            data = features
        else:
            tmp = random_miss(features, prop)
            data = SimpleImputer(strategy=impute_method).fit_transform(tmp)
            del tmp
        res = get_scores(data, targets, model_class, pars)
        # append the scores to the running lists
        r2 += [res['goodness_of_fit']]
        mse += [res['mse']]
    return {'goodness_of_fit': _stats_dict(r2), 'loss': _stats_dict(mse)}

With all this defined, we can go ahead and build our dictionary of results:

# baseline model
results_dict['baseline'] = iterate_stats(X,y)

# define iterables
impute_types = ['mean','median']
miss_props = [1, 5, 10, 20, 33, 50]
n_rounds = X.shape[-1]*100

for imp in impute_types:
    results_dict[imp] = {p: iterate_stats(X, y, prop=p, iters=n_rounds, impute_method=imp) for p in miss_props}
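
The resulting dictionary is nested three levels deep: method, then missing percentage, then statistics. For orientation (the exact numbers will vary from run to run), each leaf looks like:

# e.g. results_dict['mean'][10] has the form
# {'goodness_of_fit': {'avg': ..., 'std': ...}, 'loss': {'avg': ..., 'std': ...}}
pprint(results_dict['mean'][10])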

Let's also check how adding an indicator column (a binary flag marking which rows were imputed) affects things:


def iterate_stats_indicator(features: np.ndarray,
                            targets: np.ndarray,
                            impute_method: str = None,
                            prop: int = 0,
                            iters: int = 10,
                            model_class: BaseEstimator = LinearRegression,
                            pars: dict = parameter_dict) -> dict:
    # identical to iterate_stats, except the imputer also appends
    # an indicator column marking which rows were imputed
    r2 = []
    mse = []
    for _ in range(iters):
        if prop == 0:
            data = features
        else:
            tmp = random_miss(features, prop)
            data = SimpleImputer(strategy=impute_method, add_indicator=True).fit_transform(tmp)
            del tmp
        res = get_scores(data, targets, model_class, pars)
        r2 += [res['goodness_of_fit']]
        mse += [res['mse']]
    return {'goodness_of_fit': _stats_dict(r2), 'loss': _stats_dict(mse)}

for imp in impute_types:
    results_dict[f"{imp} With Indicator"] = {p: iterate_stats_indicator(X, y, prop=p, iters=n_rounds, impute_method=imp) for p in miss_props}

This deeply nested dict is a bit hard to digest, so we will define two more functions. The first takes a nested dictionary (the results for one imputation method) and extracts the averages along with one-standard-deviation upper and lower bounds. The second lets us quickly plot our data:

def results_to_series(res: dict, f_type: str) -> dict:
    x = list(res.keys())
    avg = [res[k][f_type]['avg'] for k in x]
    std = [res[k][f_type]['std'] for k in x]
    # one-standard-deviation bounds, for shading the error band on the line plot
    upper = [a + s for a, s in zip(avg, std)]
    lower = [a - s for a, s in zip(avg, std)]
    return {'x': x, 'y': avg, 'upper': upper, 'lower': lower}


def make_plotter(f_type: str) -> Callable:
    # returns a plotting function for either the goodness of fit or the loss
    def plot_generic_stats(results: dict = results_dict) -> None:
        fig = plt.figure(figsize=(12, 10))
        ax = plt.subplot()
        base = results['baseline'][f_type]
        # horizontal line for the baseline
        ax.axhline(base['avg'], color='r', label='Baseline')
        keys = [k for k in results.keys() if k != 'baseline']
        for k in keys:
            tmp = results_to_series(results[k], f_type)
            ax.plot(tmp['x'], tmp['y'], label=k.title())
            ax.fill_between(tmp['x'], tmp['lower'], tmp['upper'], alpha=0.2, label=f"{k} error".title())
        ax.legend()
        ax.set_title(f"{f_type} vs Percentage of Missing Data, by Imputation Method".title().replace("_", " "))
        ax.set_xlabel('Percentage of Missing Data')
        ax.set_ylabel(f_type.title().replace("_", " "))
        plt.show()
        plt.clf()

    return plot_generic_stats


plot_goodness = make_plotter('goodness_of_fit')
plot_loss = make_plotter('loss')

Now that we have that out of the way, we can plot our results. First, let's check the goodness of fit:

plot_goodness()
**Figure 2**: Mean and median imputation performed almost identically, and neither was much worse than the baseline. Indicators do not seem to have a massive effect.

As the amount of missing data grew, the goodness of fit went down, though not by much. At low missing rates, imputation did absolutely fine. That means for simple datasets with only a bit of missing data, it's likely we can throw in a simple imputation method and get nearly the same results as if our data were all there. Let's double check that the loss tells the same story:

plot_loss()
**Figure 3**: Mean and median imputation again performed almost the same, which supports our hypothesis. Again, indicators have little to no effect.

It looks like the choice of imputation method (between mean and median, with and without indicators) doesn't make a huge difference, and as you impute more randomly missing data your results get worse, but only by a bit. One caveat: random_miss masks a single column per call, so strictly speaking we have only tested missingness concentrated in one column; still, if you are imputing several columns with few missing observations in each, you are most likely fine. You might have to start thinking harder if you are missing 50 percent or more of the data in some columns. A sketch for probing the many-columns case directly follows.
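
A hypothetical variant that masks several columns at once (an untested sketch; random_miss_multi and its n_cols parameter are our own additions, not part of the experiments above) might look like:

def random_miss_multi(data: np.ndarray, prop: int, n_cols: int = 3) -> np.ndarray:
    # mask `prop` percent of the rows in each of `n_cols` distinct random columns
    out = data.copy()
    nrows = out.shape[0]
    for col in np.random.choice(out.shape[-1], n_cols, replace=False):
        rows = np.random.choice(nrows, int(nrows * prop / 100), replace=False)
        out[rows, col] = np.nan
    return out

Swapping this in for random_miss inside iterate_stats would tell us whether spreading the same missing rate across several columns changes the picture.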

2.4 Missing at Random, with a Control

Next, we take two columns and remove 10%, 20%, and 30% of their data, while applying a control filter on a separate column. We decided to remove data from the AGE and ZN columns; our control was that the NOX column had a value greater than 0.40, so values only go missing in high-NOX rows.


X_train, X_test, y_train, y_test = train_test_split(bos, bos_target, test_size=0.20, random_state=42)

# define a thin wrapper around scikit-learn's SimpleImputer;
# returning the transformed array is more explicit (and more reliable)
# than relying on copy=False to mutate the input in place
def impute_nation(imputedata):
    impute = SimpleImputer(missing_values=np.nan, strategy="mean")
    return impute.fit_transform(imputedata)

def linear_madness(X, y, bos_imp_nan, perc): 
    linreg = LinearRegression().fit(X,y)
    y_pred = linreg.predict(X)
    return_MSE = mean_squared_error(y,y_pred)
    r2 = r2_score(y, y_pred)
    
    print("\nAfter imputing %i%% of the data" % perc)
    print("With %i and %i values imputed from the AGE and ZN columns" % (bos_imp_nan[0], bos_imp_nan[1]))
    print("After filtering the NOX column with values > 0.40")
    print("The MSE is = %.2f" % return_MSE)
    print("Goodness of fit (R_squared) is = %.2f" % r2)
    print('==============================================')


# for this section we choose AGE and ZN as our two columns,
# and control on rows where the NOX column is greater than 0.40
perc_list = [10,20,30]
cols_na = ['AGE','ZN']
control = X_train.NOX > 0.40


for x in perc_list:
    np.random.seed(42)
    bos_imp = X_train.copy()
    # set a random sample of the controlled rows to NaN in both columns
    bos_imp.loc[bos_imp.loc[control].sample(frac=(x/100)).index, cols_na] = np.nan
    bos_imp_nan = [sum(np.isnan(bos_imp.AGE)), sum(np.isnan(bos_imp.ZN))]

    bos_imp = impute_nation(bos_imp)
    linear_madness(bos_imp, y_train, bos_imp_nan, x)
#> 
#> After imputing 10% of the data
#> With 39 and 39 values imputed from the AGE and ZN columns
#> After filtering the NOX column with values > 0.40
#> The MSE is = 21.66
#> Goodness of fit (R_squared) is = 0.75
#> ==============================================
#> 
#> After imputing 20% of the data
#> With 79 and 79 values imputed from the AGE and ZN columns
#> After filtering the NOX column with values > 0.40
#> The MSE is = 21.81
#> Goodness of fit (R_squared) is = 0.75
#> ==============================================
#> 
#> After imputing 30% of the data
#> With 118 and 118 values imputed from the AGE and ZN columns
#> After filtering the NOX column with values > 0.40
#> The MSE is = 21.84
#> Goodness of fit (R_squared) is = 0.75
#> ==============================================

There was no real change in the \(R^2\) or MSE from the baseline, which had an \(R^2\) of 0.74 and a loss of 21.89. So, just as with data that is missing at random, our models are quite robust to data that is missing at random within a controlled subset.
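
As an aside, we never actually used the held-out split created above. A natural extension (a sketch only, results not shown) would score the final fit, the 30%-missing, mean-imputed version, on the untouched test data:

# hypothetical follow-up: evaluate the last imputed fit on the test split
linreg = LinearRegression().fit(bos_imp, y_train)
print("Held-out R^2 = %.2f" % r2_score(y_test, linreg.predict(X_test)))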

2.5 Data Missing Not at Random

The final scenario we need to consider is what happens when our data is not missing at random, but rather with some sort of pattern to it. In this experiment, we will make roughly 25% of a column missing, but not at random; the easiest way to arrange this is with quartiles. We will say that the lower quartile of a column is missing, and look at different ways to impute it. Here mean and median imputation should probably not do as well, since every true missing value lies below the 25th percentile while the imputed value sits near the middle of the remaining data. First, let's define a function which creates a missing-not-at-random pattern in our data:

def _mnar(arr: np.ndarray, col: int) -> np.ndarray:
    # mask every value at or below the column's 25th percentile
    quant = np.quantile(arr[:, col], 0.25)
    idx = np.where(arr[:, col] <= quant)
    out = arr.copy()
    out[idx, col] = np.nan
    # report the fraction of the column that is now missing
    na_fracs = np.isnan(out).astype(int).sum(0) / out.shape[0]
    print(f"na fraction: column {col} missing {na_fracs.sum()} of its values")
    return out

Next let's record our baseline, median, and mean imputation results. We will make each column in turn missing not at random, then take the mean and standard deviation of those results. Note in the output below that columns with many values tied at the lower quartile lose far more than 25% of their data, since every value tied with the 25th percentile gets masked: ZN (column 1) is mostly zeros and CHAS (column 3) is a binary dummy.

mnar_dict = {}
mnar_dict['baseline'] = results_dict['baseline']
for imp in impute_types:
    r2 = []
    mse = []
    for i in range(X.shape[-1]):
        data = _mnar(X, i)
        data_imputed = SimpleImputer(strategy=imp).fit_transform(data)
        scores = get_scores(data_imputed, y)
        r2 += [scores['goodness_of_fit']]
        mse += [scores['mse']]
    mnar_dict[imp] = {'goodness_of_fit': _stats_dict(r2), 'loss': _stats_dict(mse)}
#> na fraction: column 0 missing 0.2509881422924901 of its values
#> na fraction: column 1 missing 0.7351778656126482 of its values
#> na fraction: column 2 missing 0.2648221343873518 of its values
#> na fraction: column 3 missing 0.9308300395256917 of its values
#> na fraction: column 4 missing 0.2549407114624506 of its values
#> na fraction: column 5 missing 0.2509881422924901 of its values
#> na fraction: column 6 missing 0.2509881422924901 of its values
#> na fraction: column 7 missing 0.2509881422924901 of its values
#> na fraction: column 8 missing 0.3794466403162055 of its values
#> na fraction: column 9 missing 0.25296442687747034 of its values
#> na fraction: column 10 missing 0.2845849802371542 of its values
#> na fraction: column 11 missing 0.2509881422924901 of its values
#> na fraction: column 12 missing 0.2509881422924901 of its values
#> na fraction: column 0 missing 0.2509881422924901 of its values
#> na fraction: column 1 missing 0.7351778656126482 of its values
#> na fraction: column 2 missing 0.2648221343873518 of its values
#> na fraction: column 3 missing 0.9308300395256917 of its values
#> na fraction: column 4 missing 0.2549407114624506 of its values
#> na fraction: column 5 missing 0.2509881422924901 of its values
#> na fraction: column 6 missing 0.2509881422924901 of its values
#> na fraction: column 7 missing 0.2509881422924901 of its values
#> na fraction: column 8 missing 0.3794466403162055 of its values
#> na fraction: column 9 missing 0.25296442687747034 of its values
#> na fraction: column 10 missing 0.2845849802371542 of its values
#> na fraction: column 11 missing 0.2509881422924901 of its values
#> na fraction: column 12 missing 0.2509881422924901 of its values

We will also want to try a constant imputation strategy, using a large outlier as the fill value:

r2 = []
mse = []
for i in range(X.shape[-1]):
    data = _mnar(X,i)
    data_imputed = SimpleImputer(strategy='constant', fill_value=-100).fit_transform(data)
    scores = get_scores(data_imputed, y)
    r2 += [scores['goodness_of_fit']]
    mse += [scores['mse']]
#> na fraction: column 0 missing 0.2509881422924901 of its values
#> na fraction: column 1 missing 0.7351778656126482 of its values
#> na fraction: column 2 missing 0.2648221343873518 of its values
#> na fraction: column 3 missing 0.9308300395256917 of its values
#> na fraction: column 4 missing 0.2549407114624506 of its values
#> na fraction: column 5 missing 0.2509881422924901 of its values
#> na fraction: column 6 missing 0.2509881422924901 of its values
#> na fraction: column 7 missing 0.2509881422924901 of its values
#> na fraction: column 8 missing 0.3794466403162055 of its values
#> na fraction: column 9 missing 0.25296442687747034 of its values
#> na fraction: column 10 missing 0.2845849802371542 of its values
#> na fraction: column 11 missing 0.2509881422924901 of its values
#> na fraction: column 12 missing 0.2509881422924901 of its values
mnar_dict['constant'] = {'goodness_of_fit': _stats_dict(r2), 'loss': _stats_dict(mse)}

Since the mean and median are poor guesses for values drawn from the bottom quartile, we should also test what happens if we add a new feature which indicates whether an observation was imputed or not:

for imp in impute_types:
    r2 = []
    mse = []
    for i in range(X.shape[-1]):
        data = _mnar(X, i)
        data_imputed = SimpleImputer(strategy=imp, add_indicator=True).fit_transform(data)
        scores = get_scores(data_imputed, y)
        r2 += [scores['goodness_of_fit']]
        mse += [scores['mse']]
    mnar_dict[f"{imp} With Indicator"] = {'goodness_of_fit': _stats_dict(r2), 'loss': _stats_dict(mse)}
#> na fraction: column 0 missing 0.2509881422924901 of its values
#> na fraction: column 1 missing 0.7351778656126482 of its values
#> na fraction: column 2 missing 0.2648221343873518 of its values
#> na fraction: column 3 missing 0.9308300395256917 of its values
#> na fraction: column 4 missing 0.2549407114624506 of its values
#> na fraction: column 5 missing 0.2509881422924901 of its values
#> na fraction: column 6 missing 0.2509881422924901 of its values
#> na fraction: column 7 missing 0.2509881422924901 of its values
#> na fraction: column 8 missing 0.3794466403162055 of its values
#> na fraction: column 9 missing 0.25296442687747034 of its values
#> na fraction: column 10 missing 0.2845849802371542 of its values
#> na fraction: column 11 missing 0.2509881422924901 of its values
#> na fraction: column 12 missing 0.2509881422924901 of its values
#> na fraction: column 0 missing 0.2509881422924901 of its values
#> na fraction: column 1 missing 0.7351778656126482 of its values
#> na fraction: column 2 missing 0.2648221343873518 of its values
#> na fraction: column 3 missing 0.9308300395256917 of its values
#> na fraction: column 4 missing 0.2549407114624506 of its values
#> na fraction: column 5 missing 0.2509881422924901 of its values
#> na fraction: column 6 missing 0.2509881422924901 of its values
#> na fraction: column 7 missing 0.2509881422924901 of its values
#> na fraction: column 8 missing 0.3794466403162055 of its values
#> na fraction: column 9 missing 0.25296442687747034 of its values
#> na fraction: column 10 missing 0.2845849802371542 of its values
#> na fraction: column 11 missing 0.2509881422924901 of its values
#> na fraction: column 12 missing 0.2509881422924901 of its values

Finally, let's plot our results:

ax = plt.subplot()
keys = list(mnar_dict.keys())
avgs = [mnar_dict[imp]['loss']['avg'] for imp in keys]
stds = [mnar_dict[imp]['loss']['std'] for imp in keys]
for k in range(len(keys)):
    _ = ax.barh(keys[k], avgs[k], label=keys[k].title(), xerr=stds[k])
_ = ax.set_title('Loss of missing not at random imputation'.title())
_ = ax.set_xlabel('Loss')
_ = ax.set_ylabel('Imputation Method')
# vertical line marking the baseline loss
_ = ax.axvline(avgs[0], color='r', label='Baseline')
_ = plt.yticks(rotation=45)
_ = ax.legend()
plt.show()
**Figure 4**: The model appears fairly robust to the imputation scheme in this case; however, unlike with missing at random, adding an 'imputed' indicator feature seems to improve the results a bit. The constant imputation strategy was not fruitful here.

It seems the “constant” imputation strategy did not work very well, but the indicators clearly helped in this case. Let's see how we did on goodness of fit.

ax = plt.subplot()
keys = list(mnar_dict.keys())
avgs = [mnar_dict[imp]['goodness_of_fit']['avg'] for imp in keys]
stds = [mnar_dict[imp]['goodness_of_fit']['std'] for imp in keys]
for k in range(len(keys)):
    _ = ax.barh(keys[k], avgs[k], label=keys[k].title(), xerr=stds[k])
_ = ax.set_title('goodness of fit of missing not at random imputation'.title())
_ = ax.set_xlabel('goodness of fit'.title())
_ = ax.set_ylabel('Imputation Method')
# vertical line marking the baseline goodness of fit
_ = ax.axvline(avgs[0], color='r', label='Baseline')
_ = plt.yticks(rotation=45)
_ = ax.legend()
plt.show()
**Figure 5**: We see nearly the same pattern as with the loss. Including the indicator variable seems to be the way to go with missing not at random.

This confirms our hypotheses: constant imputation with a ridiculous value does not work, at least for linear regression (it might make more sense for a tree-based model, which can split on the sentinel value), while adding the indicator feature definitely helps when data is missing not at random.
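
The framework above makes the tree-based aside easy to probe. A minimal sketch (our own addition, results not shown; note that RandomForestRegressor happens to accept the n_jobs key already in parameter_dict) might be:

from sklearn.ensemble import RandomForestRegressor

# hypothetical check: can a forest exploit the -100 sentinel by splitting on it?
data = _mnar(X, 0)
data_imputed = SimpleImputer(strategy='constant', fill_value=-100).fit_transform(data)
pprint(get_scores(data_imputed, y, model_class=RandomForestRegressor, pars=parameter_dict))

Keep in mind that these scores are in-sample, so a forest will look flattering regardless; the fair comparison is against the same forest with mean imputation.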

3 Conclusions

There are several important takeaways from this study. First, when variables are missing at random, our models are fairly robust to imputation and missing data: as the percentage of missing data increases, the models get a bit worse, but not dramatically. If a large proportion of the data is missing, it may be time to consider more sophisticated methods, but in general missing at random is manageable. If our data is missing not at random (MNAR), we can still use simple imputation methods, but adding a new feature which indicates whether a variable was imputed can be genuinely helpful.