---- Data Set Notes ----

data  = imputed, no outliers
data2 = same as data but moved Obese_Class to the beginning
data3 = removed highly correlated variable
data4 = removed insignificant variables per logistic regression


The national dataset used in this study comes from the 2017 County Health Rankings website. Since states and counties have varying populations, we chose to use the data as percentages of the population. The death rate is the age-adjusted Years of Potential Life Lost rate per 100,000. All variables are continuous, so we do not need to be concerned about unbalanced categorical variables.

OBJECTIVE 1

We want to model the probability that a county is obese (classified as obese when at least 30% of its population has a BMI > 30) given a set of predictor variables. The team has decided to use the following variables as the predictors. The variable values represent the percentage of the population for a specific US county.

Health Variables
  - Smokers
  - Physically Inactive
  - Excessive Drinking
  - Frequent Mental Distress
  - Frequent Physical Distress
  - Diabetic
  - Insufficient Sleep

Demographic-Type Variables
  - Uninsured
  - Some College
  - Unemployed
  - Severe Housing Problems

EXPLORATORY DATA ANALYSIS
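
The chunks below rely on functions from several packages that are loaded elsewhere in the project. A sketch of the likely library calls follows (an assumption inferred from the functions used, e.g. impute() from Hmisc and heatmap.2() from gplots).

# packages assumed to be loaded before the chunks below are run
library(Hmisc)              # impute()
library(gplots)             # heatmap.2(), redgreen()
library(corrplot)           # corrplot()
library(car)                # vif()
library(ggplot2)            # PCA scatter plots
library(ROCR)               # prediction(), performance()
library(ResourceSelection)  # hoslem.test()
library(caret)              # train(), trainControl(), confusionMatrix()
library(MASS)               # lda(), qda()
library(randomForest)       # randomForest(), varImpPlot()
library(tree)               # tree(), cv.tree(), prune.tree()
library(parallel)           # makeCluster(), detectCores()
library(doParallel)         # registerDoParallel(), registerDoSEQ()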

# import health data
alldata = read.csv("./Data/CHR_Data.csv")

# get initial data
data <- alldata[,-c(1:3,4,14,18:24)]

# show data summary to identify variables with missing data
summary(data)
##     Smokers          Obese       Physically_Inactive Excessive_Drinking
##  Min.   : 7.00   Min.   :12.00   Min.   : 8.00       Min.   : 8.00     
##  1st Qu.:15.00   1st Qu.:28.00   1st Qu.:23.00       1st Qu.:14.00     
##  Median :17.00   Median :31.00   Median :26.00       Median :17.00     
##  Mean   :17.89   Mean   :31.06   Mean   :26.01       Mean   :16.59     
##  3rd Qu.:20.00   3rd Qu.:34.00   3rd Qu.:29.00       3rd Qu.:19.00     
##  Max.   :41.00   Max.   :48.00   Max.   :41.00       Max.   :27.00     
##                                                                        
##  Frequent_Mental_Distress Frequent_Physical_Distress    Diabetic   
##  Min.   : 7.00            Min.   : 7.00              Min.   : 3.0  
##  1st Qu.:10.00            1st Qu.:10.00              1st Qu.:10.0  
##  Median :12.00            Median :11.00              Median :11.0  
##  Mean   :11.67            Mean   :11.82              Mean   :11.3  
##  3rd Qu.:13.00            3rd Qu.:13.00              3rd Qu.:13.0  
##  Max.   :19.00            Max.   :23.00              Max.   :24.0  
##                                                                    
##  Insufficient_Sleep   Uninsured     Some_College      Unemployed    
##  Min.   :23.00      Min.   : 3.0   Min.   :  5.30   Min.   : 1.800  
##  1st Qu.:30.00      1st Qu.:10.0   1st Qu.: 48.70   1st Qu.: 4.200  
##  Median :33.00      Median :14.0   Median : 56.70   Median : 5.300  
##  Mean   :33.09      Mean   :14.4   Mean   : 56.76   Mean   : 5.521  
##  3rd Qu.:36.00      3rd Qu.:18.0   3rd Qu.: 65.10   3rd Qu.: 6.500  
##  Max.   :47.00      Max.   :39.0   Max.   :100.00   Max.   :24.000  
##                     NA's   :1                       NA's   :1       
##  Severe_Housing_Problems
##  Min.   : 1.00          
##  1st Qu.:11.00          
##  Median :14.00          
##  Mean   :14.47          
##  3rd Qu.:17.00          
##  Max.   :62.00          
## 

The summary statistics show missing values for Uninsured and Unemployed; the full dataset also contains missing values for Premature_Death_Rate and Graduation_Rate. We impute the missing data with the median values since some distributions show a little skew when comparing the means and medians. The histograms below confirm this skew.

# impute variables with their medians, prefer medians in the event there is any skew
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))
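
As a quick sanity check (base R), we can confirm no missing values remain after imputation:

# count remaining NAs per column; all should be zero after imputation
colSums(is.na(data))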

EDA-SKEW

The histograms confirm some skew in Frequent_Mental_Distress, Frequent_Physical_Distress, Uninsured, Unemployed, and Severe_Housing_Problems. There is no need to correct for the skew since we have a relatively large data set and we are performing logistic regression.

# get chosen predictors for EDA
predictors <- data[,-2]

# histogram of chosen predictors
par(mfrow=c(3,4))
for(i in 1:11) {
    hist(predictors[,i], main=names(predictors)[i])
}

OUTLIERS (Used SAS; will put screenshots and comments about Cook’s D in the final paper)

We remove the data for Yuma County, Arizona because it is an outlier for unemployment. The county lies along the Mexico border and is predominantly a farming community with migrant (seasonal) workers, a situation that is not typical of U.S. counties. We also remove the data for Imperial County, California, which is adjacent to Yuma County, for the same reasons.

We remove the data for the Bethel, Northwest Arctic, and Yukon-Koyukuk counties in Alaska because of Severe Housing Problems. Four factors contribute to this category: housing units that lack complete kitchens, lack complete plumbing facilities, are overcrowded, or are severely cost burdened. In these Alaska counties the cost to build is beyond what residents can afford, so overcrowding is well above the norm for the rest of the United States. [Nathan Wiltse, Dustin Madden, 2018 Alaska Housing Assessment, Jan 17, 2018, https://www.ahfc.us/download_file/view/5124/853]

# print data before outlier removal
plot(data$Unemployed)

plot(data$Severe_Housing_Problems)

data <- data[!rowSums(data[c(-1:-10,-12)] > 20),] # removed unemployment outliers for migrant farming counties
data <- data[!rowSums(data[c(-1:-11)] > 50),]     # removed housing prob outliers for poor Alaska counties

# print data after outlier removal
plot(data$Unemployed)

plot(data$Severe_Housing_Problems)

# create obese binary classification: counties with an obese share >= 30% (adults with BMI > 30) are labeled Obese, the rest Not Obese
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not Obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))
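
Equivalently, the classification above could be built in a single step; a minor style alternative:

# same result as the three lines above: Obese when the obese share is at least 30%
data$Obese_Class <- factor(ifelse(data$Obese >= 30, "Obese", "Not Obese"))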

# -- create new data set with obese_class first --
data2 <- data[,c(13,1,3:12)]

# summary stats by group Obese_Class, to add to SAS Box Plots
t(aggregate(Smokers~Obese_Class,data=data2,summary))
##                 [,1]        [,2]      
## Obese_Class     "Not Obese" "Obese"   
## Smokers.Min.    " 7.00000"  " 9.00000"
## Smokers.1st Qu. "14.00000"  "17.00000"
## Smokers.Median  "15.00000"  "19.00000"
## Smokers.Mean    "15.57736"  "19.05311"
## Smokers.3rd Qu. "17.00000"  "21.00000"
## Smokers.Max.    "27.00000"  "41.00000"
t(aggregate(Physically_Inactive~Obese_Class,data=data2,summary))
##                             [,1]        [,2]      
## Obese_Class                 "Not Obese" "Obese"   
## Physically_Inactive.Min.    " 8.00000"  "16.00000"
## Physically_Inactive.1st Qu. "19.00000"  "25.00000"
## Physically_Inactive.Median  "22.00000"  "28.00000"
## Physically_Inactive.Mean    "22.00849"  "28.06422"
## Physically_Inactive.3rd Qu. "25.00000"  "31.00000"
## Physically_Inactive.Max.    "39.00000"  "41.00000"
t(aggregate(Excessive_Drinking~Obese_Class,data=data2,summary))
##                            [,1]        [,2]      
## Obese_Class                "Not Obese" "Obese"   
## Excessive_Drinking.Min.    " 8.00000"  " 9.00000"
## Excessive_Drinking.1st Qu. "16.00000"  "13.00000"
## Excessive_Drinking.Median  "18.00000"  "16.00000"
## Excessive_Drinking.Mean    "17.80377"  "15.96813"
## Excessive_Drinking.3rd Qu. "19.00000"  "18.00000"
## Excessive_Drinking.Max.    "27.00000"  "27.00000"
t(aggregate(Frequent_Mental_Distress~Obese_Class,data=data2,summary))
##                                  [,1]        [,2]      
## Obese_Class                      "Not Obese" "Obese"   
## Frequent_Mental_Distress.Min.    " 7.00000"  " 7.00000"
## Frequent_Mental_Distress.1st Qu. "10.00000"  "11.00000"
## Frequent_Mental_Distress.Median  "11.00000"  "12.00000"
## Frequent_Mental_Distress.Mean    "10.86887"  "12.07388"
## Frequent_Mental_Distress.3rd Qu. "12.00000"  "13.00000"
## Frequent_Mental_Distress.Max.    "16.00000"  "19.00000"
t(aggregate(Frequent_Physical_Distress~Obese_Class,data=data2,summary))
##                                    [,1]        [,2]      
## Obese_Class                        "Not Obese" "Obese"   
## Frequent_Physical_Distress.Min.    " 7.00000"  " 7.00000"
## Frequent_Physical_Distress.1st Qu. "10.00000"  "10.00000"
## Frequent_Physical_Distress.Median  "11.00000"  "12.00000"
## Frequent_Physical_Distress.Mean    "10.88396"  "12.29213"
## Frequent_Physical_Distress.3rd Qu. "12.00000"  "14.00000"
## Frequent_Physical_Distress.Max.    "21.00000"  "23.00000"
t(aggregate(Diabetic~Obese_Class,data=data2,summary))
##                  [,1]        [,2]       
## Obese_Class      "Not Obese" "Obese"    
## Diabetic.Min.    " 3.000000" " 6.000000"
## Diabetic.1st Qu. " 8.000000" "11.000000"
## Diabetic.Median  "10.000000" "12.000000"
## Diabetic.Mean    " 9.570755" "12.187349"
## Diabetic.3rd Qu. "11.000000" "14.000000"
## Diabetic.Max.    "18.000000" "24.000000"
t(aggregate(Insufficient_Sleep~Obese_Class,data=data2,summary))
##                            [,1]        [,2]      
## Obese_Class                "Not Obese" "Obese"   
## Insufficient_Sleep.Min.    "23.00000"  "24.00000"
## Insufficient_Sleep.1st Qu. "29.00000"  "31.00000"
## Insufficient_Sleep.Median  "31.00000"  "34.00000"
## Insufficient_Sleep.Mean    "31.34340"  "33.97682"
## Insufficient_Sleep.3rd Qu. "34.00000"  "37.00000"
## Insufficient_Sleep.Max.    "46.00000"  "47.00000"
t(aggregate(Uninsured~Obese_Class,data=data2,summary))
## 
##  1 values imputed to 14
##                   [,1]        [,2]      
## Obese_Class       "Not Obese" "Obese"   
## Uninsured.Min.    " 3.00000"  " 5.00000"
## Uninsured.1st Qu. "10.00000"  "11.00000"
## Uninsured.Median  "14.00000"  "14.00000"
## Uninsured.Mean    "14.58396"  "14.28006"
## Uninsured.3rd Qu. "19.00000"  "17.00000"
## Uninsured.Max.    "35.00000"  "39.00000"
t(aggregate(Some_College~Obese_Class,data=data2,summary))
##                      [,1]        [,2]       
## Obese_Class          "Not Obese" "Obese"    
## Some_College.Min.    "  5.30000" " 15.80000"
## Some_College.1st Qu. " 53.70000" " 46.90000"
## Some_College.Median  " 62.40000" " 54.20000"
## Some_College.Mean    " 61.38047" " 54.42825"
## Some_College.3rd Qu. " 69.70000" " 62.45000"
## Some_College.Max.    "100.00000" " 89.50000"
t(aggregate(Unemployed~Obese_Class,data=data2,summary))
## 
##  1 values imputed to 5.3
##                    [,1]        [,2]       
## Obese_Class        "Not Obese" "Obese"    
## Unemployed.Min.    " 1.800000" " 1.800000"
## Unemployed.1st Qu. " 3.800000" " 4.400000"
## Unemployed.Median  " 4.700000" " 5.600000"
## Unemployed.Mean    " 5.020943" " 5.745582"
## Unemployed.3rd Qu. " 5.900000" " 6.800000"
## Unemployed.Max.    "17.600000" "16.900000"
t(aggregate(Severe_Housing_Problems~Obese_Class,data=data2,summary))
##                                 [,1]        [,2]      
## Obese_Class                     "Not Obese" "Obese"   
## Severe_Housing_Problems.Min.    " 1.00000"  " 2.00000"
## Severe_Housing_Problems.1st Qu. "12.00000"  "11.00000"
## Severe_Housing_Problems.Median  "15.00000"  "13.00000"
## Severe_Housing_Problems.Mean    "15.60660"  "13.81507"
## Severe_Housing_Problems.3rd Qu. "18.00000"  "16.00000"
## Severe_Housing_Problems.Max.    "35.00000"  "46.00000"
# scatter matrices, with Obese as the colors
pairs(data2[,2:12],col=data$Obese_Class)

For the most part, the colored scatter plot matrix suggests our variables should do a decent job in a logistic regression, based on the color separation seen in the matrix. Strong correlation is apparent between several predictor pairs, most notably Frequent_Mental_Distress and Frequent_Physical_Distress.

We’ll review a correlation heatmap to get better insights next.

# predictor heatmap correlations to examine whether variables are redundant
cor1 <- cor(data2[,2:12])
heatmap.2(cor1,col=redgreen(75), cexRow=.7,cexCol=0.7,
          density.info="none", trace="none", dendrogram=c("both"), 
          symm=F,symkey=T,symbreaks=T, scale="none",key=T)

The dendrogrammed heatmap confirms the strong correlation previously seen between Frequent_Mental_Distress and Frequent_Physical_Distress.

Additional correlation is seen between the following pairs: Unemployed and Insufficient_Sleep, Excessive_Drinking and Some_College, Diabetic and Physically_Inactive, and Severe_Housing_Problems and Uninsured.

The correlations identified by the dendrogram all make practical sense. One would expect to lose sleep when unemployed. Drinking being correlated with college makes sense. Diabetes is not uncommon among physically inactive people. And if someone is living in an area with severe housing problems, we might expect they would not be able to afford insurance.

Next, we review the correlation values in a variable correlation heatmap.

#Correlation Plot 
cor2 <- cor(data2[,2:12])
df_corr <-corrplot(cor2, type="upper", addCoef.col = "white", number.digits = 2, number.cex = 0.5, method="square", order="hclust", title="Variable Corr Heatmap",tl.srt=45, tl.cex = 0.8)

Based on the variable correlation heatmap, the most correlated variable pairs, in order, are:
  1. Frequent_Mental_Distress, Frequent_Physical_Distress
  2. Smokers, Frequent_Physical_Distress
  3. Smokers, Frequent_Mental_Distress
  4. Diabetic, Physically_Inactive
  5. Frequent_Mental_Distress, Insufficient_Sleep
  6. Unemployed, Frequent_Mental_Distress
  7. Unemployed, Frequent_Physical_Distress
  8. Excessive_Drinking, Frequent_Physical_Distress
  9. Excessive_Drinking, Frequent_Mental_Distress
  10. Diabetic, Frequent_Mental_Distress

Let’s look at the VIFs to confirm the collinear variables.

# Logistic regression on all predictors to check VIFs
glm.vifchk <- glm(Obese_Class ~ ., data = data2, family = binomial(link="logit"))
vif(glm.vifchk)
##                    Smokers        Physically_Inactive 
##                   2.526850                   1.795551 
##         Excessive_Drinking   Frequent_Mental_Distress 
##                   2.009784                   6.933023 
## Frequent_Physical_Distress                   Diabetic 
##                   7.698296                   2.041061 
##         Insufficient_Sleep                  Uninsured 
##                   1.873762                   2.258083 
##               Some_College                 Unemployed 
##                   1.873567                   1.959237 
##    Severe_Housing_Problems 
##                   1.874558

The VIFs and multiple visual tools agree that there is a strong relationship between Frequent_Mental_Distress and Frequent_Physical_Distress. We choose to remove Frequent_Physical_Distress.

# Remove Frequent_Physical_Distress from data2
data3 <- data2[,-6] 

Let’s use PCA to look for any other insights. Fortunately, our data are already roughly normalized on a percentage scale, which reduces the scale sensitivity seen with PCA.

dat.x <- data3[,2:11]
dat.y <- data3[,1]
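# PCA on the standardized predictors (scale.=TRUE puts each variable on unit variance)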
pc.result<-prcomp(dat.x,scale.=TRUE)
pc.scores<-pc.result$x
pc.scores<-data.frame(pc.scores)
pc.scores$Obese_Class<-dat.y

#Loadings for interpretation
pc.result$rotation
##                                  PC1         PC2         PC3           PC4
## Smokers                   0.35493287  0.04656377 -0.22202874  0.4894240917
## Physically_Inactive       0.32904576  0.42158841  0.01794771  0.0599577536
## Excessive_Drinking       -0.33616920 -0.13335600 -0.16488286  0.6829413679
## Frequent_Mental_Distress  0.39796417 -0.15797014 -0.10316246  0.0646241480
## Diabetic                  0.35701121  0.27112897 -0.13585126 -0.2471879938
## Insufficient_Sleep        0.34059818 -0.16062437 -0.24523333 -0.0836628946
## Uninsured                 0.18988728 -0.18352967  0.80873077  0.0006640387
## Some_College             -0.33055551 -0.02636285 -0.34863064 -0.4518377613
## Unemployed                0.31234663 -0.27422551 -0.23676682 -0.0394306314
## Severe_Housing_Problems   0.08954141 -0.75396801 -0.04806264 -0.1116690259
##                                  PC5         PC6        PC7         PC8
## Smokers                  -0.35145872  0.41526874 -0.1166555  0.11740639
## Physically_Inactive      -0.19238224 -0.06573227  0.4125284 -0.23506705
## Excessive_Drinking        0.02982705 -0.23925809  0.4816263  0.16503090
## Frequent_Mental_Distress -0.10757358  0.32029014 -0.1491607  0.09697534
## Diabetic                  0.05786538 -0.12543176  0.4887014 -0.12475100
## Insufficient_Sleep       -0.24599920 -0.71958709 -0.2114669  0.39802098
## Uninsured                -0.11986096  0.05209674  0.2697138  0.42470621
## Some_College             -0.36161720  0.31563174  0.3303363  0.45961950
## Unemployed                0.73975179  0.16549887  0.1945407  0.25261383
## Severe_Housing_Problems  -0.26603334 -0.01161386  0.2423594 -0.51468156
##                                  PC9        PC10
## Smokers                   0.05332547  0.50676822
## Physically_Inactive      -0.64046056 -0.18198195
## Excessive_Drinking        0.12307647 -0.20767812
## Frequent_Mental_Distress  0.19479536 -0.78619016
## Diabetic                  0.65706869  0.11942351
## Insufficient_Sleep       -0.09058340  0.03726054
## Uninsured                 0.02736213  0.07254999
## Some_College             -0.11607294 -0.01009484
## Unemployed               -0.27486862  0.13338470
## Severe_Housing_Problems  -0.06445638  0.10018070
# Scree plot
pc.eigen<-(pc.result$sdev)^2
pc.prop<-pc.eigen/sum(pc.eigen)
pc.cumprop<-cumsum(pc.prop)
plot(1:10,pc.prop,type="l",main="Scree Plot",ylim=c(0,1),xlab="PC #",ylab="Proportion of Variation")
axis(1, seq(1,10,1))
lines(1:10,pc.cumprop,lty=3)

# Cumulative proportion plot
cumulative.prop<-cumsum(pc.eigen/sum(pc.eigen))
plot(1:10,cumulative.prop,type="l",main="Cumulative proportion",ylim=c(0.5,1))
points(x=6, y=0.9, type="p", pch=10, col="green")

The cumulative plot shows 6 PCs are needed to retain ~90% of the total variation in the data.
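
The number of components needed for a given threshold can also be read off programmatically, for example:

# smallest number of PCs whose cumulative proportion of variance reaches 90%
which(pc.cumprop >= 0.90)[1]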

#Use ggplot2 to plot pc's
ggplot(data = pc.scores, aes(x = PC1, y = PC2)) +
    geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC1, y = PC3)) +
  geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC1, y = PC4)) +
    geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC1, y = PC5)) +
    geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC1, y = PC6)) +
    geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC2, y = PC3)) +
    geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC2, y = PC4)) +
  geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC2, y = PC5)) +
    geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    ggtitle("PCA plot of Health Data")

ggplot(data = pc.scores, aes(x = PC2, y = PC6)) +
    geom_point(aes(col=Obese_Class), size=1)+
  geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    ggtitle("PCA plot of Health Data")

The PCA plots show decent, but not great, separation. We could still get reasonable results using the PCs as predictors, but we will use the original variables for our model.

For our model, we choose a logistic regression using LASSO for feature reduction.
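
The LASSO step itself is not shown in this document; a minimal sketch of how it could be run in R with the glmnet package (an assumption, since the selection may have been done with other tooling) is below. Coefficients shrunk to zero are candidates for removal.

# LASSO-penalized logistic regression sketch (assumes the glmnet package and data3 as built above)
library(glmnet)
x <- data.matrix(data3[,-1])   # predictor matrix
y <- data3$Obese_Class         # factor response
cv.lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # 10-fold CV over lambda
coef(cv.lasso, s = "lambda.1se")                             # zeroed coefficients suggest variables to drop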

# build a training & test data set, match train/test idx to data idx for debug tracking
samplesize=nrow(data3)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train3 = data3[train_indices,] # random training data
test3 = data3[-train_indices,] # random test data
train3.x <-train3[,2:ncol(train3)]
train3.y <-train3[,1]
# Perform on all data
glm.fit1 <- glm(Obese_Class ~ ., data = data3, family = binomial)
summary(glm.fit1)
## 
## Call:
## glm(formula = Obese_Class ~ ., family = binomial, data = data3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3086  -0.5233   0.2358   0.6237   2.8536  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.131831   1.146259  -7.967 1.63e-15 ***
## Smokers                   0.386425   0.033515  11.530  < 2e-16 ***
## Physically_Inactive       0.195882   0.019418  10.087  < 2e-16 ***
## Excessive_Drinking        0.014749   0.025819   0.571    0.568    
## Frequent_Mental_Distress -0.378120   0.068449  -5.524 3.31e-08 ***
## Diabetic                  0.333991   0.040267   8.294  < 2e-16 ***
## Insufficient_Sleep        0.027773   0.018648   1.489    0.136    
## Uninsured                -0.088001   0.012288  -7.161 8.00e-13 ***
## Some_College             -0.009237   0.006196  -1.491    0.136    
## Unemployed               -0.020646   0.041635  -0.496    0.620    
## Severe_Housing_Problems  -0.020759   0.015447  -1.344    0.179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4008.1  on 3130  degrees of freedom
## Residual deviance: 2446.1  on 3120  degrees of freedom
## AIC: 2468.1
## 
## Number of Fisher Scoring iterations: 6

After different assessments and iterations, we agree with removing the following suggested insignificant predictors:
  - Excessive_Drinking
  - Insufficient_Sleep
  - Some_College
  - Unemployed
  - Severe_Housing_Problems

# removed insignificant predictors
data4 <- data3[,-c(4,7,9:11)]
# build a training & test data set, match train/test idx to data idx for debug tracking
# using same split and indices used in train3/test3
samplesize=nrow(data3)
train4 = data4[train_indices,] # random training data
test4 = data4[-train_indices,] # random test data
train4.x <-train4[,2:ncol(train4)]
train4.y <-train4[,1]

# checking group proportions for training bias, should see similar proportions of 0s & 1s in each
proportion <- rbind(table(train4$Obese_Class),table(test4$Obese_Class))
dimnames(proportion)<- list("Data Set"=c("Train","Test"), "Obese_Class"=c("Not obese","Obese"))
proportion
##         Obese_Class
## Data Set Not obese Obese
##    Train       858  1646
##    Test        202   425

Rerun the model with only the significant predictors.

# match fit idx to data idx for debug tracking
logreg.fit4 <- glm(Obese_Class ~ ., data = data4, family = binomial)
summary(logreg.fit4)
## 
## Call:
## glm(formula = Obese_Class ~ ., family = binomial, data = data4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3368  -0.5369   0.2323   0.6245   2.7561  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.39129    0.46967 -19.996  < 2e-16 ***
## Smokers                   0.39861    0.03304  12.065  < 2e-16 ***
## Physically_Inactive       0.20664    0.01837  11.251  < 2e-16 ***
## Frequent_Mental_Distress -0.39161    0.05397  -7.256 3.98e-13 ***
## Diabetic                  0.34344    0.03723   9.225  < 2e-16 ***
## Uninsured                -0.08458    0.01020  -8.289  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4008.1  on 3130  degrees of freedom
## Residual deviance: 2452.5  on 3125  degrees of freedom
## AIC: 2464.5
## 
## Number of Fisher Scoring iterations: 6
confint(logreg.fit4)
## Waiting for profiling to be done...
##                                2.5 %      97.5 %
## (Intercept)              -10.3302875 -8.48849216
## Smokers                    0.3348803  0.46441952
## Physically_Inactive        0.1710067  0.24302822
## Frequent_Mental_Distress  -0.4983124 -0.28665941
## Diabetic                   0.2710860  0.41709065
## Uninsured                 -0.1047575 -0.06473489
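
For interpretation on the odds scale, the coefficients and their profile-likelihood limits can be exponentiated:

# odds ratios with 95% confidence intervals for the final Objective 1 model
exp(cbind(OR = coef(logreg.fit4), confint(logreg.fit4)))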

MODEL ASSUMPTIONS CHECK 1

We choose the receiver operating characteristic (ROC) curve as our first measure of classifier performance.

#--glm ROC--
logreg.pred4<-predict(logreg.fit4, newdata = train4.x, type = "response")
pred <- prediction(logreg.pred4, train4.y)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values

#Plot glm ROC
plot(roc.perf,main="GLM")
abline(a=0, b= 1) #Ref line indicating poor performance
text(x = .25, y = .75,paste("AUC = ", round(auc.train[[1]],3), sep = ""))

Our area under the curve (AUC) is ~0.89. Since this is above 0.8 (a common rule of thumb), our model does a good job discriminating between obese and not obese counties.

MODEL ASSUMPTIONS CHECK 2

A logistic regression model provides a better fit to the data if it demonstrates an improvement over a model with fewer predictors. We use the likelihood ratio test: we create a model with two key predictors and compare it against our final model (the “full model”) with five predictors.

logreg.fitless4 <- glm(Obese_Class ~ Smokers+Physically_Inactive, data=data4, family=binomial) # reduced model
anova(logreg.fit4,logreg.fitless4, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Obese_Class ~ Smokers + Physically_Inactive + Frequent_Mental_Distress + 
##     Diabetic + Uninsured
## Model 2: Obese_Class ~ Smokers + Physically_Inactive
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      3125     2452.5                          
## 2      3128     2713.0 -3  -260.51 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ho: The reduced model is favored over the fuller model. Ha: The reduced model is not favored over the fuller model.

We reject Ho. With an alpha of 0.05, the results show the observed difference in model fit is statistically significant with a p-value < 2.2e-16. The evidence suggests the full model (logreg.fit4) is favored.

hoslem.test(logreg.fit4$y, fitted(logreg.fit4), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  logreg.fit4$y, fitted(logreg.fit4)
## X-squared = 9.2499, df = 8, p-value = 0.3217

Ho: The model fits the data. Ha: The model does not fit the data. Applying the Hosmer-Lemeshow test, we fail to reject the null hypothesis (p-value = 0.3217), so there is no evidence of a lack of fit.

# McFadden's Pseudo R-squared
ll.null <- logreg.fit4$null.deviance/-2  
ll.proposed <- logreg.fit4$deviance/-2
(ll.null - ll.proposed) / ll.null  # the R-squared is basically the overall effect size
## [1] 0.3881248
1-pchisq(2*(ll.proposed - ll.null),df=(length(logreg.fit4$coefficients)-1))
## [1] 0

McFadden’s pseudo R-squared (~0.39) serves as the overall effect size. The p-value is very small, so the R-squared is not due to chance.

Our final fitted model for Objective 1 will be in the paper.

========================================================================================

OBJECTIVE 2:

Using a simple logistic regression model as a baseline, we fit additional competing models to try to improve on the prediction performance metrics.

First, we obtain the metrics for our base model using 10-fold cross validation.

# Set desired data set
train=train4
test=test4

# 10 fold cv
control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

# Model
logreg.fit <- train(Obese_Class ~ ., data=train, method="glm", family=binomial(),trControl=control)
logreg.fit
## Generalized Linear Model 
## 
## 2504 samples
##    5 predictor
##    2 classes: 'Not Obese', 'Obese' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2254, 2255, 2253, 2253, 2253, 2253, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8142424  0.5752564
# predict on test
logreg.pred <- predict(logreg.fit,test)

# results
confusionMatrix(logreg.pred,test$Obese_Class)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not Obese Obese
##   Not Obese       139    39
##   Obese            63   386
##                                           
##                Accuracy : 0.8373          
##                  95% CI : (0.8061, 0.8654)
##     No Information Rate : 0.6778          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6155          
##                                           
##  Mcnemar's Test P-Value : 0.02277         
##                                           
##             Sensitivity : 0.6881          
##             Specificity : 0.9082          
##          Pos Pred Value : 0.7809          
##          Neg Pred Value : 0.8597          
##              Prevalence : 0.3222          
##          Detection Rate : 0.2217          
##    Detection Prevalence : 0.2839          
##       Balanced Accuracy : 0.7982          
##                                           
##        'Positive' Class : Not Obese       
## 

Let’s add an interaction term to see if we can improve the model. We add the “partiers” interaction, Smokers*Excessive_Drinking, which means Excessive_Drinking is also added back into the model as a main effect. Our complex model is shown below.

log(p/(1-p)) = logit(p) = B0 + B1(Smokers) + B2(Physically_Inactive) + B3(Frequent_Mental_Distress) + B4(Diabetic) + B5(Uninsured) + B6(Excessive_Drinking) + B7(Smokers*Excessive_Drinking)

# Add Excessive Drinking to a New Data Set
data5 <- data3[,-c(7,9:11)]

# build a training & test data set with complex model data
samplesize=nrow(data5)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train = data5[train_indices,] # random training data
test = data5[-train_indices,] # random test data

# 10 fold cv
control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

# Model
logreg.fit <- train(Obese_Class ~ ., data=train, method="glm", family=binomial(),trControl=control)
logreg.fit
## Generalized Linear Model 
## 
## 2504 samples
##    6 predictor
##    2 classes: 'Not Obese', 'Obese' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2254, 2254, 2254, 2253, 2253, 2254, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8162948  0.5722829
# predict on test
logreg.pred <- predict(logreg.fit,test)

# results
confusionMatrix(logreg.pred,test$Obese_Class)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not Obese Obese
##   Not Obese       158    45
##   Obese            66   358
##                                           
##                Accuracy : 0.823           
##                  95% CI : (0.7908, 0.8521)
##     No Information Rate : 0.6427          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6063          
##                                           
##  Mcnemar's Test P-Value : 0.05765         
##                                           
##             Sensitivity : 0.7054          
##             Specificity : 0.8883          
##          Pos Pred Value : 0.7783          
##          Neg Pred Value : 0.8443          
##              Prevalence : 0.3573          
##          Detection Rate : 0.2520          
##    Detection Prevalence : 0.3238          
##       Balanced Accuracy : 0.7968          
##                                           
##        'Positive' Class : Not Obese       
## 
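Note that the formula Obese_Class ~ . above includes Excessive_Drinking as a main effect only. To fit the interaction term described earlier explicitly, the formula can spell it out; a sketch assuming the same train and control objects from this chunk:

# Smokers*Excessive_Drinking expands to both main effects plus their product term
logreg.fit.int <- train(Obese_Class ~ Smokers*Excessive_Drinking + Physically_Inactive +
                          Frequent_Mental_Distress + Diabetic + Uninsured,
                        data = train, method = "glm", family = binomial(), trControl = control)
summary(logreg.fit.int$finalModel)
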
# reset dataset
data <- read.csv("./Data/CHR_Data.csv")

# remove the id numbers, need to remove the state and county because that doesn't work with trees
# remove Premature_Death_Rate since we are interested in "living type" predictors
data <- data[,-c(1:3,4)]

# create obese binary classification: counties with an obese share >= 30% (adults with BMI > 30) are labeled Obese, the rest Not Obese
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not Obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))

# remove numeric obese, and ethnic variables
data <- data[,-c(2,10,15:20)]

# print data before outlier removal
plot(data$Unemployed)

plot(data$Severe_Housing_Problems)

# impute variables with their medians, prefer medians in the event there is any skew
# (impute before the outlier filter so the NA in Unemployed does not break the logical subset)
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))

# remove outliers
data <- data[!rowSums(data[-c(1:9,11:13)] > 20),]  # removed unemployment outliers for migrant farming counties (check the Unemployed column only)
data <- data[!rowSums(data[-c(1:10,12:13)] > 50),] # removed housing prob outliers for poor Alaska counties

# print data after outlier removal
plot(data$Unemployed)

plot(data$Severe_Housing_Problems)

# covariance matrix of the numeric predictors
round(cov(data[1:12]), 2) 
##                            Smokers Physically_Inactive Excessive_Drinking
## Smokers                      12.54               11.45              -5.29
## Physically_Inactive          11.45               27.11              -9.61
## Excessive_Drinking           -5.29               -9.61               9.93
## Frequent_Mental_Distress      5.34                5.35              -3.86
## Frequent_Physical_Distress    6.77                6.66              -5.03
## Diabetic                      5.27                9.56              -5.00
## Insufficient_Sleep            8.58               10.10              -6.47
## Uninsured                     3.35                6.41              -6.15
## Some_College                -21.67              -32.61              18.52
## Unemployed                    3.34                3.22              -2.62
## Severe_Housing_Problems       1.89               -5.12              -0.48
## Rural                        22.62               59.63             -23.76
##                            Frequent_Mental_Distress Frequent_Physical_Distress
## Smokers                                        5.34                       6.77
## Physically_Inactive                            5.35                       6.66
## Excessive_Drinking                            -3.86                      -5.03
## Frequent_Mental_Distress                       3.61                       4.21
## Frequent_Physical_Distress                     4.21                       5.71
## Diabetic                                       3.06                       3.53
## Insufficient_Sleep                             5.40                       6.44
## Uninsured                                      3.42                       5.56
## Some_College                                 -13.19                     -18.00
## Unemployed                                     2.49                       3.07
## Severe_Housing_Problems                        2.89                       3.70
## Rural                                          5.46                       9.05
##                            Diabetic Insufficient_Sleep Uninsured Some_College
## Smokers                        5.27               8.58      3.35       -21.67
## Physically_Inactive            9.56              10.10      6.41       -32.61
## Excessive_Drinking            -5.00              -6.47     -6.15        18.52
## Frequent_Mental_Distress       3.06               5.40      3.42       -13.19
## Frequent_Physical_Distress     3.53               6.44      5.56       -18.00
## Diabetic                       6.22               5.93      2.43       -14.90
## Insufficient_Sleep             5.93              16.94      4.11       -23.12
## Uninsured                      2.43               4.11     26.60       -30.12
## Some_College                 -14.90             -23.12    -30.12       135.56
## Unemployed                     2.38               4.36      1.83       -10.62
## Severe_Housing_Problems       -0.79               5.92      5.32        -3.67
## Rural                         25.02             -14.94     26.60      -117.44
##                            Unemployed Severe_Housing_Problems   Rural
## Smokers                          3.34                    1.89   22.62
## Physically_Inactive              3.22                   -5.12   59.63
## Excessive_Drinking              -2.62                   -0.48  -23.76
## Frequent_Mental_Distress         2.49                    2.89    5.46
## Frequent_Physical_Distress       3.07                    3.70    9.05
## Diabetic                         2.38                   -0.79   25.02
## Insufficient_Sleep               4.36                    5.92  -14.94
## Uninsured                        1.83                    5.32   26.60
## Some_College                   -10.62                   -3.67 -117.44
## Unemployed                       3.84                    3.11    3.83
## Severe_Housing_Problems          3.11                   20.93  -52.75
## Rural                            3.83                  -52.75  991.26
sum(diag(cov(data[1:12]))) # total variance
## [1] 1260.256
#--EDA end--

# split data into train and test
samplesize=nrow(data)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train = data[train_indices,] # random training data 
test = data[-train_indices,] # random test data 

#--LDA--
lda <- lda(Obese_Class~.,data = train)

#lda confusion matrix
lda_prd<-predict(lda, newdata = test)$class
lda_cm<-table(lda_prd,test$Obese_Class)
lda_cm
##            
## lda_prd     Not Obese Obese
##   Not Obese       129    26
##   Obese            72   400
confusionMatrix(lda_prd, as.factor(test$Obese_Class))
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not Obese Obese
##   Not Obese       129    26
##   Obese            72   400
##                                           
##                Accuracy : 0.8437          
##                  95% CI : (0.8129, 0.8712)
##     No Information Rate : 0.6794          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6181          
##                                           
##  Mcnemar's Test P-Value : 5.476e-06       
##                                           
##             Sensitivity : 0.6418          
##             Specificity : 0.9390          
##          Pos Pred Value : 0.8323          
##          Neg Pred Value : 0.8475          
##              Prevalence : 0.3206          
##          Detection Rate : 0.2057          
##    Detection Prevalence : 0.2472          
##       Balanced Accuracy : 0.7904          
##                                           
##        'Positive' Class : Not Obese       
## 
#--QDA--
qda <- qda(Obese_Class~.,data = train)

#qda confusion matrix
qda_prd<-predict(qda, newdata = test)$class
qda_cm<-table(qda_prd,test$Obese_Class)
qda_cm
##            
## qda_prd     Not Obese Obese
##   Not Obese       128    40
##   Obese            73   386
confusionMatrix(qda_prd, as.factor(test$Obese_Class))
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not Obese Obese
##   Not Obese       128    40
##   Obese            73   386
##                                           
##                Accuracy : 0.8198          
##                  95% CI : (0.7874, 0.8491)
##     No Information Rate : 0.6794          
##     P-Value [Acc > NIR] : 1.699e-15       
##                                           
##                   Kappa : 0.5675          
##                                           
##  Mcnemar's Test P-Value : 0.00261         
##                                           
##             Sensitivity : 0.6368          
##             Specificity : 0.9061          
##          Pos Pred Value : 0.7619          
##          Neg Pred Value : 0.8410          
##              Prevalence : 0.3206          
##          Detection Rate : 0.2041          
##    Detection Prevalence : 0.2679          
##       Balanced Accuracy : 0.7715          
##                                           
##        'Positive' Class : Not Obese       
## 
#--lda ROC--
ldaprd<-predict(lda, newdata = train)$posterior
#correcting for the way lda creates predicted probabilities
ldaprd<-ldaprd[,2]
pred <- prediction(ldaprd, train$Obese_Class)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
#Plot lda ROC
plot(roc.perf,main="LDA")
abline(a=0, b= 1) #Ref line indicating poor performance
text(x = .25, y = .75,paste("AUC = ", round(auc.train[[1]],3), sep = ""))

#--qda ROC--
qdaprd<-predict(qda, newdata = train)$posterior
#correcting for the way qda creates predicted probabilities
qdaprd<-qdaprd[,2]
pred <- prediction(qdaprd, train$Obese_Class)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
#Plot qda ROC
plot(roc.perf,main="QDA")
abline(a=0, b= 1) #Ref line indicating poor performance
text(x = .25, y = .75,paste("AUC = ", round(auc.train[[1]],3), sep = ""))

#have to reimport and reimpute some data.  Annoying, apologies.
datarf <- alldata[-c(1:4)]
datarf$Unemployed <- with(datarf,impute(Unemployed,median))
datarf$Uninsured <- with(datarf,impute(Uninsured,median))
datarf$Graduation_Rate <- with(datarf,impute(Graduation_Rate,median))

# create obese binary classification: counties with an obese share >= 30% (adults with BMI > 30) are labeled Obese, the rest Not_Obese
datarf$Obese_Class <- "Obese"
datarf[datarf$Obese < 30, "Obese_Class"] <- "Not_Obese"
datarf$Obese_Class <- as.factor(as.character(datarf$Obese_Class))
datarf <- datarf[,-2] # remove numeric Obese
# na_count <-sapply(datarf, function(y) sum(length(which(is.na(y)))))
# na_count <- data.frame(na_count)

datarf <- datarf[,c(20,1:19)]
Obese.full.forrest <- randomForest(Obese_Class~., data=datarf, importance = TRUE)

#importance(Obese.full.forrest) #Variable importance for placement order (Forward, Backward, Stepwise)
varImpPlot(Obese.full.forrest,type=1, main='Random Tree Variable Importance')

# reset dataset
data <- read.csv("./Data/CHR_Data.csv")

# remove the id numbers, need to remove the state and county because that doesn't work with trees
# remove Premature_Death_Rate since we are interested in "living type" predictors
data <- data[,-c(1:4)]

# create obese binary classification: counties with an obese share >= 30% (adults with BMI > 30) are labeled Obese, the rest Not_obese
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not_obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))

# impute variables with their medians, prefer medians in the event there is any skew
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))
data$Graduation_Rate <- with(data,impute(Graduation_Rate))

# remove numeric obese
data <- data[,-2]
#----EDA End----

rand.x <- data[,c(1:19)]
rand.y <- data[,20]

#-- Set parallel processing Begin--

#parallel compute, requires parallel and doParallel libraries
#create parallel cluster, leave one out so you don't lock your computer
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

# added Savepredictions, and classprobs to above
samplesize=nrow(data)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train = data[train_indices,] # random training data 
test = data[-train_indices,] # random test data 

#-- Random Forest Search Parms --
#Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random", allowParallel = TRUE)
set.seed(7)
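# sqrt(p) is the usual classification default for mtry; computed here for reference only (the random search below tunes mtry itself)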
mtry <- sqrt(ncol(rand.x))

rf_random <- train(Obese_Class~., data=data, method="rf", metric="Accuracy", tuneLength=15, trControl=control)
print(rf_random)
## Random Forest 
## 
## 3136 samples
##   19 predictor
##    2 classes: 'Not_obese', 'Obese' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2822, 2823, 2822, 2823, 2822, 2823, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8350453  0.6194531
##    3    0.8379119  0.6268550
##    6    0.8391854  0.6313953
##    7    0.8395076  0.6329911
##    8    0.8390816  0.6319787
##   12    0.8362137  0.6255540
##   15    0.8363185  0.6260080
##   16    0.8348313  0.6231863
##   18    0.8339817  0.6214561
##   19    0.8318579  0.6164541
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
plot(rf_random)

#--Random Forest with mtry = 8 --
#set.seed(7)
mtry <- 8 # fixed the mtry at 8
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Obese_Class~., data=train, method="rf", metric="Accuracy", tuneGrid=tunegrid, trControl=control)
print(rf_default)
## Random Forest 
## 
## 2508 samples
##   19 predictor
##    2 classes: 'Not_obese', 'Obese' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2258, 2257, 2256, 2257, 2258, 2258, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8403851  0.6301758
## 
## Tuning parameter 'mtry' was held constant at a value of 8
rf_default$results$Accuracy #Accuracy
## [1] 0.8403851
rf_default.pred <-predict(rf_default, newdata = test)

confusionMatrix(rf_default.pred, as.factor(test$Obese_Class))
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not_obese Obese
##   Not_obese       158    41
##   Obese            69   360
##                                           
##                Accuracy : 0.8248          
##                  95% CI : (0.7928, 0.8538)
##     No Information Rate : 0.6385          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6101          
##                                           
##  Mcnemar's Test P-Value : 0.01004         
##                                           
##             Sensitivity : 0.6960          
##             Specificity : 0.8978          
##          Pos Pred Value : 0.7940          
##          Neg Pred Value : 0.8392          
##              Prevalence : 0.3615          
##          Detection Rate : 0.2516          
##    Detection Prevalence : 0.3169          
##       Balanced Accuracy : 0.7969          
##                                           
##        'Positive' Class : Not_obese       
## 
#-- End parallel processing--
#stop the cluster and force R back to 1 core
stopCluster(cluster)
registerDoSEQ()
# fit and plot a single regression tree (tree package) with Smokers as the response, for illustration
mytree <- tree(Smokers~., datarf)
plot(mytree)
text(mytree,pretty=0)

This tree shows that Frequent_Physical_Distress is a main split point (note that the tree above is fit with Smokers as the response, not Obese_Class).

library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## 
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(rpart.plot)
library(RColorBrewer)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
## 
##     Predict
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
# print a better looking tree
form <- as.formula(Obese_Class ~.)
tree.1 <- rpart(form,datarf)
fancyRpartPlot(tree.1, cex=0.65)

# another cross validation tree check
set.seed(3)
cv.obese=cv.tree(mytree,FUN=prune.tree,method="deviance")
names(cv.obese)
## [1] "size"   "dev"    "k"      "method"
cv.obese
## $size
##  [1] 12 11 10  8  7  6  5  4  3  2  1
## 
## $dev
##  [1] 11302.33 11463.30 11479.29 12564.22 12800.31 14567.66 15009.45 15803.86
##  [9] 18495.93 21377.65 39846.63
## 
## $k
##  [1]       -Inf   410.8725   425.5333   520.5120   574.1565   939.1882
##  [7]   989.3840  1314.2226  2529.5295  2997.3627 18468.7386
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.obese)

par(mfrow=c(1,1))
plot(cv.obese$size,cv.obese$dev,type="b")

The fixed mtry = 8 model performs essentially as well as the random-search optimum (mtry = 7).

We will also fit a KNN model.

library(class)
#---reset dataset---
data <- read.csv("./Data/CHR_Data.csv")

#remove the id numbers, need to remove the state and county because that doesn't work with trees
data <- data[,-c(1:3)]

#impute for missing values previously identified in EDA
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))
data$Graduation_Rate <- with(data,impute(Graduation_Rate))

# create obese binary classification: counties with an obese share >= 30% (adults with BMI > 30) are labeled Obese, the rest Not Obese
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not Obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))

# remove Premature_Death_Rate since we are interesting in "living type" predictors, remove numeric obese data
data <- data[,-c(1,3)] 

# set impute cols to numeric
data$Unemployed <- as.numeric(data$Unemployed)
data$Uninsured <- as.numeric(data$Uninsured)
data$Graduation_Rate <- as.numeric(data$Graduation_Rate)

# normalize data with lapply
data_norm <- data
num.vars <- sapply(data_norm,is.numeric)
data_norm[num.vars] <- lapply(data_norm[num.vars], scale)
#summary(data_norm[1:19])
#------------------
# build a training & test data set
samplesize=nrow(data_norm)
train_percent = .7
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
knn_train = data_norm[train_indices,] # random training data
knn_test = data_norm[-train_indices,] # random test data

# checking group proportions for training bias, should see similar proportions of 0s & 1s in each
proportion <- rbind(table(knn_train$Obese_Class),table(knn_test$Obese_Class))
dimnames(proportion)<- list("Data Set"=c("Train","Test"), "Obese_Class"=c("Not obese","Obese"))
proportion
##         Obese_Class
## Data Set Not obese Obese
##    Train       738  1457
##    Test        324   617
knn.5 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=5)
knn_test$ObesePred = knn.5
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
## 
##            
##             Not Obese Obese
##   Not Obese       208   116
##   Obese            80   537
##                                           
##                Accuracy : 0.7917          
##                  95% CI : (0.7643, 0.8172)
##     No Information Rate : 0.6939          
##     P-Value [Acc > NIR] : 1.013e-11       
##                                           
##                   Kappa : 0.5262          
##                                           
##  Mcnemar's Test P-Value : 0.01242         
##                                           
##             Sensitivity : 0.7222          
##             Specificity : 0.8224          
##          Pos Pred Value : 0.6420          
##          Neg Pred Value : 0.8703          
##              Prevalence : 0.3061          
##          Detection Rate : 0.2210          
##    Detection Prevalence : 0.3443          
##       Balanced Accuracy : 0.7723          
##                                           
##        'Positive' Class : Not Obese       
## 
knn.10 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=10)
knn_test$ObesePred = knn.10
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
## 
##            
##             Not Obese Obese
##   Not Obese       210   114
##   Obese            67   550
##                                          
##                Accuracy : 0.8077         
##                  95% CI : (0.781, 0.8324)
##     No Information Rate : 0.7056         
##     P-Value [Acc > NIR] : 5.733e-13      
##                                          
##                   Kappa : 0.5588         
##                                          
##  Mcnemar's Test P-Value : 0.0006282      
##                                          
##             Sensitivity : 0.7581         
##             Specificity : 0.8283         
##          Pos Pred Value : 0.6481         
##          Neg Pred Value : 0.8914         
##              Prevalence : 0.2944         
##          Detection Rate : 0.2232         
##    Detection Prevalence : 0.3443         
##       Balanced Accuracy : 0.7932         
##                                          
##        'Positive' Class : Not Obese      
## 
knn.20 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=20)
knn_test$ObesePred = knn.20
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
## 
##            
##             Not Obese Obese
##   Not Obese       205   119
##   Obese            51   566
##                                           
##                Accuracy : 0.8193          
##                  95% CI : (0.7932, 0.8434)
##     No Information Rate : 0.7279          
##     P-Value [Acc > NIR] : 3.405e-11       
##                                           
##                   Kappa : 0.5789          
##                                           
##  Mcnemar's Test P-Value : 2.767e-07       
##                                           
##             Sensitivity : 0.8008          
##             Specificity : 0.8263          
##          Pos Pred Value : 0.6327          
##          Neg Pred Value : 0.9173          
##              Prevalence : 0.2721          
##          Detection Rate : 0.2179          
##    Detection Prevalence : 0.3443          
##       Balanced Accuracy : 0.8135          
##                                           
##        'Positive' Class : Not Obese       
## 
knn.50 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=50)
knn_test$ObesePred = knn.50
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
## 
##            
##             Not Obese Obese
##   Not Obese       185   139
##   Obese            46   571
##                                           
##                Accuracy : 0.8034          
##                  95% CI : (0.7765, 0.8283)
##     No Information Rate : 0.7545          
##     P-Value [Acc > NIR] : 0.000212        
##                                           
##                   Kappa : 0.5327          
##                                           
##  Mcnemar's Test P-Value : 1.343e-11       
##                                           
##             Sensitivity : 0.8009          
##             Specificity : 0.8042          
##          Pos Pred Value : 0.5710          
##          Neg Pred Value : 0.9254          
##              Prevalence : 0.2455          
##          Detection Rate : 0.1966          
##    Detection Prevalence : 0.3443          
##       Balanced Accuracy : 0.8025          
##                                           
##        'Positive' Class : Not Obese       
## 
knn.100 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=100)
knn_test$ObesePred = knn.100
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
## 
##            
##             Not Obese Obese
##   Not Obese       170   154
##   Obese            45   572
##                                          
##                Accuracy : 0.7885         
##                  95% CI : (0.761, 0.8142)
##     No Information Rate : 0.7715         
##     P-Value [Acc > NIR] : 0.1137         
##                                          
##                   Kappa : 0.491          
##                                          
##  Mcnemar's Test P-Value : 1.919e-14      
##                                          
##             Sensitivity : 0.7907         
##             Specificity : 0.7879         
##          Pos Pred Value : 0.5247         
##          Neg Pred Value : 0.9271         
##              Prevalence : 0.2285         
##          Detection Rate : 0.1807         
##    Detection Prevalence : 0.3443         
##       Balanced Accuracy : 0.7893         
##                                          
##        'Positive' Class : Not Obese      
## 
# knn.20 seems to be the sweet spot, but not by much

knn.20 seems to be the sweet spot, but not by much.
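
Rather than repeating the fit by hand for each k, a small loop can sweep k and record test accuracy; a minimal sketch assuming the knn_train and knn_test objects built above:

# sweep k and compare test-set accuracy
ks <- c(5, 10, 20, 50, 100)
acc <- sapply(ks, function(k) {
  pred <- class::knn(knn_train[,1:19], knn_test[,1:19], knn_train$Obese_Class, k = k)
  mean(pred == knn_test$Obese_Class)
})
data.frame(k = ks, accuracy = acc)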