---- Data Set Notes ----
- data  = imputed, no outliers
- data2 = same as data but with Obese_Class moved to the beginning
- data3 = data2 with the highly correlated variable (Frequent_Physical_Distress) removed
- data4 = data3 with insignificant variables removed per the logistic regression
The national dataset used in this study comes from the 2017 County Health Rankings website. Since states and counties have varying populations, we chose to use the data as percentages of the population. The death rate is the age-adjusted Years of Potential Life Lost rate per 100,000. All variables are continuous, so we do not need to be concerned about unbalanced categorical variables.
OBJECTIVE 1
We want to model the probability of being obese (BMI >= 30) for a particular county. The team has decided to use the following variables as the predictors; each variable represents the percentage of the population for a specific US county.
Health variables:
- Smokers
- Physically Inactive
- Excessive Drinking
- Frequent Mental Distress
- Frequent Physical Distress
- Diabetic
- Insufficient Sleep

Demographic variables:
- Uninsured
- Some College
- Unemployed
- Severe Housing Problems
EXPLORATORY DATA ANALYSIS
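The code chunks below call functions from several packages, but the library() statements are not shown in this extract. A minimal setup block, assuming this is the package set used (each call corresponds to a function that appears later in the report):
# packages assumed by the chunks below
library(Hmisc)             # impute()
library(gplots)            # heatmap.2(), redgreen()
library(corrplot)          # corrplot()
library(car)               # vif()
library(ggplot2)           # PCA scatter plots
library(ROCR)              # prediction(), performance()
library(ResourceSelection) # hoslem.test()
library(caret)             # train(), trainControl(), confusionMatrix()
library(MASS)              # lda(), qda()
library(randomForest)      # randomForest(), varImpPlot()
library(tree)              # tree(), cv.tree()
library(parallel)          # makeCluster(), detectCores()
library(doParallel)        # registerDoParallel(), registerDoSEQ()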
# import health data
alldata = read.csv("./Data/CHR_Data.csv")
# get initial data
data <- alldata[,-c(1:3,4,14,18:24)]
# show data summary to identify variables with missing data
summary(data)
## Smokers Obese Physically_Inactive Excessive_Drinking
## Min. : 7.00 Min. :12.00 Min. : 8.00 Min. : 8.00
## 1st Qu.:15.00 1st Qu.:28.00 1st Qu.:23.00 1st Qu.:14.00
## Median :17.00 Median :31.00 Median :26.00 Median :17.00
## Mean :17.89 Mean :31.06 Mean :26.01 Mean :16.59
## 3rd Qu.:20.00 3rd Qu.:34.00 3rd Qu.:29.00 3rd Qu.:19.00
## Max. :41.00 Max. :48.00 Max. :41.00 Max. :27.00
##
## Frequent_Mental_Distress Frequent_Physical_Distress Diabetic
## Min. : 7.00 Min. : 7.00 Min. : 3.0
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.0
## Median :12.00 Median :11.00 Median :11.0
## Mean :11.67 Mean :11.82 Mean :11.3
## 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:13.0
## Max. :19.00 Max. :23.00 Max. :24.0
##
## Insufficient_Sleep Uninsured Some_College Unemployed
## Min. :23.00 Min. : 3.0 Min. : 5.30 Min. : 1.800
## 1st Qu.:30.00 1st Qu.:10.0 1st Qu.: 48.70 1st Qu.: 4.200
## Median :33.00 Median :14.0 Median : 56.70 Median : 5.300
## Mean :33.09 Mean :14.4 Mean : 56.76 Mean : 5.521
## 3rd Qu.:36.00 3rd Qu.:18.0 3rd Qu.: 65.10 3rd Qu.: 6.500
## Max. :47.00 Max. :39.0 Max. :100.00 Max. :24.000
## NA's :1 NA's :1
## Severe_Housing_Problems
## Min. : 1.00
## 1st Qu.:11.00
## Median :14.00
## Mean :14.47
## 3rd Qu.:17.00
## Max. :62.00
##
The summary statistics show missing values for Unemployed and Uninsured (Premature_Death_Rate and Graduation_Rate, used later in the analysis, also contain missing values in the full data set). We impute the missing data with the median values, since some distributions show a little skew when comparing the means and medians. The histograms that follow confirm this skew.
# impute variables with their medians, prefer medians in the event there is any skew
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))
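As a quick sanity check, not part of the original chunk, we can confirm that no missing values remain after imputation:
# count remaining NAs per column; every entry should be zero after imputation
colSums(is.na(data))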
EDA-SKEW
The histograms confirm some skew in Frequent_Mental_Distress, Frequent_Physical_Distress, Uninsured, Unemployed, and Severe_Housing_Problems. There is no need to address the skew, since we have a relatively large data set and we are performing a logistic regression.
# get chosen predictors for EDA
predictors <- data[,-2]
# histogram of chosen predictors
par(mfrow=c(3,4))
for(i in 1:11) {
hist(predictors[,i], main=names(predictors)[i])
}
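As a numeric cross-check of the visual skew assessment, the sample skewness of each predictor can be computed; a small sketch, assuming the e1071 package is available:
# positive values indicate right skew; compare against the histograms above
library(e1071)
round(sapply(predictors, skewness, na.rm = TRUE), 2)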
OUTLIERS (Cook's D was examined in SAS; screenshots and comments will appear in the final paper)
We remove the data for Yuma County in Arizona because it creates an outlier for unemployment. The county lies along the Mexico border and is predominantly a farming community with migrant (seasonal) workers. This situation is uncommon and not typical of U.S. counties. We also remove the data for Imperial County in California, which is adjacent to Yuma County, for the same reasons.
We remove the data for the Bethel, Northwest Arctic, and Yukon-Koyukuk counties in Alaska because of Severe Housing Problems. Four factors contribute to this category: housing units that lack complete kitchens, lack complete plumbing facilities, are overcrowded, or are severely cost burdened. In these Alaska counties the cost to build is beyond what residents can afford, so overcrowding is well above normal compared to the rest of the United States. [Nathan Wiltse, Dustin Madden, 2018 Alaska Housing Assessment, Jan 17, 2018, https://www.ahfc.us/download_file/view/5124/853]
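Cook's D itself was run in SAS; an equivalent check can be sketched in R using a preliminary linear fit on the continuous Obese response (an illustration under that assumption, not the SAS analysis itself):
# flag potentially influential counties with Cook's distance from a preliminary fit
prelim.fit <- lm(Obese ~ ., data = data)
cooksd <- cooks.distance(prelim.fit)
plot(cooksd, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4/nrow(data), col = "red", lty = 2) # common rule-of-thumb cutoff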
# print data before outlier removal
plot(data$Unemployed)
plot(data$Severe_Housing_Problems)
data <- data[!rowSums(data[c(-1:-10,-12)] > 20),] # removed unemployment outliers for migrant farming counties
data <- data[!rowSums(data[c(-1:-11)] > 50),] # removed housing prob outliers for poor Alaska counties
# print data after outlier removal
plot(data$Unemployed)
plot(data$Severe_Housing_Problems)
# create obese binary classification: a county is labeled Obese when its adult obesity rate (% of adults with BMI >= 30) is 30% or higher, otherwise Not Obese
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not Obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))
# -- create new data set with obese_class first --
data2 <- data[,c(13,1,3:12)]
# summary stats by group Obese_Class, to add to SAS Box Plots
t(aggregate(Smokers~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Smokers.Min. " 7.00000" " 9.00000"
## Smokers.1st Qu. "14.00000" "17.00000"
## Smokers.Median "15.00000" "19.00000"
## Smokers.Mean "15.57736" "19.05311"
## Smokers.3rd Qu. "17.00000" "21.00000"
## Smokers.Max. "27.00000" "41.00000"
t(aggregate(Physically_Inactive~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Physically_Inactive.Min. " 8.00000" "16.00000"
## Physically_Inactive.1st Qu. "19.00000" "25.00000"
## Physically_Inactive.Median "22.00000" "28.00000"
## Physically_Inactive.Mean "22.00849" "28.06422"
## Physically_Inactive.3rd Qu. "25.00000" "31.00000"
## Physically_Inactive.Max. "39.00000" "41.00000"
t(aggregate(Excessive_Drinking~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Excessive_Drinking.Min. " 8.00000" " 9.00000"
## Excessive_Drinking.1st Qu. "16.00000" "13.00000"
## Excessive_Drinking.Median "18.00000" "16.00000"
## Excessive_Drinking.Mean "17.80377" "15.96813"
## Excessive_Drinking.3rd Qu. "19.00000" "18.00000"
## Excessive_Drinking.Max. "27.00000" "27.00000"
t(aggregate(Frequent_Mental_Distress~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Frequent_Mental_Distress.Min. " 7.00000" " 7.00000"
## Frequent_Mental_Distress.1st Qu. "10.00000" "11.00000"
## Frequent_Mental_Distress.Median "11.00000" "12.00000"
## Frequent_Mental_Distress.Mean "10.86887" "12.07388"
## Frequent_Mental_Distress.3rd Qu. "12.00000" "13.00000"
## Frequent_Mental_Distress.Max. "16.00000" "19.00000"
t(aggregate(Frequent_Physical_Distress~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Frequent_Physical_Distress.Min. " 7.00000" " 7.00000"
## Frequent_Physical_Distress.1st Qu. "10.00000" "10.00000"
## Frequent_Physical_Distress.Median "11.00000" "12.00000"
## Frequent_Physical_Distress.Mean "10.88396" "12.29213"
## Frequent_Physical_Distress.3rd Qu. "12.00000" "14.00000"
## Frequent_Physical_Distress.Max. "21.00000" "23.00000"
t(aggregate(Diabetic~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Diabetic.Min. " 3.000000" " 6.000000"
## Diabetic.1st Qu. " 8.000000" "11.000000"
## Diabetic.Median "10.000000" "12.000000"
## Diabetic.Mean " 9.570755" "12.187349"
## Diabetic.3rd Qu. "11.000000" "14.000000"
## Diabetic.Max. "18.000000" "24.000000"
t(aggregate(Insufficient_Sleep~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Insufficient_Sleep.Min. "23.00000" "24.00000"
## Insufficient_Sleep.1st Qu. "29.00000" "31.00000"
## Insufficient_Sleep.Median "31.00000" "34.00000"
## Insufficient_Sleep.Mean "31.34340" "33.97682"
## Insufficient_Sleep.3rd Qu. "34.00000" "37.00000"
## Insufficient_Sleep.Max. "46.00000" "47.00000"
t(aggregate(Uninsured~Obese_Class,data=data2,summary))
##
## 1 values imputed to 14
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Uninsured.Min. " 3.00000" " 5.00000"
## Uninsured.1st Qu. "10.00000" "11.00000"
## Uninsured.Median "14.00000" "14.00000"
## Uninsured.Mean "14.58396" "14.28006"
## Uninsured.3rd Qu. "19.00000" "17.00000"
## Uninsured.Max. "35.00000" "39.00000"
t(aggregate(Some_College~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Some_College.Min. " 5.30000" " 15.80000"
## Some_College.1st Qu. " 53.70000" " 46.90000"
## Some_College.Median " 62.40000" " 54.20000"
## Some_College.Mean " 61.38047" " 54.42825"
## Some_College.3rd Qu. " 69.70000" " 62.45000"
## Some_College.Max. "100.00000" " 89.50000"
t(aggregate(Unemployed~Obese_Class,data=data2,summary))
##
## 1 values imputed to 5.3
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Unemployed.Min. " 1.800000" " 1.800000"
## Unemployed.1st Qu. " 3.800000" " 4.400000"
## Unemployed.Median " 4.700000" " 5.600000"
## Unemployed.Mean " 5.020943" " 5.745582"
## Unemployed.3rd Qu. " 5.900000" " 6.800000"
## Unemployed.Max. "17.600000" "16.900000"
t(aggregate(Severe_Housing_Problems~Obese_Class,data=data2,summary))
## [,1] [,2]
## Obese_Class "Not Obese" "Obese"
## Severe_Housing_Problems.Min. " 1.00000" " 2.00000"
## Severe_Housing_Problems.1st Qu. "12.00000" "11.00000"
## Severe_Housing_Problems.Median "15.00000" "13.00000"
## Severe_Housing_Problems.Mean "15.60660" "13.81507"
## Severe_Housing_Problems.3rd Qu. "18.00000" "16.00000"
## Severe_Housing_Problems.Max. "35.00000" "46.00000"
# scatter matrices, with Obese as the colors
pairs(data2[,2:12],col=data$Obese_Class)
Next, we review a correlation heatmap for better insight.
# predictor heatmap correlations to examine whether variables are redundant
cor1 <- cor(data2[,2:12])
heatmap.2(cor1,col=redgreen(75), cexRow=.7,cexCol=0.7,
density.info="none", trace="none", dendrogram=c("both"),
symm=F,symkey=T,symbreaks=T, scale="none",key=T)
The dendrogram-ordered heatmap confirms the strong correlation previously seen between Frequent_Mental_Distress and Frequent_Physical_Distress.
The additional correlations identified by the dendrogram, perhaps surprisingly, all make practical sense. One would expect to lose sleep when unemployed. Excessive drinking being correlated with some college makes sense. Diabetes is not uncommon among physically inactive people. And if someone lives in an area with severe housing problems, we might expect they would not be able to afford insurance.
Next, we review the correlations with their numeric values in a correlation plot.
#Correlation Plot
cor2 <- cor(data2[,2:12])
df_corr <-corrplot(cor2, type="upper", addCoef.col = "white", number.digits = 2, number.cex = 0.5, method="square", order="hclust", title="Variable Corr Heatmap",tl.srt=45, tl.cex = 0.8)
Let’s look at the VIFs to confirm the collinear variables.
# Logistic regression to check VIFs
glm.vifchk <- glm(Obese_Class ~ ., data = data2, family = binomial(link="logit"))
vif(glm.vifchk)
## Smokers Physically_Inactive
## 2.526850 1.795551
## Excessive_Drinking Frequent_Mental_Distress
## 2.009784 6.933023
## Frequent_Physical_Distress Diabetic
## 7.698296 2.041061
## Insufficient_Sleep Uninsured
## 1.873762 2.258083
## Some_College Unemployed
## 1.873567 1.959237
## Severe_Housing_Problems
## 1.874558
The VIFs and multiple visual tools agree that there is a strong relationship between Frequent_Mental_Distress and Frequent_Physical_Distress. We choose to remove Frequent_Physical_Distress.
# Remove Frequent_Physical_Distress from data2
data3 <- data2[,-6]
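With Frequent_Physical_Distress dropped, the VIFs can be re-checked to confirm the collinearity concern is resolved; a quick sketch:
# re-check VIFs on the reduced predictor set; all values should now be well below 5
vif(glm(Obese_Class ~ ., data = data3, family = binomial(link = "logit")))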
Let's use PCA to visualize any other insights. Fortunately, our data are already roughly normalized on a percentage scale, which reduces the scale sensitivity seen with PCA.
dat.x <- data3[,2:11]
dat.y <- data3[,1]
pc.result<-prcomp(dat.x,scale.=TRUE)
pc.scores<-pc.result$x
pc.scores<-data.frame(pc.scores)
pc.scores$Obese_Class<-dat.y
#Loadings for interpretation
pc.result$rotation
## PC1 PC2 PC3 PC4
## Smokers 0.35493287 0.04656377 -0.22202874 0.4894240917
## Physically_Inactive 0.32904576 0.42158841 0.01794771 0.0599577536
## Excessive_Drinking -0.33616920 -0.13335600 -0.16488286 0.6829413679
## Frequent_Mental_Distress 0.39796417 -0.15797014 -0.10316246 0.0646241480
## Diabetic 0.35701121 0.27112897 -0.13585126 -0.2471879938
## Insufficient_Sleep 0.34059818 -0.16062437 -0.24523333 -0.0836628946
## Uninsured 0.18988728 -0.18352967 0.80873077 0.0006640387
## Some_College -0.33055551 -0.02636285 -0.34863064 -0.4518377613
## Unemployed 0.31234663 -0.27422551 -0.23676682 -0.0394306314
## Severe_Housing_Problems 0.08954141 -0.75396801 -0.04806264 -0.1116690259
## PC5 PC6 PC7 PC8
## Smokers -0.35145872 0.41526874 -0.1166555 0.11740639
## Physically_Inactive -0.19238224 -0.06573227 0.4125284 -0.23506705
## Excessive_Drinking 0.02982705 -0.23925809 0.4816263 0.16503090
## Frequent_Mental_Distress -0.10757358 0.32029014 -0.1491607 0.09697534
## Diabetic 0.05786538 -0.12543176 0.4887014 -0.12475100
## Insufficient_Sleep -0.24599920 -0.71958709 -0.2114669 0.39802098
## Uninsured -0.11986096 0.05209674 0.2697138 0.42470621
## Some_College -0.36161720 0.31563174 0.3303363 0.45961950
## Unemployed 0.73975179 0.16549887 0.1945407 0.25261383
## Severe_Housing_Problems -0.26603334 -0.01161386 0.2423594 -0.51468156
## PC9 PC10
## Smokers 0.05332547 0.50676822
## Physically_Inactive -0.64046056 -0.18198195
## Excessive_Drinking 0.12307647 -0.20767812
## Frequent_Mental_Distress 0.19479536 -0.78619016
## Diabetic 0.65706869 0.11942351
## Insufficient_Sleep -0.09058340 0.03726054
## Uninsured 0.02736213 0.07254999
## Some_College -0.11607294 -0.01009484
## Unemployed -0.27486862 0.13338470
## Severe_Housing_Problems -0.06445638 0.10018070
# Scree plot
pc.eigen<-(pc.result$sdev)^2
pc.prop<-pc.eigen/sum(pc.eigen)
pc.cumprop<-cumsum(pc.prop)
plot(1:10,pc.prop,type="l",main="Scree Plot",ylim=c(0,1),xlab="PC #",ylab="Proportion of Variation")
axis(1, seq(1,10,1))
lines(1:10,pc.cumprop,lty=3)
# Cumulative proportion plot
cumulative.prop<-cumsum(pc.eigen/sum(pc.eigen))
plot(1:10,cumulative.prop,type="l",main="Cumulative proportion",ylim=c(0.5,1))
points(x=6, y=0.9, type="p", pch=10, col="green")
The cumulative plot shows 6 PCs are needed to retain ~90% of the total variation in the data.
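The same proportions can also be read directly from the prcomp summary as a cross-check:
# standard deviation, proportion of variance, and cumulative proportion per PC
summary(pc.result)$importance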
#Use ggplot2 to plot pc's
ggplot(data = pc.scores, aes(x = PC1, y = PC2)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC1, y = PC3)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC1, y = PC4)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC1, y = PC5)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC1, y = PC6)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC2, y = PC3)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC2, y = PC4)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC2, y = PC5)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
ggplot(data = pc.scores, aes(x = PC2, y = PC6)) +
geom_point(aes(col=Obese_Class), size=1)+
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
ggtitle("PCA plot of Health Data")
The PCA plots show decent, but not great, separation. We could still get reasonable results using the PC variables, but we will use the original variables for our model.
For our model, we choose a logistic regression using LASSO for feature reduction.
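The LASSO step itself is not shown in this extract; a minimal sketch of how it could be run with cv.glmnet, assuming the glmnet package (the predictors retained would then feed the logistic regression below):
# LASSO (alpha = 1) with 10-fold CV on the binary obesity outcome
library(glmnet)
x.lasso <- model.matrix(Obese_Class ~ ., data = data3)[, -1] # drop intercept column
y.lasso <- data3$Obese_Class
set.seed(1)
cv.lasso <- cv.glmnet(x.lasso, y.lasso, family = "binomial", alpha = 1)
coef(cv.lasso, s = "lambda.1se") # predictors with nonzero coefficients are kept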
# build a training & test data set, match train/test idx to data idx for debug tracking
samplesize=nrow(data3)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train3 = data3[train_indices,] # random training data
test3 = data3[-train_indices,] # random test data
train3.x <-train3[,2:ncol(train3)]
train3.y <-train3[,1]
# Perform on all data
glm.fit1 <- glm(Obese_Class ~ ., data = data3, family = binomial)
summary(glm.fit1)
##
## Call:
## glm(formula = Obese_Class ~ ., family = binomial, data = data3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3086 -0.5233 0.2358 0.6237 2.8536
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.131831 1.146259 -7.967 1.63e-15 ***
## Smokers 0.386425 0.033515 11.530 < 2e-16 ***
## Physically_Inactive 0.195882 0.019418 10.087 < 2e-16 ***
## Excessive_Drinking 0.014749 0.025819 0.571 0.568
## Frequent_Mental_Distress -0.378120 0.068449 -5.524 3.31e-08 ***
## Diabetic 0.333991 0.040267 8.294 < 2e-16 ***
## Insufficient_Sleep 0.027773 0.018648 1.489 0.136
## Uninsured -0.088001 0.012288 -7.161 8.00e-13 ***
## Some_College -0.009237 0.006196 -1.491 0.136
## Unemployed -0.020646 0.041635 -0.496 0.620
## Severe_Housing_Problems -0.020759 0.015447 -1.344 0.179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4008.1 on 3130 degrees of freedom
## Residual deviance: 2446.1 on 3120 degrees of freedom
## AIC: 2468.1
##
## Number of Fisher Scoring iterations: 6
After different assessments and iterations, we agree with removing the following insignificant predictors:
- Excessive_Drinking
- Insufficient_Sleep
- Some_College
- Unemployed
- Severe_Housing_Problems
# remove insignificant predictors
data4 <- data3[,-c(4,7,9:11)]
# build a training & test data set, match train/test idx to data idx for debug tracking
# using same split and indices used in train3/test3
samplesize=nrow(data3)
train4 = data4[train_indices,] # random training data
test4 = data4[-train_indices,] # random test data
train4.x <-train4[,2:ncol(train4)]
train4.y <-train4[,1]
# checking group proportions for training bias, should see similar proportions of 0s & 1s in each
proportion <- rbind(table(train4$Obese_Class),table(test4$Obese_Class))
dimnames(proportion)<- list("Data Set"=c("Train","Test"), "Obese_Class"=c("Not obese","Obese"))
proportion
## Obese_Class
## Data Set Not obese Obese
## Train 858 1646
## Test 202 425
Rerun the model with only the significant predictors.
# match fit idx to data idx for debug tracking
logreg.fit4 <- glm(Obese_Class ~ ., data = data4, family = binomial)
summary(logreg.fit4)
##
## Call:
## glm(formula = Obese_Class ~ ., family = binomial, data = data4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3368 -0.5369 0.2323 0.6245 2.7561
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.39129 0.46967 -19.996 < 2e-16 ***
## Smokers 0.39861 0.03304 12.065 < 2e-16 ***
## Physically_Inactive 0.20664 0.01837 11.251 < 2e-16 ***
## Frequent_Mental_Distress -0.39161 0.05397 -7.256 3.98e-13 ***
## Diabetic 0.34344 0.03723 9.225 < 2e-16 ***
## Uninsured -0.08458 0.01020 -8.289 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4008.1 on 3130 degrees of freedom
## Residual deviance: 2452.5 on 3125 degrees of freedom
## AIC: 2464.5
##
## Number of Fisher Scoring iterations: 6
confint(logreg.fit4)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -10.3302875 -8.48849216
## Smokers 0.3348803 0.46441952
## Physically_Inactive 0.1710067 0.24302822
## Frequent_Mental_Distress -0.4983124 -0.28665941
## Diabetic 0.2710860 0.41709065
## Uninsured -0.1047575 -0.06473489
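Because the coefficients are on the log-odds scale, exponentiating them and their confidence limits gives odds ratios, which are easier to interpret; a quick sketch:
# odds ratios with profile-likelihood confidence intervals
round(exp(cbind(OR = coef(logreg.fit4), confint(logreg.fit4))), 3)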
MODEL ASSUMPTIONS CHECK 1
We choose the receiver operating characteristic (ROC) curve as our first measure of classifier performance.
#--glm ROC--
logreg.pred4<-predict(logreg.fit4, newdata = train4.x, type = "response")
pred <- prediction(logreg.pred4, train4.y)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
#Plot glm ROC
plot(roc.perf,main="GLM")
abline(a=0, b= 1) #Ref line indicating poor performance
text(x = .25, y = .75,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
Our area under the curve (AUC) is approximately 0.89. Since this is above 0.8 (a common rule of thumb), our model does a good job of discriminating between obese and not-obese counties.
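The curve above is computed on the training data; the same calculation on the held-out test set gives an honest estimate, sketched below:
# repeat the ROC/AUC calculation on the held-out test set
logreg.pred.test <- predict(logreg.fit4, newdata = test4, type = "response")
pred.test <- prediction(logreg.pred.test, test4$Obese_Class)
auc.test <- performance(pred.test, measure = "auc")@y.values[[1]]
round(auc.test, 3)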
MODEL ASSUMPTIONS CHECK 2
A logistic regression model provides a better fit to the data if it demonstrates an improvement over a model with fewer predictors. We use the likelihood ratio test: we create a reduced model with two key predictors and compare it against our final model (the "full model") with five predictors.
logreg.fitless4 <- glm(Obese_Class ~ Smokers+Physically_Inactive, data=data4, family=binomial) # reduced model
anova(logreg.fit4,logreg.fitless4, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Obese_Class ~ Smokers + Physically_Inactive + Frequent_Mental_Distress +
## Diabetic + Uninsured
## Model 2: Obese_Class ~ Smokers + Physically_Inactive
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3125 2452.5
## 2 3128 2713.0 -3 -260.51 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ho: The reduced model is favored (the additional predictors do not improve the fit). Ha: The reduced model is not favored (the full model provides a better fit).
We reject Ho. With an alpha of 0.05, the observed difference in deviance (260.51 on 3 degrees of freedom) is statistically significant, with a p-value < 2.2e-16. The evidence suggests the full model, logreg.fit4, is favored.
hoslem.test(logreg.fit4$y, fitted(logreg.fit4), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: logreg.fit4$y, fitted(logreg.fit4)
## X-squared = 9.2499, df = 8, p-value = 0.3217
Ho: The model fits the data. Ha: The model does not fit the data. Applying the Hosmer-Lemeshow test, we fail to reject the null hypothesis with a p-value of 0.3217, so there is no evidence of lack of fit.
# McFadden's Pseudo R-squared
ll.null <- logreg.fit4$null.deviance/-2
ll.proposed <- logreg.fit4$deviance/-2
(ll.null - ll.proposed) / ll.null # the R-squared is basically the overall effect size
## [1] 0.3881248
1-pchisq(2*(ll.proposed - ll.null),df=(length(logreg.fit4$coefficients)-1))
## [1] 0
McFadden's pseudo R-squared (about 0.39) serves as the overall effect size for the model. The accompanying p-value is effectively zero, so this R-squared is not due to chance.
Our final fitted model for Problem 1 will be in the paper.
========================================================================================
OBJECTIVE 2:
Using the simple logistic regression model as a baseline, we fit additional competing models to try to improve the prediction performance metrics.
First, we obtain the metrics for our base model using 10-fold cross-validation.
# Set desired data set
train=train4
test=test4
# 10 fold cv
control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
# Model
logreg.fit <- train(Obese_Class ~ ., data=train, method="glm", family=binomial(),trControl=control)
logreg.fit
## Generalized Linear Model
##
## 2504 samples
## 5 predictor
## 2 classes: 'Not Obese', 'Obese'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2254, 2255, 2253, 2253, 2253, 2253, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8142424 0.5752564
# predict on test
logreg.pred <- predict(logreg.fit,test)
# results
confusionMatrix(logreg.pred,test$Obese_Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Obese Obese
## Not Obese 139 39
## Obese 63 386
##
## Accuracy : 0.8373
## 95% CI : (0.8061, 0.8654)
## No Information Rate : 0.6778
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6155
##
## Mcnemar's Test P-Value : 0.02277
##
## Sensitivity : 0.6881
## Specificity : 0.9082
## Pos Pred Value : 0.7809
## Neg Pred Value : 0.8597
## Prevalence : 0.3222
## Detection Rate : 0.2217
## Detection Prevalence : 0.2839
## Balanced Accuracy : 0.7982
##
## 'Positive' Class : Not Obese
##
Let's add an interaction term to see if we can improve the model. We add the "partiers" term, Smokers*Excessive_Drinking, which means Excessive_Drinking is also added back into the model as a main effect. Our complex model is shown below.
log(p/(1-p)) = logit(p) = B0 + B1(Smokers) + B2(Physically_Inactive) + B3(Frequent_Mental_Distress) + B4(Diabetic) + B5(Uninsured) + B6(Excessive_Drinking) + B7(Smokers*Excessive_Drinking)
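Note that the Obese_Class ~ . formula in the next chunk adds Excessive_Drinking as a main effect only; if the interaction term in the equation above is to be estimated, it must appear in the formula explicitly. A minimal sketch on the full data set, under that assumption:
# hypothetical check of the Smokers x Excessive_Drinking interaction term
data5 <- data3[,-c(7,9:11)] # the five significant predictors plus Excessive_Drinking
glm.int <- glm(Obese_Class ~ . + Smokers:Excessive_Drinking, data = data5, family = binomial)
summary(glm.int)$coefficients["Smokers:Excessive_Drinking", , drop = FALSE]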
# Add Excessive Drinking to a New Data Set
data5 <- data3[,-c(7,9:11)]
# build a training & test data set with complex model data
samplesize=nrow(data5)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train = data5[train_indices,] # random training data
test = data5[-train_indices,] # random test data
# 10 fold cv
control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
# Model
logreg.fit <- train(Obese_Class ~ ., data=train, method="glm", family=binomial(),trControl=control)
logreg.fit
## Generalized Linear Model
##
## 2504 samples
## 6 predictor
## 2 classes: 'Not Obese', 'Obese'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2254, 2254, 2254, 2253, 2253, 2254, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8162948 0.5722829
# predict on test
logreg.pred <- predict(logreg.fit,test)
# results
confusionMatrix(logreg.pred,test$Obese_Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Obese Obese
## Not Obese 158 45
## Obese 66 358
##
## Accuracy : 0.823
## 95% CI : (0.7908, 0.8521)
## No Information Rate : 0.6427
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6063
##
## Mcnemar's Test P-Value : 0.05765
##
## Sensitivity : 0.7054
## Specificity : 0.8883
## Pos Pred Value : 0.7783
## Neg Pred Value : 0.8443
## Prevalence : 0.3573
## Detection Rate : 0.2520
## Detection Prevalence : 0.3238
## Balanced Accuracy : 0.7968
##
## 'Positive' Class : Not Obese
##
# reset dataset
data <- read.csv("./Data/CHR_Data.csv")
# remove the id numbers, need to remove the state and county because that doesn't work with trees
# remove Premature_Death_Rate since we are interested in "living type" predictors
data <- data[,-c(1:3,4)]
# create obese binary classification: a county is labeled Obese when its adult obesity rate (% of adults with BMI >= 30) is 30% or higher
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not Obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))
# remove numeric obese, and ethnic variables
data <- data[,-c(2,10,15:20)]
# print data before outlier removal
plot(data$Unemployed)
plot(data$Severe_Housing_Problems)
# remove outliers
data <- data[!rowSums(data[-c(1:9,11:13)] > 20, na.rm = TRUE),] # removed unemployment outliers for migrant farming counties (column 10 = Unemployed)
data <- data[!rowSums(data[-c(1:10,12:13)] > 50),] # removed housing prob outliers for poor Alaska counties
# print data after outlier removal
plot(data$Unemployed)
plot(data$Severe_Housing_Problems)
# impute variables with their medians, prefer medians in the event there is any skew
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))
# variance-covariance matrix of the numeric predictors (diagonal = variances)
round(cov(data[1:12]), 2)
## Smokers Physically_Inactive Excessive_Drinking
## Smokers 12.54 11.45 -5.29
## Physically_Inactive 11.45 27.11 -9.61
## Excessive_Drinking -5.29 -9.61 9.93
## Frequent_Mental_Distress 5.34 5.35 -3.86
## Frequent_Physical_Distress 6.77 6.66 -5.03
## Diabetic 5.27 9.56 -5.00
## Insufficient_Sleep 8.58 10.10 -6.47
## Uninsured 3.35 6.41 -6.15
## Some_College -21.67 -32.61 18.52
## Unemployed 3.34 3.22 -2.62
## Severe_Housing_Problems 1.89 -5.12 -0.48
## Rural 22.62 59.63 -23.76
## Frequent_Mental_Distress Frequent_Physical_Distress
## Smokers 5.34 6.77
## Physically_Inactive 5.35 6.66
## Excessive_Drinking -3.86 -5.03
## Frequent_Mental_Distress 3.61 4.21
## Frequent_Physical_Distress 4.21 5.71
## Diabetic 3.06 3.53
## Insufficient_Sleep 5.40 6.44
## Uninsured 3.42 5.56
## Some_College -13.19 -18.00
## Unemployed 2.49 3.07
## Severe_Housing_Problems 2.89 3.70
## Rural 5.46 9.05
## Diabetic Insufficient_Sleep Uninsured Some_College
## Smokers 5.27 8.58 3.35 -21.67
## Physically_Inactive 9.56 10.10 6.41 -32.61
## Excessive_Drinking -5.00 -6.47 -6.15 18.52
## Frequent_Mental_Distress 3.06 5.40 3.42 -13.19
## Frequent_Physical_Distress 3.53 6.44 5.56 -18.00
## Diabetic 6.22 5.93 2.43 -14.90
## Insufficient_Sleep 5.93 16.94 4.11 -23.12
## Uninsured 2.43 4.11 26.60 -30.12
## Some_College -14.90 -23.12 -30.12 135.56
## Unemployed 2.38 4.36 1.83 -10.62
## Severe_Housing_Problems -0.79 5.92 5.32 -3.67
## Rural 25.02 -14.94 26.60 -117.44
## Unemployed Severe_Housing_Problems Rural
## Smokers 3.34 1.89 22.62
## Physically_Inactive 3.22 -5.12 59.63
## Excessive_Drinking -2.62 -0.48 -23.76
## Frequent_Mental_Distress 2.49 2.89 5.46
## Frequent_Physical_Distress 3.07 3.70 9.05
## Diabetic 2.38 -0.79 25.02
## Insufficient_Sleep 4.36 5.92 -14.94
## Uninsured 1.83 5.32 26.60
## Some_College -10.62 -3.67 -117.44
## Unemployed 3.84 3.11 3.83
## Severe_Housing_Problems 3.11 20.93 -52.75
## Rural 3.83 -52.75 991.26
sum(diag(cov(data[1:12]))) # total variance
## [1] 1260.256
#--EDA end--
# split data into train and test
samplesize=nrow(data)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train = data[train_indices,] # random training data
test = data[-train_indices,] # random test data
#--LDA--
lda <- lda(Obese_Class~.,data = train)
#lda confusion matrix
lda_prd<-predict(lda, newdata = test)$class
lda_cm<-table(lda_prd,test$Obese_Class)
lda_cm
##
## lda_prd Not Obese Obese
## Not Obese 129 26
## Obese 72 400
confusionMatrix(lda_prd, as.factor(test$Obese_Class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Obese Obese
## Not Obese 129 26
## Obese 72 400
##
## Accuracy : 0.8437
## 95% CI : (0.8129, 0.8712)
## No Information Rate : 0.6794
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6181
##
## Mcnemar's Test P-Value : 5.476e-06
##
## Sensitivity : 0.6418
## Specificity : 0.9390
## Pos Pred Value : 0.8323
## Neg Pred Value : 0.8475
## Prevalence : 0.3206
## Detection Rate : 0.2057
## Detection Prevalence : 0.2472
## Balanced Accuracy : 0.7904
##
## 'Positive' Class : Not Obese
##
#--QDA--
qda <- qda(Obese_Class~.,data = train)
#qda confusion matrix
qda_prd<-predict(qda, newdata = test)$class
qda_cm<-table(qda_prd,test$Obese_Class)
qda_cm
##
## qda_prd Not Obese Obese
## Not Obese 128 40
## Obese 73 386
confusionMatrix(qda_prd, as.factor(test$Obese_Class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Obese Obese
## Not Obese 128 40
## Obese 73 386
##
## Accuracy : 0.8198
## 95% CI : (0.7874, 0.8491)
## No Information Rate : 0.6794
## P-Value [Acc > NIR] : 1.699e-15
##
## Kappa : 0.5675
##
## Mcnemar's Test P-Value : 0.00261
##
## Sensitivity : 0.6368
## Specificity : 0.9061
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.8410
## Prevalence : 0.3206
## Detection Rate : 0.2041
## Detection Prevalence : 0.2679
## Balanced Accuracy : 0.7715
##
## 'Positive' Class : Not Obese
##
#--lda ROC--
ldaprd<-predict(lda, newdata = train)$posterior
#correcting for the way lda creates predicted probabilities
ldaprd<-ldaprd[,2]
pred <- prediction(ldaprd, train$Obese_Class)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
#Plot lda ROC
plot(roc.perf,main="LDA")
abline(a=0, b= 1) #Ref line indicating poor performance
text(x = .25, y = .75,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
#--qda ROC--
qdaprd<-predict(qda, newdata = train)$posterior
#correcting for the way qda creates predicted probabilities
qdaprd<-qdaprd[,2]
pred <- prediction(qdaprd, train$Obese_Class)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
#Plot qda ROC
plot(roc.perf,main="QDA")
abline(a=0, b= 1) #Ref line indicating poor performance
text(x = .25, y = .75,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
#have to reimport and reimpute some data. Annoying, apologies.
datarf <- alldata[-c(1:4)]
datarf$Unemployed <- with(datarf,impute(Unemployed,median))
datarf$Uninsured <- with(datarf,impute(Uninsured,median))
datarf$Graduation_Rate <- with(datarf,impute(Graduation_Rate,median))
# create obese binary classification: a county is labeled Obese when its adult obesity rate (% of adults with BMI >= 30) is 30% or higher
datarf$Obese_Class <- "Obese"
datarf[datarf$Obese < 30, "Obese_Class"] <- "Not_Obese"
datarf$Obese_Class <- as.factor(as.character(datarf$Obese_Class))
datarf <- datarf[,-2] # remove numeric Obese
# na_count <-sapply(datarf, function(y) sum(length(which(is.na(y)))))
# na_count <- data.frame(na_count)
datarf <- datarf[,c(20,1:19)]
Obese.full.forrest <- randomForest(Obese_Class~., data=datarf, importance = TRUE)
#importance(Obese.full.forrest) #Variable importance for placement order (Forward, Backward, Stepwise)
varImpPlot(Obese.full.forrest,type=1, main='Random Tree Variable Importance')
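To read the ranking as numbers rather than from the plot, the importance matrix can be sorted; a quick sketch:
# rank predictors by mean decrease in accuracy (type = 1)
imp <- importance(Obese.full.forrest, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]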
# reset dataset
data <- read.csv("./Data/CHR_Data.csv")
# remove the id numbers, need to remove the state and county because that doesn't work with trees
# remove Premature_Death_Rate since we are interested in "living type" predictors
data <- data[,-c(1:4)]
# create obese binary classification: a county is labeled Obese when its adult obesity rate (% of adults with BMI >= 30) is 30% or higher
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not_obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))
# impute variables with their medians, prefer medians in the event there is any skew
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))
data$Graduation_Rate <- with(data,impute(Graduation_Rate))
# remove numeric obese
data <- data[,-2]
#----EDA End----
rand.x <- data[,c(1:19)]
rand.y <- data[,20]
#-- Set parallel processing Begin--
#parallel compute, requires parallel and doParallel libraries
#create parallel cluster, leave one out so you don't lock your computer
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
# build a training & test data set
samplesize=nrow(data)
train_percent = .8
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
train = data[train_indices,] # random training data
test = data[-train_indices,] # random test data
#-- Random Forest Search Parms --
#Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random", allowParallel = TRUE)
set.seed(7)
mtry <- sqrt(ncol(rand.x))
rf_random <- train(Obese_Class~., data=data, method="rf", metric="Accuracy", tuneLength=15, trControl=control)
print(rf_random)
## Random Forest
##
## 3136 samples
## 19 predictor
## 2 classes: 'Not_obese', 'Obese'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2822, 2823, 2822, 2823, 2822, 2823, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8350453 0.6194531
## 3 0.8379119 0.6268550
## 6 0.8391854 0.6313953
## 7 0.8395076 0.6329911
## 8 0.8390816 0.6319787
## 12 0.8362137 0.6255540
## 15 0.8363185 0.6260080
## 16 0.8348313 0.6231863
## 18 0.8339817 0.6214561
## 19 0.8318579 0.6164541
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
plot(rf_random)
#--Random Forest with mtry = 8 --
#set.seed(7)
mtry <- 8 # fixed the mtry at 8
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Obese_Class~., data=train, method="rf", metric="Accuracy", tuneGrid=tunegrid, trControl=control)
print(rf_default)
## Random Forest
##
## 2508 samples
## 19 predictor
## 2 classes: 'Not_obese', 'Obese'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2258, 2257, 2256, 2257, 2258, 2258, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8403851 0.6301758
##
## Tuning parameter 'mtry' was held constant at a value of 8
rf_default$results$Accuracy #Accuracy
## [1] 0.8403851
rf_default.pred <-predict(rf_default, newdata = test)
confusionMatrix(rf_default.pred, as.factor(test$Obese_Class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not_obese Obese
## Not_obese 158 41
## Obese 69 360
##
## Accuracy : 0.8248
## 95% CI : (0.7928, 0.8538)
## No Information Rate : 0.6385
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6101
##
## Mcnemar's Test P-Value : 0.01004
##
## Sensitivity : 0.6960
## Specificity : 0.8978
## Pos Pred Value : 0.7940
## Neg Pred Value : 0.8392
## Prevalence : 0.3615
## Detection Rate : 0.2516
## Detection Prevalence : 0.3169
## Balanced Accuracy : 0.7969
##
## 'Positive' Class : Not_obese
##
#-- End parallel processing--
#stop the cluster and force R back to 1 core
stopCluster(cluster)
registerDoSEQ()
# fit and plot a single regression tree (Smokers as the response) for illustration
mytree <- tree(Smokers~., datarf)
plot(mytree)
text(mytree,pretty=0)
This tree shows that Frequent Physical Distress is a main split point for the smoking rate used as the response here.
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
library(rpart.plot)
library(RColorBrewer)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
##
## Predict
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
# print a better looking tree
form <- as.formula(Obese_Class ~.)
tree.1 <- rpart(form,datarf)
fancyRpartPlot(tree.1, cex=0.65)
# another cross validation tree check
set.seed(3)
cv.obese=cv.tree(mytree,FUN=prune.tree,method="deviance")
names(cv.obese)
## [1] "size" "dev" "k" "method"
cv.obese
## $size
## [1] 12 11 10 8 7 6 5 4 3 2 1
##
## $dev
## [1] 11302.33 11463.30 11479.29 12564.22 12800.31 14567.66 15009.45 15803.86
## [9] 18495.93 21377.65 39846.63
##
## $k
## [1] -Inf 410.8725 425.5333 520.5120 574.1565 939.1882
## [7] 989.3840 1314.2226 2529.5295 2997.3627 18468.7386
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.obese)
par(mfrow=c(1,1))
plot(cv.obese$size,cv.obese$dev,type="b")
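If a smaller tree is preferred, the CV deviance curve above can guide pruning; a sketch using prune.tree with, for example, eight terminal nodes (an illustrative choice, not a value selected in the original analysis):
# prune the fitted tree to a size suggested by the CV curve and replot it
pruned.tree <- prune.tree(mytree, best = 8)
plot(pruned.tree)
text(pruned.tree, pretty = 0)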
The cross-validated random search found mtry = 7 to be optimal, with mtry = 8 performing nearly identically, which supports the mtry = 8 used above.
We will use KNN as another model.
library(class)
#---reset dataset---
data <- read.csv("./Data/CHR_Data.csv")
#remove the id numbers, need to remove the state and county because that doesn't work with trees
data <- data[,-c(1:3)]
#impute for missing values previously identified in EDA
data$Unemployed <- with(data,impute(Unemployed,median))
data$Uninsured <- with(data,impute(Uninsured,median))
data$Graduation_Rate <- with(data,impute(Graduation_Rate))
# create obese binary classification: a county is labeled Obese when its adult obesity rate (% of adults with BMI >= 30) is 30% or higher
data$Obese_Class <- "Obese"
data[data$Obese < 30, "Obese_Class"] <- "Not Obese"
data$Obese_Class <- as.factor(as.character(data$Obese_Class))
# remove Premature_Death_Rate since we are interested in "living type" predictors, remove numeric obese data
data <- data[,-c(1,3)]
# set impute cols to numeric
data$Unemployed <- as.numeric(data$Unemployed)
data$Uninsured <- as.numeric(data$Uninsured)
data$Graduation_Rate <- as.numeric(data$Graduation_Rate)
# normalize data with lapply
data_norm <- data
num.vars <- sapply(data_norm,is.numeric)
data_norm[num.vars] <- lapply(data_norm[num.vars], scale)
#summary(data_norm[1:19])
#------------------
# build a training & test data set
samplesize=nrow(data_norm)
train_percent = .7
train_indices = sample(seq(1,samplesize,length = samplesize),train_percent*samplesize) # get random indices
knn_train = data_norm[train_indices,] # random training data
knn_test = data_norm[-train_indices,] # random test data
# checking group proportions for training bias, should see similar proportions of 0s & 1s in each
proportion <- rbind(table(knn_train$Obese_Class),table(knn_test$Obese_Class))
dimnames(proportion)<- list("Data Set"=c("Train","Test"), "Obese_Class"=c("Not obese","Obese"))
proportion
## Obese_Class
## Data Set Not obese Obese
## Train 738 1457
## Test 324 617
knn.5 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=5)
knn_test$ObesePred = knn.5
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
##
##
## Not Obese Obese
## Not Obese 208 116
## Obese 80 537
##
## Accuracy : 0.7917
## 95% CI : (0.7643, 0.8172)
## No Information Rate : 0.6939
## P-Value [Acc > NIR] : 1.013e-11
##
## Kappa : 0.5262
##
## Mcnemar's Test P-Value : 0.01242
##
## Sensitivity : 0.7222
## Specificity : 0.8224
## Pos Pred Value : 0.6420
## Neg Pred Value : 0.8703
## Prevalence : 0.3061
## Detection Rate : 0.2210
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.7723
##
## 'Positive' Class : Not Obese
##
knn.10 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=10)
knn_test$ObesePred = knn.10
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
##
##
## Not Obese Obese
## Not Obese 210 114
## Obese 67 550
##
## Accuracy : 0.8077
## 95% CI : (0.781, 0.8324)
## No Information Rate : 0.7056
## P-Value [Acc > NIR] : 5.733e-13
##
## Kappa : 0.5588
##
## Mcnemar's Test P-Value : 0.0006282
##
## Sensitivity : 0.7581
## Specificity : 0.8283
## Pos Pred Value : 0.6481
## Neg Pred Value : 0.8914
## Prevalence : 0.2944
## Detection Rate : 0.2232
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.7932
##
## 'Positive' Class : Not Obese
##
knn.20 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=20)
knn_test$ObesePred = knn.20
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
##
##
## Not Obese Obese
## Not Obese 205 119
## Obese 51 566
##
## Accuracy : 0.8193
## 95% CI : (0.7932, 0.8434)
## No Information Rate : 0.7279
## P-Value [Acc > NIR] : 3.405e-11
##
## Kappa : 0.5789
##
## Mcnemar's Test P-Value : 2.767e-07
##
## Sensitivity : 0.8008
## Specificity : 0.8263
## Pos Pred Value : 0.6327
## Neg Pred Value : 0.9173
## Prevalence : 0.2721
## Detection Rate : 0.2179
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.8135
##
## 'Positive' Class : Not Obese
##
knn.50 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=50)
knn_test$ObesePred = knn.50
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
##
##
## Not Obese Obese
## Not Obese 185 139
## Obese 46 571
##
## Accuracy : 0.8034
## 95% CI : (0.7765, 0.8283)
## No Information Rate : 0.7545
## P-Value [Acc > NIR] : 0.000212
##
## Kappa : 0.5327
##
## Mcnemar's Test P-Value : 1.343e-11
##
## Sensitivity : 0.8009
## Specificity : 0.8042
## Pos Pred Value : 0.5710
## Neg Pred Value : 0.9254
## Prevalence : 0.2455
## Detection Rate : 0.1966
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.8025
##
## 'Positive' Class : Not Obese
##
knn.100 = class::knn(knn_train[,c(1:19)],knn_test[,c(1:19)],knn_train$Obese_Class,k=100)
knn_test$ObesePred = knn.100
confusionMatrix(table(knn_test$Obese_Class,knn_test$ObesePred))
## Confusion Matrix and Statistics
##
##
## Not Obese Obese
## Not Obese 170 154
## Obese 45 572
##
## Accuracy : 0.7885
## 95% CI : (0.761, 0.8142)
## No Information Rate : 0.7715
## P-Value [Acc > NIR] : 0.1137
##
## Kappa : 0.491
##
## Mcnemar's Test P-Value : 1.919e-14
##
## Sensitivity : 0.7907
## Specificity : 0.7879
## Pos Pred Value : 0.5247
## Neg Pred Value : 0.9271
## Prevalence : 0.2285
## Detection Rate : 0.1807
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.7893
##
## 'Positive' Class : Not Obese
##
Of the k values tried, k = 20 appears to be the sweet spot, though not by much.
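A more systematic scan over k, sketched below, makes the comparison easier than fitting each k by hand:
# sketch: scan a grid of k values and record test accuracy for each
ks <- c(1, 3, 5, 10, 15, 20, 30, 50, 75, 100)
acc <- sapply(ks, function(k) {
  pred <- class::knn(knn_train[, 1:19], knn_test[, 1:19], knn_train$Obese_Class, k = k)
  mean(pred == knn_test$Obese_Class)
})
data.frame(k = ks, accuracy = round(acc, 4))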