## ACM/ESE 118: Outliers ## February 3, 2011 ## John Bruer rm(list=ls()) #clear the environment library(car) #for outlierTest # NOTE: The library "car" is not in the default installation of R. Before you # run this script, you may need to install this library. ## Alcohol/Tobaco Dataset # Normally we will load data in from a file. Here I've entered it manually. # Note the row.names paramter. This will let you assign labels to the rows # themselves instead of creating a text column in your data.frame. You may # also pass the name of a column in your data file that contains the labels. alctob <- data.frame( cbind( Alcohol = c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02), Tobacco = c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)), row.names = c("North", "Yorkshire", "Northeast", "East Midlands", "West Midlands", "East Anglia", "Southeast", "Southwest", "Wales", "Scotland", "N. Ireland")) alctobwo <- subset(alctob,rownames(alctob)!="N. Ireland") #without North Ireland plot(alctob$Tobacco, alctob$Alcohol, main="Weekly Household Spending on Alcohol vs. Tobacco", xlab="Tobacco Spending (GBP)", ylab="Alcohol Spending (GBP)") #note N. Ireland in the bottom-right # Notice the difference an outlier can make in the correlation cor(alctob) # .22 cor(alctobwo) # .78 # We now want to create two linear models (one without N. Ireland) lmwith<-lm(Alcohol ~ Tobacco, data=alctob) lmwo<-lm(Alcohol ~ Tobacco, data=alctobwo) # Note the large difference in the fit lines. This is why we call N. Ireland # an influential observation. It has greatly influenced our fit estimates. abline(lmwith,lty="dashed") abline(lmwo) # When we look at the summary statistics for both models, we see striking # differences. Pay attention to the significance level of the slope estimate # as well as the R-squared value. summary(lmwith) summary(lmwo) # The conservative outlier test that we talked about in class uses the # Bonferonni inequality to calculate the p-values we associate with the # Student's-t test. In R, we can use the outlierTest command to perform this # test on our model. Remember that when we test for influence, we are testing # the effect of an observation on model coefficients. Therefore we need to # give the outlierTest command a linear model as its input. outlierTest(lmwith) # We can also use R to calculate Cook's distance. Remember that generally we # label any observation with Cook's distance greater than 1 as influential. cooks.distance(lmwith) # Finally, one of the easier ways to evaluate our residuals and look for # for influential points is through plots. In R, the plot command takes a # special form when you pass it an lm object (see ?plot.lm for all of the # details). Here we want to focus on three of the plots available through # the command. # # The first one displays the residuals vs. the fitted values # we use this to evlauate the mean, variance and correlation of residuals. # If our assumptions of constant variance and uncorrelated residuals are # violated we may be able to correct this with a variance-stabilizing # transformation. # # The second plot helps us check the normality of the residuals. If the # residuals are indeed normal, they should fall along the dashed line. # Remember that the normality assumption for our errors allows us to determine # the standard errors of our coefficients and predictions. # # The final plot will display our residuals vs. their leverage. The dashed red # lines are level curves that denote a particular value of Cook's distance. # We will pay attention to points lying beyond the distance of 1. Notice that # when we have data with row labels, the points will be labeled with their # names. Otherwise, the row number will be shown. plot(lmwith, which=c(1,2,5)) # Some of you have asked about the red line that appears in these plots. It is # a LOWESS smoother that may help find patterns in the data. For this class, # pay less attention to the line and use your own eyes to look for the # patterns we have discussed.