## Introduction to R ## ACM/ESE 118, Winter 2011 ## January 7, 2011 ## John Bruer ## ## R Basics ## # In R you can use the <- operator to assign values to objects. # (Note that = appears to work as well, but convention is to use <-.) # We can make a simple assignment, x<-2 # and view the contents of the object we just created: x # We can create vectors such as x<-1:10 #Note that : gives the sequence of numbers as in MATLAB # If we want a more complicated sequence, we can use the command seq: y<-seq(1,20,length=10) z<-seq(1,20,2) # and see the results: y z # If we wanted to get more information on the seq command, we could use # the help feature in R: help(seq) ?seq #NOTE: ? is a shortcut for help # Let's take a closer look at how we can store data in R. The basic data # types are vectors, matrixes and data.frames. So far we have been creating # vectors. To specify a vector explicitly, we use: x<-c(4,8,15,16,23,42) x[5] # retreives the fifth element x[-5] # retreives every element EXCEPT the fifth x[c(1,3)] # retreives the first and third elements # We can also create matrices: y<-matrix(c(11,21,31,12,22,32),3,2) y # observe that we have created a 3x2 matrix that we have filled # column by column y[2] # retrieves the second element (remember column by column!) y[1,] # retrieves the first row y[,1] # retreives the first column y[3,2]# retreives the element in third row of the second column y[-2,]# retreives all but the second row # If we create another vector, z<-c(1,1,2,3,5,8) c(x,z) # we can combine it with the vector x to make a new vector cbind(x,z) # or a matrix (remember we fill column by column). # R will allow us to do basic arithematic. For example we can 2+2 # add numbers x+z # add vectors y+y # add matrices y*y # multiply matrices COMPONENT-wise! (be careful) t(y) # matrix transposition y%*%t(y) # true matrix multiplication ## ## Basic Statistics ## # R is built for statastical work. We can use basic stats functions mean(x) # returns the arithmetic mean of vector x sd(x) # returns the standard deviation var(x) # returns the variance summary(x) # gives summary statistics including quartiles # We also have access to functions that deal with distributions. These follow # a consistent naming convention that consist of a letter (d, p, q or r) # and the name of a distribution. For the standard normal distribution: # dnorm(x) gives the density at x (PDF) # pnorm(q) evaluates the CDF at q # qnorm(p) evaluates the inverse CDF at p # rnorm(n) generates a vector of n random observations # Note the density, distribution and quantile functions also accept vectors. # Also, if we wanted to specify a normal distribution with a given mean # and standard deviation, we could use # dnorm(x,mean=1,sd=2) # or also # dnorm(x,1,2) # The valid distributions and parameters are as follows # Distribution R Name Parameters # beta beta shape1, shape2 # binomial binom size, prob # Cauchy cauchy location, scale # chi-squared chisq df # exponential exp rate # F f df1, df2 # gamma gamma shape, rate # geometric geom p # hypergeometric hyper m, n, k # log-normal lnorm meanlog, adlog # logistic logis location, scale # negative binom. nbinom size, prob # normal norm mean, sd # Poisson pois lambda # Student's-t t df # uniform unif min, max # Weibull weibull shape, scale # As an example, let's calculate the critical values of the Student's-t # distribution with 5 degrees of freedom and a 95% confidence level: qt( c(0.025,1-0.025), df=5 ) # gives us the upper and lower critical # values for a two-sided t-test. ## ## Introduction to data.frames ## # What we really want to do in R is manipulate real data. That's were # data.frames are useful. # We can create a data.frame d<-data.frame(cbind(x,z)) d # shows us that we have created a data.frame with two # columns/variables. names(d) # returns the names of the variables in the data.frame d$x # returns the vector of the variable called x fix(d) # will allow us to edit d visually # While we can create data.frames in this way, most of the time we will want # to load in some dataset that we wish to analyze. We can load in a dataset # from a file. First let's set our working directory using the command setwd # so that we do not have to type a pathname everytime we want to access the # filesystem. You may choose to create a directory anywhere you want. # For example you may use # setwd("C:/Users//Rdata") in Windows Vista or later # setwd("/Users//Rdata") in Mac OS X # On my machine I have setwd("Z:/Documents/Caltech/Teaching/ACM118-WI11/Rdata") # Now we can load a dataset from a file in this directory: forbes<-read.table("forbes.dat") # If we take a look at the created data.frame forbes # we see that something is wrong. R has imported the data, but has interpreted # the variable names as the first row of our data. To fix this, we can tell R # that our data includes a header: forbes<-read.table("forbes.dat",header=T) # and verify that our variables are named correctly: names(forbes) # Instead of downloading datasets that are needed for the course, we can use # a neat trick. For example, to load in the dataset "amazon.dat" from your # first homework assignment, you can use amazon<- read.table("http://www.gps.caltech.edu/~tapio/acm118/Hw/amazon.dat", header=T) # Note that as of 1/7/11, the other files needed for the first homework # assignment have extraneous text in the header line that will need to be # removed prior to importing. # Returning ot our forbes dataset, we see that there are two variables # "Tb" and "logP". Before we can start working with this dataset, though, we # need to know what the data actually represent! This is important for this # course---and in general. In this case, we are using a common sample dataset # on the boiling point of water as it relates to barometric pressure. It was # published in 1857 by James David Forbes and consists of 17 measurements he # made in the Alps. The variables are # Tb - boiling point in degrees F # logP - logarithm of pressure in inches Hg # NOTE: The original data includes pressure in inches Hg, but even though I'm # not a physicist, I can tell that these values are not log inHg. Let's just # assume they are because it won't matter for the rest of our tutorial. # As before, we can access the variable Tb with forbes$Tb # However, this could get cumbersome. Instead if we use attach(forbes) # Then we can reference the variables with their names alone Tb logP # As with matrices, we can extract specific data from this set forbes[,2] # returns the second column (equivalent to logP) forbes[4,] # returns the fourth row # Note that we can also use Boolean conditions such as forbes[Tb>=200,] # returns all rows where Tb > 200 # We can use our statistcal commands on a data.frame too mean(forbes) # returns the mean of each variable summary(forbes) # gives summary statistics for each variable # We can transform data. Let's convert our temperature measurments into # degrees Celsius and create a new data.frame: TbC<-(Tb-32)*(5/9) # create a vector of temps. in deg. C forbes.c<-data.frame(TbC,logP) # create a new data.frame w/ TbC and logP # Or we can use cbind to combine data into one data.frame: forbes.all<-cbind(forbes,TbC) ## ## Basic Plotting ## # Let's say that we want to look at a scatter plot of log pressure vs. # boiling point. We can simply use the plot command plot(logP,Tb) # gives us a plot with Tb on the x-axis # and logP on the y-axis # You should give all of your plots descriptive titles and axes labels: plot(logP,Tb, main="Boiling point vs. logarithmic barometric pressure", xlab="Barometic pressure (log inches Hg)", ylab="Boiling point (degrees F)") # Plot has many more options including point/line type, colors. Refer to the # help files using ?plot to learn more. # We can create histograms using hist: hist(rnorm(1000)) # If we wish to have the area of the histogram normalized to 1, we specify hist(rnorm(1000),probability=T) # Note that R has a time-saving feature that will let you set parameters # using just enough characters to uniquely identify the parameter. So instead # of typing probability, we can type hist(rnorm(1000),pr=T) # POP QUIZ: What's missing here? An approriate title and axis labels. Get in # the habit of properly labeling your plots. ## ## Simple Linear Regression ## # Looking back at the scatter plot of our variables Tb and logP, we see strong # evidence of a linear relationship between the variables. Let's explore this # by fitting a linear model to the data lm(Tb ~ logP) # where ~ indicates that we regress Tb on logP. Note that the model inclues an # intercept coefficient by default. # We see that the output from lm is quite basic. However, we can store all of # the linear model data into an object of type lm: fit<-lm(Tb ~ logP) # If we simply type fit # on the command line, we get the same output as before. However if we type names(fit) # we can get some idea of the kinds of data stored in the lm object. # For example, to compute residual mean squared error, we would calculate # MSE = Residual sum of squares (RSS) / Residual degress of freedom. # By taking a look at the variables stored in fit, we see that we can # calculate RMSE as follows: sum((fit$residuals)^2)/fit$df.residual # or to save typing, we can do the trick we did earlier with function # parameters: sum((fit$resid)^2)/fit$df.resid # While we could continue to calculate other statistics of our regression # model this way, it is time-consuming. Like vectors and data.frames, lm # objects also have an implementation for the summary command. If we type summary(fit) # we are presented with much more information about our linear model. (You can # refer to ?summary.lm for details on everything that is presented. ) We can # see that the summary presents a value for residual standard error. This is # simply the sqaure root of the MSE. To confirm, we can type .4221^2 # which is roughly what we calculated earlier. ## ## Plotting Linear Models ## # If we want to vizualize our linear model, we can take our scatter plot from # earlier plot(logP,Tb, main="Boiling point vs. logarithmic barometric pressure", xlab="Barometic pressure (log inches Hg)", ylab="Boiling point (degrees F)") # and add in the fitted line abline(fit) # We also commonly want to see the residual values vs. the fitted ones plot(fit$fitted,fit$resid, main="Residuals vs. fitted values for model fit", xlab="Fitted values", ylab="Residuals") # (Do you think there might be an outlier in the data?) # Often we also include a horizontal line at y=0. We can add this by using # abline and specifying the intercept and slope manually: abline(0,0,lty="dashed") # Note that lty specifies the line type. You can refer to ?par for more # graphics parameters that can be applied to most plots. # It is also useful to know that lm implements the plot function. By typing plot(fit) # R will generate several plots including the residuals vs. fitted values that # we created above. At this point, don't worry what all the different plots # mean. Generally, they help us visually judge our goodness-of-fit and detect # the presence of outliers. Note that if you just want the Q-Q plot which # appeared second in the sequence, you can type plot(fit,2) ## ## Confidence Intervals and Predictions for Linear Models ## # Since the model parameters we have calculated are estimates, we may want to # find a confidence interval for them. Using the mehtods above, you should be # able to perform this calculation manually. However, since we eill end up # doing this a lot, it will be beneficial to learn how to use confit: confint(fit) # returns 95% confidence intervals for each parameter confint(fit,level=0.9) # returns 90% confidence intervals confint(fit,"logP",level=0.9) # returns 90% CI for logP only # We can also find prediction intervals using the predict command predict(fit,data.frame(logP=131:148),interval="prediction") # where fit is our model, the data.frame represents the values of logP we want # prediction intervals for. (Also see ?predict.lm for more options.) The # output gives the model's predicted value with lower and upper ranges for the # prediction. # We can also view our predictions graphically forbespred<-predict(fit,data.frame(logP=131:148),interval="prediction") matplot(131:148,forbespred, lty=c(1,2,2), type="l", ylab="Predicted boiling point (in degrees F)", xlab="Barometric pressure (in log inHg)") ## This should be enough to get you started using R!