Sunday, June 22, 2014

Introduction to R: Removing NA Values

We discussed subsetting objects and elements in our previous topic. Now we will discuss how to remove NA (missing) values in R. This is one of the most common operations in data analysis, because most raw data contain missing values. The principle is to create a logical vector that locates the NA values, then negate it with the ! operator inside the [ ] subsetting brackets so that only the non-missing elements are kept.

Let's proceed to the exercise:
Say you have a vector x<-c(1,NA,3,NA,NA,6,NA,7,8) and you want to remove the missing values. As covered in our previous discussion on missing values, the is.na() function locates the missing values in a vector. We create a logical vector z<-is.na(x), which is TRUE at every position where x is NA. To remove the NA values, we subset x with the negated vector, in the form x[!z].
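The exercise above can be typed into the console as the short sketch below. The name x_clean is only an illustrative choice for storing the result; it does not appear in the original exercise.

```r
# The vector from the exercise, with four missing values
x <- c(1, NA, 3, NA, NA, 6, NA, 7, 8)

# Logical vector: TRUE wherever x is NA
z <- is.na(x)
print(z)

# Negate z with ! so that only the non-missing elements are selected
x_clean <- x[!z]   # x_clean is a hypothetical name used for illustration
print(x_clean)
```

Note that x itself is unchanged; x[!z] returns a new, shorter vector.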





















REMOVING NA VALUES IN MULTIPLE VECTORS

If you have more than one vector with missing values, you can use the complete.cases() function. For example:

You have two vectors x<-c(2,NA,4,NA,6,NA,8,NA) and y<-c("b", NA,"d", NA,"f",NA,"h",NA). The expression z<-complete.cases(x,y) creates a logical vector that is TRUE only at the positions where neither x nor y is missing. Subsetting each vector with it, as x[z] and y[z], gives an output with the NA values removed.
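A minimal sketch of this example follows. The names x_clean and y_clean are illustrative only and are not part of the original exercise.

```r
# Two vectors of the same length, with NA values at the same positions
x <- c(2, NA, 4, NA, 6, NA, 8, NA)
y <- c("b", NA, "d", NA, "f", NA, "h", NA)

# TRUE only where both x and y have a value
z <- complete.cases(x, y)
print(z)

# Keep only the positions that are complete in both vectors
x_clean <- x[z]   # hypothetical names used for illustration
y_clean <- y[z]
print(x_clean)
print(y_clean)
```

Because the same logical vector z is used for both, the remaining elements of x and y stay paired position by position.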






















REMOVING NA VALUES FROM A DATA FRAME

In most cases our data are stored in tabular form, such as in .xls or .csv files. For this exercise we will use a .csv file to create a data frame in R. The data in this .csv file, titled "air quality", come from a quiz used by the Johns Hopkins School of Public Health and Biostatistics; download the file here.

Step 1: In your R console click File.



Step 2: A menu box will open, click Local Disk (C:)



Step 3: Click the directory Users.


Step 4: Click another directory named Users.



Step 5: The file you downloaded should be in the Download directory. Click this folder, then click OK.


Step 6: To check whether the file is in the Download directory, type list.files(). The air quality file should appear in the output.



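Step 7: To read the data from the file, type airquality<-read.csv("airquality.csv") so that the data frame is stored under the name airquality, then type airquality to display it. The R console will show a data frame which consists of 153 rows and 6 columns (Ozone, Solar.R, Wind, Temp, Month, Day).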



Step 8: Say we only want to analyse the first 10 rows; then we type airquality[1:10,]. Recall from our previous topic on matrices that rows are selected with the syntax [row,] while columns are selected with the syntax [,column].



Step 9: As you may have observed, there are NA values in the data frame. To locate the rows that contain no NA values, we type the expression z<-complete.cases(airquality). To give an output where the rows with NA values are removed, we type airquality[z,][1:10,].



As you can observe, every row containing an NA value has been dropped, and the following rows with complete data have moved up to take their place, so the output still has 10 rows.
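Steps 7 to 9 can be sketched as a single script. To keep the sketch runnable without the downloaded file, it assumes that R's built-in airquality dataset (from the datasets package) holds the same data, and it writes that dataset to a temporary .csv file before reading it back. The names csv_path, aq, first10 and first10_complete are illustrative and do not come from the steps above.

```r
# Assumption: the built-in datasets::airquality stands in for the downloaded file.
# Write it to a temporary .csv so that read.csv() can be demonstrated.
csv_path <- tempfile(fileext = ".csv")
write.csv(datasets::airquality, csv_path, row.names = FALSE)

# Step 7: read the .csv file into a data frame and check its size
aq <- read.csv(csv_path)
print(dim(aq))

# Step 8: the first 10 rows, some of which contain NA values
first10 <- aq[1:10, ]
print(first10)

# Step 9: TRUE for every row that has no NA in any column
z <- complete.cases(aq)

# Keep only the complete rows, then take the first 10 of those
first10_complete <- aq[z, ][1:10, ]
print(first10_complete)
```

The row names printed for first10_complete are the original row numbers, so gaps in that numbering show which rows were dropped.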
