DATA ANALYTICS: June 2014

Sunday, June 29, 2014

Introduction to R: Saving an Excel File as CSV

In the previous post we talked about loading and writing files in R and covered the read.table() and read.csv() functions. In this post we will learn how to save an Excel file in CSV format.

CSV is short for comma-separated values, a format that stores tabular data as plain text. An Excel file saved in CSV format becomes a text file in which each row is one line and the values in a row are separated by commas.
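As a small illustration of what that text looks like, the sketch below writes a made-up data frame to a temporary CSV file and prints the raw lines. The column names and values are hypothetical and are not the census data used later in this post.

```r
# Hypothetical data, used only to show what CSV text looks like
staff <- data.frame(state = c("NSW", "VIC"), employees = c(120, 95))

csv_path <- tempfile(fileext = ".csv")
write.csv(staff, csv_path, row.names = FALSE)

# Each row becomes one line of text, with commas between the values
csv_lines <- readLines(csv_path)
print(csv_lines)

# Reading the file back gives a data frame again
staff_again <- read.csv(csv_path)
```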

R can, however, read other file formats as well:
read.xls(), from the gdata package, reads Excel files, though it needs a Perl runtime installed on the system. A more involved way of importing Excel files into R is to use loadWorkbook() to load the workbook and readWorksheet() to open the desired worksheet, both from the XLConnect package. You need to install Java for this.

To import Minitab files into R you can use read.mtp(), and for SPSS files there is read.spss(), both from the foreign package. Finally, for CSV files we have read.csv().

Before you can load your csv in the R language you need to save your Excel files to the csv format using the following steps:

a. In this example I have an Excel file which contains the current employee census of Australia.

b. Click the home button, click Save as, click Other formats.

c. Choose the CSV (Comma delimited) format. You also have the option to choose CSV (Macintosh) or CSV (MS-DOS). I prefer to use the comma delimited format.

d. Click Save.

e. If you have multiple sheets, Excel will warn you that only the active worksheet will be saved in CSV format; other worksheets must be saved individually.

f. Another warning will pop up saying that the file may contain features that are not compatible with the comma delimited format. You can ignore this and click Yes.

Wednesday, June 25, 2014

Introduction to R: Loading Files, Writing Files, and the read.table() and read.csv() Function in R

In our previous discussion we talked about ways to remove NA values in R. This time we will talk about the different ways of loading and writing files in R. The following are the principal functions for loading and writing data in R.

For loading tabular data, such as data exported from Excel as text:
read.table()
read.csv()

For loading lines in a text file:
readLines()

For loading an R code file:
source(), which is the inverse of dump(), as we will discuss later.
dget(), which is the inverse of dput(), which will also be discussed later.

For loading files that are saved in the workspace:
load()

For loading a single R object stored in binary form:
unserialize()

For writing tables in R:
write.table(), although a common thing we do is to write the data in the form of a data frame.

For writing lines in R:
writeLines()

For writing R code:
dump()
dput()

For saving objects from the workspace:
save()

For writing a single R object in binary form:
serialize()
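To see how the reading and writing functions pair up, here is a minimal sketch that sends one small object through each pair and back. The data frame scores and the temporary file names are made up for illustration.

```r
# A small object to round-trip through each pair of functions
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# dput() writes R code for one object; dget() reads it back
dput_path <- tempfile(fileext = ".R")
dput(scores, file = dput_path)
from_dget <- dget(dput_path)

# dump() writes code that recreates named objects; source() runs it
dump_path <- tempfile(fileext = ".R")
dump("scores", file = dump_path)
restored <- new.env()
source(dump_path, local = restored)

# save() and load() store named objects in a binary file
rda_path <- tempfile(fileext = ".RData")
save(scores, file = rda_path)
loaded <- new.env()
load(rda_path, envir = loaded)

# serialize() and unserialize() convert one object to and from raw bytes
from_bytes <- unserialize(serialize(scores, connection = NULL))
```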

The read.table() and read.csv() Functions in R

read.csv() is identical to read.table() except for its defaults: read.csv() uses a comma as the separator and assumes the file has a header line, while read.table() by default splits values on white space and assumes there is no header. CSV is a common export format for Excel files.

read.table() is suitable for small or moderately sized data and has the following common arguments:

file is the name of a file or a connection.

header is a logical value indicating whether the file contains a header line.

sep is a string indicating how the values in each line are separated.

colClasses is a character vector indicating the class of each column in the data set.

nrows is the number of rows to read from the data set.

comment.char is a character string indicating the comment character.

skip is the number of lines to skip from the beginning of the file.

stringsAsFactors is a logical value indicating whether character variables should be encoded as factors.
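The sketch below puts these arguments together in one call. The file contents (three people, separated by semicolons) are invented for the example and written to a temporary file first so the code can run on its own.

```r
# Write a small hypothetical file: one comment line, a header, three rows
txt_path <- tempfile(fileext = ".txt")
writeLines(c("# exported for illustration",
             "name;age;member",
             "Ana;31;TRUE",
             "Ben;45;FALSE",
             "Cy;28;TRUE"), txt_path)

people <- read.table(
  file = txt_path,
  header = TRUE,              # first data line holds the column names
  sep = ";",                  # values are separated by semicolons
  colClasses = c("character", "integer", "logical"),
  nrows = 2,                  # read only the first two data rows
  comment.char = "#",         # ignore lines starting with #
  skip = 0,                   # do not skip any leading lines
  stringsAsFactors = FALSE    # keep text columns as character
)
print(people)
```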

As previously said, read.table() is suitable for small or medium-sized data. When you need to analyze large data sets, take the following precautions:

a. First, calculate the memory required to read and store the large data.

To make a rough calculation on the memory needed by your large data follow the formula:

Memory Requirements = number of rows x number of columns x (8 bytes per numeric value)

Total Memory Requirement = 2 x Memory Requirements
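As a worked example of this formula, assume a hypothetical data set of 1,500,000 rows and 120 columns in which every column is numeric. The numbers are invented; only the arithmetic follows the formula above.

```r
# Hypothetical size of the data set
n_rows <- 1500000
n_cols <- 120
bytes_per_value <- 8   # one numeric value takes 8 bytes

memory_bytes <- n_rows * n_cols * bytes_per_value   # 1.44e9 bytes
memory_gb <- memory_bytes / 2^30                    # about 1.34 GB
total_gb <- 2 * memory_gb                           # about 2.68 GB

print(round(c(memory_gb = memory_gb, total_gb = total_gb), 2))
```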

b. Second, remember that R's operation is dependent on the amount of RAM in your computer.

c. Third, it is better to close other applications on the computer to avoid the delay in the R operation.

d. Fourth, use colClasses to specify the class of each column. This saves R from having to work out the classes itself.

e. Fifth, work out the column classes from a small part of the data:

- Read only the first 100 rows.

- Loop through each column using the sapply() function to find its class.

- Pass the resulting classes to colClasses when reading the whole file (stored here in a variable named tabAll).

The format is as follows:

initial<-read.csv("name of file here", nrows=100)

classes<-sapply(initial, class)

tabAll<-read.csv("name of file here", colClasses=classes)

Sunday, June 22, 2014

Introduction to R: Removing NA Values

We have discussed sub-setting objects and elements in our previous topic. Now we will discuss how to remove NA, or missing, values in R. This is a very common operation in data analysis because most raw data have missing values. The principle is to create a logical vector that locates the NA values, negate it with the ! operator, and use the result inside [ ] to keep only the values that are not missing.

Let's proceed to the exercise:

Say you have a vector x<-c(1,NA,3,NA,NA,6,NA,7,8) and you want to remove the missing values. As we saw in our previous discussion on missing values, the is.na() function locates the missing values in a vector. We will create a logical vector z<-is.na(x), which is TRUE wherever x is NA. To remove the NA values we negate z inside the brackets, in the form x[!z].
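Typed into the console, the exercise looks roughly like this (the name clean is only an illustrative choice for the result):

```r
x <- c(1, NA, 3, NA, NA, 6, NA, 7, 8)

# TRUE wherever x is missing
z <- is.na(x)

# Keep only the positions where z is FALSE
clean <- x[!z]
print(clean)   # 1 3 6 7 8
```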

REMOVING NA VALUES IN MULTIPLE VECTORS

If you have more than one vector with missing values, you can use the complete.cases() function. For example:

You have two vectors x<-c(2,NA,4,NA,6,NA,8,NA) and y<-c("b", NA,"d", NA,"f",NA,"h",NA). We will create a logical vector using the expression z<-complete.cases(x,y); it is TRUE only at the positions where neither x nor y is missing. The expressions x[z] and y[z] then give the vectors with the NA values removed.
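A short sketch of this exercise:

```r
x <- c(2, NA, 4, NA, 6, NA, 8, NA)
y <- c("b", NA, "d", NA, "f", NA, "h", NA)

# TRUE only where both x and y have a value
z <- complete.cases(x, y)

x_clean <- x[z]   # 2 4 6 8
y_clean <- y[z]   # "b" "d" "f" "h"
```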

REMOVING NA VALUES FROM A DATA FRAME

In most cases our data are stored in tabular form, such as in .xls or .csv files. For this exercise we will use a csv file to create a data frame in R. The data in this .csv file come from a quiz used by the Johns Hopkins School of Public Health and Biostatistics titled "air quality", download the file here.

Step 1: In your R console click File.

Step 2: A menu box will open, click Local Disk (C:)

Step 3: Click the directory Users.

Step 4: Click another directory named Users.

Step 5: The file you have downloaded should be in the Download directory, click this folder then click OK.

Step 6: To determine if the file is in the Download directory type list.files(). The air quality file should be present.

Step 7: To read the data from the file and store it, type airquality<-read.csv("airquality.csv"). Typing airquality will then show a data frame which consists of 153 rows and 6 columns (Ozone, Solar.R, Wind, Temp, Month, Day).

Step 8: Say we only want to analyse the first 10 rows; then we type airquality[1:10,]. Recall from our previous topic on matrices that rows are selected with the syntax [row,] while columns are selected with the syntax [,column].

Step 9: As you may have observed, there are NA values in the data frame. To remove them we type the expression z<-complete.cases(airquality). To give an output where the rows with NA values are removed we type airquality[z,][1:10,].

As you can observe, all the rows with NA values are dropped and the next complete rows take their place, so the output still has 10 rows.
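If you do not have the csv file at hand, the same steps can be tried on the airquality data set that ships with R in the datasets package. This sketch assumes the built-in data set matches the downloaded file, which has the same 153 rows and 6 columns; the name aq is just a local copy.

```r
# Built-in data set, used here in place of the downloaded csv file
aq <- datasets::airquality
print(dim(aq))   # 153 rows, 6 columns

# First 10 rows, some of which contain NA
first_ten <- aq[1:10, ]

# TRUE for rows with no missing value in any column
z <- complete.cases(aq)

# First 10 rows among the complete ones
first_ten_complete <- aq[z, ][1:10, ]
print(first_ten_complete)
```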

Thursday, June 19, 2014

Introduction to R: Sub-Setting Part II

Previously we discussed sub-setting an object or an element in R. Now we will discuss how to sub-set objects and elements from a list. The principles are the same: the operators [ ], [[ ]] and $ can be used.

Let's proceed to the exercises. Create a list with two elements, name the first element with your first name and assign a value to it with the sequence 1 to 5. Name the second element with your favourite fruit and assign your favourite number as its value.
a. Extract the first element in list form.
b. Extract the first element in sequence form.
c. Extract the second element using $ and [[]] operators.
d. Extract the second element in list form by using the name of the element of interest.

The operator [ ] returns an output with the same class as the original. Since x is a list, the expression x[1] gives an output that is also a list, containing the sequence 1 to 5. The [[ ]] operator returns only the element itself, which is the sequence of numbers 1 to 5.

You can also use the name of the element inside the [ ] operator, as in x["durian"]. Here we use the name "durian" instead of the number index 2.
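One possible set of answers to the exercise is sketched below, using gerard and durian as the element names and 7 as the favourite number, as in the examples in this post.

```r
x <- list(gerard = 1:5, durian = 7)

as_list <- x[1]       # a. a list holding the sequence 1 to 5
as_seq  <- x[[1]]     # b. the sequence itself

fruit_dollar  <- x$durian        # c. second element with $
fruit_bracket <- x[["durian"]]   # c. second element with [[ ]]

by_name <- x["durian"]   # d. second element, kept as a list
```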

The [ ] operator can also be used to extract multiple elements from a list by passing it a vector of indices built with c(). Let's proceed to the exercises:
Create a list with three elements, name the first element with your first name and assign a value into it with the sequence 1 to 5. Name the second element with your favourite fruit and assign your favourite number as its value, name the third element with your favourite animal and assign a value which is equivalent to the number of its legs.
a. Extract the elements 1 and 3 from the list that you have created.

Here we have made the expression x[c(1,3)], 1 being the number index of the first element which corresponds to the name gerard and 3 being the number index of the third element in the list which corresponds to owl.
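In code, with the same three-element list used in the rest of this post:

```r
x <- list(gerard = 1:5, durian = 7, owl = 2)

# Elements 1 and 3, returned together as a list
picked <- x[c(1, 3)]
print(names(picked))   # "gerard" "owl"
```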

It should be noted, however, that the [[ ]] and the $ operators behave differently when used to retrieve an element from a list. The [[ ]] operator can be used with a computed index, while the $ operator can only be used with a literal name. Let's take the above example:

Here we have made a new variable name<-"gerard", which stores the name of the element gerard=1:5 as a character string. When we type x[[name]], R first evaluates name, which gives the string "gerard", and then looks up that element. Notice that the list x only contains the elements "gerard", "durian" and "owl" and there is no element called "name"; even so, x[[name]] is equivalent to x[["gerard"]].

However, the $ operator takes whatever follows it literally. The expression x$gerard looks for an element named "gerard" in the list, while x$name looks for an element named "name". Thus x$gerard is not equivalent to x$name: typing x$name gives NULL because no element called "name" exists in the original list x<-list(gerard=1:5, durian=7, owl=2).
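A minimal sketch of this difference, assuming the helper variable is called name (as the x$name in the text suggests):

```r
x <- list(gerard = 1:5, durian = 7, owl = 2)

# A variable holding the name of the element we want
name <- "gerard"

computed <- x[[name]]   # name is evaluated first, so this is x[["gerard"]]
literal  <- x$name      # looks for an element called "name": NULL
direct   <- x$gerard    # looks for an element called "gerard"
```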

SUB-SETTING ELEMENTS FROM A LIST NESTED INTO ANOTHER LIST

The [[c(number index, number index)]] form can be used to extract an element from a list that is nested within another list. To illustrate this we go to our next exercise.

Say, we want to extract the number 5 from the expression:

x<-list(gerard=list(1,3,5), durian=list(2,4,6), owl=list(7,8,9)).

As you can observe, the number 5 is the third element of a list named "gerard"; furthermore, the list named "gerard" is the first element of another list, which is "x". To extract the number 5 we need to use the number index method we previously discussed. The number index of the number 5 in the list "gerard" is 3, while the number index of the list "gerard" in the original list "x" is 1.

Using operator [[ c()]], the expression becomes x[[c(1,3)]]. 1 for the number index for "gerard" and 3 for the number index of "5".
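The same extraction written out, together with the equivalent two-step form:

```r
x <- list(gerard = list(1, 3, 5), durian = list(2, 4, 6), owl = list(7, 8, 9))

# First element of x, then third element inside it
nested <- x[[c(1, 3)]]   # 5

# The same thing in two separate steps
two_step <- x[[1]][[3]]  # 5
```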

PARTIAL MATCHING IN R

If a list has an element with an incredibly long name and you want to save time in typing, you can use partial matching, which works with the $ operator and, when asked for, with the [[ ]] operator.

For example, we have a list whose first element is named pneumonoultramicroscopicsilicovolcanosis and holds the sequence 1 to 5, and whose second element has another long name, monosodiumglutamate, and holds the sequence 6 to 9. Since it would take time to type the word pneumonoultramicroscopicsilicovolcanosis every time we want to know its value, we can use the expression x$p.

The x$p expression searches the list for a name that begins with the letter p. This works only when exactly one name starts that way.

Care should be taken because R is case sensitive: x$P gives NULL, and X$p does not refer to our list at all, since X is a different name from x. The [[ ]] operator takes a different approach. Typing x[["p"]] gives NULL because [[ ]] searches for an exact match, and the list x does not have an element named "p". To resolve this we can add the exact=FALSE argument, as in x[["p", exact=FALSE]].
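Here is a short sketch of partial matching with the two long names from the example:

```r
x <- list(pneumonoultramicroscopicsilicovolcanosis = 1:5,
          monosodiumglutamate = 6:9)

partial_dollar <- x$p                      # matches the name starting with p
wrong_case     <- x$P                      # NULL, because R is case sensitive
exact_only     <- x[["p"]]                 # NULL, [[ ]] wants an exact match
partial_double <- x[["p", exact = FALSE]]  # partial matching switched on
```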

Tuesday, June 17, 2014

Introduction to R: Sub-setting Part I

In our previous discussion we talked about factors, missing values, data frames and naming. In this session we will talk about sub-setting elements in R. Sub-setting is useful if you have a manageable set of data; the operation is helpful when you want to know which element is present at a particular location in your lists, vectors or matrices.

There are three operators for sub-setting objects in R:
[ ] is used to extract an object of the same class as the original. By this we mean that if you extract from a list with [ ], the output is a list, and if you extract from a character vector, the output is a character vector.

[[ ]] is used to extract a single element from a list or a data frame. The class of the output is not necessarily the same as the original. It means that if you extract a numeric vector from a list with [[ ]], the output is that numeric vector itself, not a list.

$ is used to extract elements from a list or data frame by name. Remember, in our previous discussion, names are useful to reference an object. Again, the class of the output is not necessarily the same as the original: the output has the class of the extracted element, not of the list or data frame that contained it.

Let's go to some exercises:
1. Express a character vector having the element a to d.
2. Extract the 1st element, 2nd, 3rd and 4th.
3. Extract all the elements other than a.

Solution:
We can actually do this in two ways. First, we can use the numeric index method. In this method R identifies each element of the character vector by its position, so "a" is 1, "b" is 2, "c" is 3 and "d" is 4. If we type x[1], x is the vector of interest and [1] selects its 1st element; if we hit enter it gives us an output of "a".

To solve the next problem, we can use the logical index method. R can compare character strings by their lexical ordering, that is a<b<c<d<e and so forth, and we can use this to build a logical index. Since b, c and d are greater than a, we can let y mark everything that is greater than a. The expression is y<-x>"a". Here x>"a" compares each element of x with "a", and the assignment stores the result of that comparison in a new vector y.

The output for y is a series of logical values: FALSE TRUE TRUE TRUE. The first one is FALSE because "a" is not greater than "a". To see which elements are greater than a, the expression x[y] is used, and the output is the series of letters that are greater than "a".
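Both methods from the solution can be tried with the following sketch:

```r
x <- c("a", "b", "c", "d")

# Numeric index method
first  <- x[1]   # "a"
fourth <- x[4]   # "d"

# Logical index method: everything other than "a"
y <- x > "a"     # FALSE TRUE TRUE TRUE
not_a <- x[y]    # "b" "c" "d"
```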

Next, we will learn how to sub-set a matrix. Matrices can be sub-setted with an index of the form [row, column]. Let's jump to the exercises.
Ex:
a. Construct a 2 by 3 matrix with a number sequence 1 to 6.
b. Extract the element of row 1, column 2.
c. Extract the element of row 2, column 1.

The solution is simple and the expression x[row, column] is used to extract the element of interest. Let's go further, say we want to:
d. Extract all the element of the first row only.
e. Extract all the elements of the second column only.

Remember in our previous discussion on data frames that the number before the comma represents a row and any number after the comma represents a column; so [1,] is row 1, [2,] is row 2 and so on, while [,1] is column 1, [,2] is column 2 and so on. The same goes for extracting all the elements of a row or a column in a matrix, using the expression x[row number,] or x[, column number]. So we have x[1,] to extract all the elements of row 1 and x[,2] to extract all the elements of column 2.
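Exercises a to e can be checked with this sketch. Keep in mind that the matrix is filled column by column.

```r
# 2 by 3 matrix holding 1 to 6, filled column by column
x <- matrix(1:6, nrow = 2, ncol = 3)

r1c2 <- x[1, 2]   # b. row 1, column 2 -> 3
r2c1 <- x[2, 1]   # c. row 2, column 1 -> 2

row1 <- x[1, ]    # d. whole first row -> 1 3 5
col2 <- x[, 2]    # e. whole second column -> 3 4
```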

By default when an element from a matrix is extracted or sub-setted, R usually gives a vector output. But what if we want the output to be a 1x1 matrix, how can we do this? The expression x[row number, column number, drop=FALSE] can be used. So let's go on to some more exercises. Let's take the previous problem, and say we want to:
a. Extract the element of row 1, column 2.
b. Extract the element of row 1, column 2 in a matrix form.

c. Extract the elements of row 1.
d. Extract the elements of row 1 in matrix form.
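A sketch of the answers, showing what drop=FALSE changes:

```r
x <- matrix(1:6, nrow = 2, ncol = 3)

a <- x[1, 2]                 # a. a plain vector holding 3
b <- x[1, 2, drop = FALSE]   # b. a 1x1 matrix holding 3

c_row <- x[1, ]                 # c. a vector: 1 3 5
d_row <- x[1, , drop = FALSE]   # d. a 1x3 matrix

print(dim(b))       # 1 1
print(dim(d_row))   # 1 3
```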

Sunday, June 15, 2014

Introduction to R: Factors, Missing Values, Data frames and Names

In our previous discussion, we learned how to create vectors, matrices and lists in R. Now we will learn about factors, missing values, data frames and names in the R language.

We will first discuss what factors are. Factors are a way to represent categorical data in R. Categorical data can be unordered (nominal), such as gender, color or flavor, or ordered (ordinal), such as 1st, 2nd, 3rd, 4th, 5th. Factors are important in modelling functions such as lm() for linear modelling and glm() for generalized linear modelling. Factors are self-describing, with labels such as male or female for nominal factors and senior level, mid-level or junior level for ordered factors.

A factor can be created in R using the factor() function, whose input is usually a character vector. Say for example we want to create a factor with two levels, yes and no, and then create a frequency table of how many no's and yes's are present.

The table() function is used to determine the frequency of the yes's and no's in our example. The unclass() function shows the integer codes that R stores underneath the factor, 1 being no and 2 being yes, together with the attribute that holds the level names.

R usually determines the baseline level in alphabetical order, so in the yes and no example the baseline is no. During modelling, however, the choice of baseline level can be quite important. To set the order of the levels ourselves we use the levels argument of the factor() function. In our previous example, if you have observed, in the output of unclass(x) the attr(,"levels") line starts with no followed by yes. This is because the letter n comes before y in the alphabet, so the baseline level here is no. But what if we want to have yes as our baseline level?

Here we added the levels=c("yes", "no") argument to the factor() function. The first string, "yes", tells R that the baseline level will be "yes", as can be observed in the Levels output, which now starts with yes: Levels: yes no.
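The factor example can be reproduced with something like the following. The particular sequence of answers is made up; only the two levels come from the text.

```r
answers <- c("yes", "no", "yes", "yes", "no")

# Default: levels are sorted alphabetically, so "no" is the baseline
x <- factor(answers)
counts <- table(x)    # no: 2, yes: 3
codes <- unclass(x)   # integer codes 2 1 2 2 1 plus the levels attribute

# Setting the level order by hand makes "yes" the baseline
y <- factor(answers, levels = c("yes", "no"))
print(levels(y))      # "yes" "no"
```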

Most of the time you will encounter data with missing values. Missing values are a special kind of object in R: NaN ("not a number") marks the result of an undefined mathematical operation, while NA marks a missing value. To test for NA the is.na() function can be used, while for NaN the is.nan() function is used.

NA values can have a class: there is an integer NA, a character NA, a numeric NA, a logical NA and a complex NA. A NaN value is also treated as NA, but an NA value is not NaN, as seen in our example. We have also seen that the output of both tests is a logical vector: TRUE where there is a missing value or an undefined result and FALSE where NA or NaN is not present.
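A small sketch of the two tests on a vector that contains both NA and NaN (the values are chosen only for illustration):

```r
x <- c(1, NA, 3, NaN)

missing_any <- is.na(x)    # FALSE TRUE FALSE TRUE  (NaN counts as NA)
missing_nan <- is.nan(x)   # FALSE FALSE FALSE TRUE (NA is not NaN)

# NA comes in several classes
na_classes <- c(class(NA), class(NA_integer_), class(NA_real_),
                class(NA_character_), class(NA_complex_))
print(na_classes)
```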

Another component in R is its ability to create or read data frames, which are used to store and analyse tabular data. A data frame is represented as a special type of list where every element of the list has the same length. Each element of the list is a column, and the length of each element is the number of rows. Data frames can store a different class of object in each column, unlike matrices, which require every element to be of the same class. R can read data frames with the read.table() function or the read.csv() function. Data frames can also be converted into a matrix with the data.matrix() function.

Aside from reading tables and text or csv files, data frames can also be created in R using the data.frame() function.

In our example we created a tabular data frame with a number sequence 1 to 3 and followed by a logical sequence of False, True, False. The first object number=1:3 is taken as the first column and the logic=c(F,T,F) is taken as the second column. The nrow() and the ncol() function can be used to determine how many rows and columns are there in the expression.
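The data frame described here can be built as follows. TRUE and FALSE are written out in full instead of T and F, which is the safer habit.

```r
x <- data.frame(number = 1:3, logic = c(FALSE, TRUE, FALSE))

n_rows <- nrow(x)   # 3
n_cols <- ncol(x)   # 2

# Converting to a matrix forces every column into one numeric type
m <- data.matrix(x)
print(m)
```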

Aside from data frames, we can write readable code and create self-describing objects in R with the names() function. Say for example we want to create an integer vector from 1 to 5; we will name each integer: cat, dog, ant, bird, tree.

Lists and matrices can also be named as in the following examples:

Naming a list is quite direct: just place the names inside the list() function.

For naming matrices the dimnames() function is used, and it is given a list of two character vectors. A matrix is first constructed by specifying the number sequence and the number of rows and columns in the matrix. This is followed by the dimnames() function; as can be observed, the first two names ("cat", "dog") are taken as row names, while ("ant", "bird") are taken as column names.
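The three kinds of naming might look like this in the console. The list contents and the 2 by 2 matrix are illustrative guesses, since only the names are given in the text.

```r
# Naming the elements of a vector
x <- 1:5
names(x) <- c("cat", "dog", "ant", "bird", "tree")

# Naming the elements of a list directly inside list()
lst <- list(cat = 1, dog = 2, ant = 3)

# Naming the rows and columns of a matrix
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("cat", "dog"), c("ant", "bird"))

print(x["ant"])         # the element named ant, which is 3
print(m["dog", "ant"])  # row dog, column ant, which is 2
```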

Friday, June 13, 2014

Introduction to R: Vectors, Matrices and Lists

Our previous discussion gave a short introduction to R. In this topic we will discuss the basic commands for creating vectors, matrices and lists. To begin with, the syntax typed into the R console is termed an expression, and the "<-" symbol used in writing an expression is termed the assignment operator; it gives a variable an assigned "value".

Example:

Type the expression x <-5 in your R console, press enter, then type x and press enter again. It will show you the value of x. What you did was enter an expression in R.

In our example, the expression x<-5 creates a numeric vector whose first element is the number 5. Our second example is a character vector with the character string My name is Gerard.

Now that you have written your first expression, let's make a sequence of numbers. A number sequence can be created with the colon : operator. Type the variable x in your R console and assign it the value 1:30, then auto-print it by entering x. Your expression should look like:

The command creates a sequence of numbers from 1 to 30. The output is an integer vector. The number in square brackets at the start of each output line is the index of the first element shown on that line: the first line has [1] because it starts with the 1st element, and the second line has [26] because it starts with the 26th element. Where the line breaks depends on the width of your console.
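The expressions so far, collected in one place (the variable name msg is just an illustrative choice for the second example):

```r
x <- 5                       # a numeric vector with one element
msg <- "My name is Gerard"   # a character vector with one element

x <- 1:30                    # the colon operator builds a sequence
print(x)                     # auto-printing x in the console shows the same
print(class(x))              # "integer"
```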

Now we will create a vector of objects with the c() function. The c here refers to concatenation, which means that it joins a series of objects into a single vector.

The first example, x<-c(1-2i, 1+1i), is termed a complex vector because its elements contain the imaginary unit i. The numbers must be typed without quotation marks; written as "1-2i" they would be character strings.

The second example, x<-c("a","b","c"), is termed a character vector because its elements are character strings.

The third example, x<-c(TRUE, FALSE), is termed a logical vector because its elements are the logical values TRUE and FALSE, which are typed without quotation marks. The fourth example, x<-c("YES", "NO"), looks similar but is a character vector, because R only treats TRUE and FALSE (or T and F) as logical values. We will discuss logical functions in our advanced topics.

The fifth example, x<-c(0.099, 1), is a numeric vector because it contains real numbers.

We can also create a vector with the vector() function, which is the long-hand way of creating a vector in R.

Here we have the expression x<-vector("numeric", length=5), a vector that contains numeric elements and has a length of 5. The outputs are all zeros because, by default, a new numeric vector is filled with the value 0.
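As a quick check of the vector types, the sketch below creates one vector of each kind and asks R for its class. Note that the complex and logical values are typed without quotation marks; with quotes they would be character strings. The variable names are illustrative.

```r
cplx  <- c(1-2i, 1+1i)     # complex
chr   <- c("a", "b", "c")  # character
lgl   <- c(TRUE, FALSE)    # logical
num   <- c(0.099, 1)       # numeric
zeros <- vector("numeric", length = 5)   # 0 0 0 0 0

classes <- c(class(cplx), class(chr), class(lgl), class(num), class(zeros))
print(classes)

# Quoting the values turns them into character strings
quoted <- c("TRUE", "FALSE")
print(class(quoted))   # "character"
```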

Sometimes we combine objects of different classes in one vector. R then coerces every element to a single class, chosen according to which atomic classes are present. The order of priority is as follows:

1st Priority = character

2nd Priority = numeric

3rd Priority = logical

Here is an example:

In the first example, x<-c(2,"b"), the result is a character vector: the number 2 is coerced to the character string "2" because of the element "b", which belongs to the atomic class "character". Thus, this expression is a character vector with the elements "2" and "b". The second example, x<-c(FALSE, 10), is a numeric vector with the elements 0 and 10, because the logical value FALSE is coerced to the number 0. The third example, x<-c(FALSE, "b"), is a character vector with the elements "FALSE" and "b".
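The three mixed vectors can be tried as below. The logical value is written without quotation marks here, on the assumption that this is what the examples intended; with quotes it would already be a character string.

```r
mix1 <- c(2, "b")       # character: "2" "b"
mix2 <- c(FALSE, 10)    # numeric: 0 10
mix3 <- c(FALSE, "b")   # character: "FALSE" "b"

print(c(class(mix1), class(mix2), class(mix3)))
```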

In R, you can convert a vector from one atomic class to another by explicit coercion, using the family of functions whose names start with as., such as as.numeric().

In our example we created a sequence of numbers from 0 to 3 with the expression x<-0:3. To determine its atomic class we used the class() function and typed class(x). The output showed that x is an integer vector. We then coerced the integer vector into numeric, logical, character and complex using as.numeric(), as.logical(), as.character() and as.complex(). Take note that in the logical vector the value 0 becomes FALSE and any non-zero value becomes TRUE.

Care should be taken when coercing one atomic class to another, as the coercion sometimes makes no sense and results in NA. In the next example we have a character vector x<-c("a", "b", "c") and we try to express it as numeric, logical, integer and complex. The outputs are all NA because there is no meaningful way of expressing "a", "b" and "c" in those atomic classes.
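Both coercion examples, in a form you can paste into the console. suppressWarnings() is used only to keep the output free of the "NAs introduced by coercion" warnings.

```r
x <- 0:3
print(class(x))              # "integer"

as_num <- as.numeric(x)      # 0 1 2 3
as_lgl <- as.logical(x)      # FALSE TRUE TRUE TRUE
as_chr <- as.character(x)    # "0" "1" "2" "3"
as_cpl <- as.complex(x)      # 0+0i 1+0i 2+0i 3+0i

# Coercion that makes no sense gives NA
abc <- c("a", "b", "c")
bad_num <- suppressWarnings(as.numeric(abc))
bad_lgl <- as.logical(abc)
bad_int <- suppressWarnings(as.integer(abc))
bad_cpl <- suppressWarnings(as.complex(abc))
```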

Now that you have made vectors, our next step is to create matrices. Matrices are a special kind of vector because they have a dimension attribute. The dimension attribute of a matrix is defined by its number of rows and columns (nrow, ncol).

Try typing x<-matrix(nrow=3, ncol=5) in your R console and hit enter, then type x. The output is a matrix with 3 rows and 5 columns, filled with NA because no values were given. Try typing dim(x) and hit enter; the output will show 3 and 5, 3 for rows and 5 for columns. Now type attributes(x) and hit enter, and the output will give you the dimension of your matrix, which is 3x5.

Take note that matrices are created in a column-wise manner. This means that the first column is filled first and when the maximum number of rows is reached the next column is then filled. Say for example we type the expression x<-matrix(1:10, nrow=2, ncol=5).

The maximum number of rows in our expression is two. As you can observe, the first column [,1] is filled first, and when both rows, [1,] and [2,], have been filled, the next column [,2] is then filled, and so on.

Matrices can also be created by assigning a dimension attribute with the dim() function. Let's create a series of numbers from 1 to 10, then let's create a 2x5 matrix (a matrix with 2 rows and 5 columns) from this series of numbers using the dim() function.
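The matrix examples above can be reproduced with this sketch (the names empty, m and v are illustrative):

```r
# An empty 3 by 5 matrix, filled with NA
empty <- matrix(nrow = 3, ncol = 5)
print(dim(empty))          # 3 5
print(attributes(empty))   # the dim attribute

# Filled column by column: 1 and 2 go into column 1, 3 and 4 into column 2
m <- matrix(1:10, nrow = 2, ncol = 5)
print(m)

# The same matrix made by giving a plain vector a dimension attribute
v <- 1:10
dim(v) <- c(2, 5)
print(identical(m, v))     # TRUE
```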

Matrices can also be made with cbind() or rbind(), which create a matrix by binding vectors together as columns or as rows. If you want your vectors, say x and y, to become the columns of the matrix, use the cbind() function; if you want your vectors to become the rows, use the rbind() function.
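For instance, with two short vectors (the values are chosen only for illustration):

```r
x <- 1:3
y <- 10:12

by_col <- cbind(x, y)   # 3 rows, 2 columns named x and y
by_row <- rbind(x, y)   # 2 rows named x and y, 3 columns

print(by_col)
print(by_row)
```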

Aside from matrices, a list is also a special kind of vector that can be used in R. Lists are special because they can contain elements of different atomic classes. This special kind of vector is created with the list() function. Say for example you want to create a list x<-list(1+1i, 3, FALSE, "a"). This is a list containing a complex, a numeric, a logical and a character element. The output looks different from that of a vector because each element in the list keeps its own atomic class.
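To confirm that each element keeps its own class, one can loop over the list with sapply(). The complex and logical values are typed without quotation marks here so that R does not read them as character strings.

```r
x <- list(1+1i, 3, FALSE, "a")

# Ask each element for its class
element_classes <- sapply(x, class)
print(element_classes)   # "complex" "numeric" "logical" "character"
```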