Wednesday, June 25, 2014

Introduction to R: Loading Files, Writing Files, and the read.table() and read.csv() Functions in R

In our previous discussion we talked about how to remove NA values in R. This time we will talk about the different ways of loading and writing files in R. The following are the principal functions for reading data into R and writing it back out.

For loading tabular data, such as a table exported from Excel as a text or CSV file:
read.table()
read.csv()

For loading the lines of a text file:
readLines()

For loading an R code file:
source(), which is the inverse of dump(), as we will discuss later.
dget(), which is the inverse of dput(), also to be discussed later.

For loading objects that were saved from the workspace:
load()

For loading a single R object stored in binary form:
unserialize()

For writing tabular data from R:
write.table(), which is most commonly used to write out a data frame.
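
As a rough illustration of writing a table and reading it back, here is a minimal sketch. The data frame, its column names, and the temporary file are made up for the example.

```r
# Hypothetical example data; the column names are for illustration only.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write to a temporary file so nothing in the working directory is touched.
path <- tempfile(fileext = ".txt")
write.table(df, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back using the same separator and a header line.
df2 <- read.table(path, header = TRUE, sep = "\t")
print(df2)
```

Setting row.names = FALSE keeps R from writing the row numbers as an extra column, which usually makes the file easier to read back.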

For writing lines to a text file:
writeLines()
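
A small sketch of how writeLines() pairs with readLines(); the text and the temporary file are invented for the example.

```r
# Write three lines of text to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), con = path)

# Read every line back as a character vector, one element per line.
lines <- readLines(path)
print(lines)

# The n argument limits how many lines are read.
first_two <- readLines(path, n = 2)
print(first_two)
```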

For writing R objects out as R code:
dump()
dput()
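
The pairing of these functions can be sketched as follows: dput() writes one object as R code and dget() reads it back, while dump() writes one or more named objects and source() restores them. The object names and temporary files are only for illustration.

```r
# dput() / dget(): a single object, written as R code.
y <- data.frame(a = 1:2, b = c("x", "y"))
dput_path <- tempfile(fileext = ".R")
dput(y, file = dput_path)
y2 <- dget(dput_path)          # the object is returned; you choose its name

# dump() / source(): several objects, restored under their original names.
x <- "foo"
dump_path <- tempfile(fileext = ".R")
dump(c("x", "y"), file = dump_path)
rm(x, y)                       # remove the originals
source(dump_path, local = TRUE)  # recreate x and y in the current environment
print(x)
print(y)
```

Note that dput() does not store the name of the object, whereas dump() does.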

For saving R objects from the workspace:
save()
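
A minimal sketch of save() together with load(), assuming two small made-up objects and a temporary file:

```r
# Two hypothetical objects to save.
a <- 1:5
b <- c("x", "y")

path <- tempfile(fileext = ".RData")
save(a, b, file = path)

# Remove the objects, then bring them back from the file.
rm(a, b)
restored <- load(path)   # load() returns the names of the restored objects
print(restored)
print(a)
```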

For writing a single R object in binary form:
serialize()
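
As a short sketch, serialize() can turn an object into a raw (binary) vector and unserialize() turns it back. The list used here is made up for the example.

```r
# A hypothetical object to convert to binary form.
obj <- list(n = 42L, label = "example")

# With connection = NULL, serialize() returns a raw vector instead of writing to a connection.
raw_bytes <- serialize(obj, connection = NULL)

# unserialize() rebuilds the original object from the raw vector.
obj2 <- unserialize(raw_bytes)
print(identical(obj, obj2))
```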

The read.table() & read.csv() Functions in R

The read.csv() function is identical to read.table() except for its defaults: read.csv() uses a comma as the separator and assumes the file has a header line, while read.table() by default separates values on white space (spaces or tabs). CSV is a common format for data exported from Excel.
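
To make the difference concrete, here is a small sketch that reads the same comma-separated file three ways. The file contents are invented for the example.

```r
# A tiny made-up CSV file written to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ann,31", "Bob,27"), path)

# read.csv() assumes a comma separator and a header line.
via_csv <- read.csv(path)

# read.table() gives the same result once sep and header are set explicitly.
via_table <- read.table(path, header = TRUE, sep = ",")
print(all.equal(via_csv, via_table))

# With its defaults, read.table() splits on white space only,
# so each line of this file ends up in a single column.
default_table <- read.table(path)
print(ncol(default_table))
```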

The read.table() function is suitable for small or moderately sized data and has the following common arguments:
file is the name of a file, or a connection.

header is a logical value indicating whether the file contains a header line.

sep is a string that indicates how the columns in the table are separated.

colClasses is a character vector that indicates the class of each column in the data set.

nrows is the number of rows to read from the data set.

comment.char is a character string that indicates the comment character.

skip is the number of lines to skip from the beginning of the file.

stringsAsFactors is a logical value indicating whether character variables should be encoded as factors.
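
The following sketch uses all of these arguments in one call. The file layout (a title line, a semicolon separator, a "#" comment line) and the column names are assumptions made up for the example.

```r
# Build a small made-up file: one title line, a header, data rows, and a comment.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "exported by a hypothetical tool",
  "id;group;value",
  "1;a;0.5",
  "# a comment line",
  "2;b;1.5",
  "3;a;2.5"
), path)

dat <- read.table(
  file = path,
  header = TRUE,                 # the first line read holds the column names
  sep = ";",                     # columns are separated by semicolons
  skip = 1,                      # skip the title line at the top of the file
  comment.char = "#",            # ignore anything after a "#"
  colClasses = c("integer", "character", "numeric"),
  nrows = 2,                     # read at most two data rows
  stringsAsFactors = FALSE       # keep character columns as character
)
str(dat)
```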

As mentioned above, read.table() is suitable for small or medium-sized data. When you need to analyze a large data set, take the following precautions:

a. First, calculate the memory required to read and store the large data.
For a rough estimate of the memory needed, assuming every column is numeric, use the formula:

Memory Requirements = number of rows x number of columns x (8 bytes per numeric value)

As a rule of thumb, allow about twice that amount, since reading the data in needs some overhead:

Total Memory Requirement = 2 x Memory Requirements
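
As a worked example of this formula, the sketch below uses a made-up data set of 1,500,000 rows and 120 columns and assumes every column is numeric.

```r
# Hypothetical dimensions, chosen only to illustrate the arithmetic.
n_rows <- 1500000
n_cols <- 120
bytes_per_value <- 8              # one numeric value takes 8 bytes

data_bytes <- n_rows * n_cols * bytes_per_value   # 1,440,000,000 bytes
data_gb <- data_bytes / 2^30                      # about 1.34 GB
total_gb <- 2 * data_gb                           # about 2.68 GB with overhead

print(round(data_gb, 2))
print(round(total_gb, 2))
```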

b. Second, remember that R keeps its data in memory, so what you can load depends on the amount of RAM in your computer.
c. Third, close other applications on the computer so that they do not slow R down.
d. Fourth, use colClasses to specify the class of each column, so that R does not have to work the classes out for itself.
e. Fifth, if you do not know the column classes in advance, work them out from a small sample of the data:
- Read the first 100 rows.
- Loop through each column with sapply() and call class() on it to find its class.
- Read the full file with colClasses set to those classes.

The format is as follows:

initial <- read.csv("name of file here", nrows = 100)
classes <- sapply(initial, class)
tabAll <- read.csv("name of file here", colClasses = classes)
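
To try the same steps without a real data file, here is a self-contained sketch that first writes a made-up CSV file to a temporary location. The data and column names are assumptions for the example.

```r
# Create a hypothetical 1000-row CSV file to stand in for a large data set.
path <- tempfile(fileext = ".csv")
big <- data.frame(
  id = 1:1000,
  value = seq(0.5, by = 0.5, length.out = 1000),
  label = rep(c("a", "b"), 500)
)
write.csv(big, path, row.names = FALSE)

# Read a small sample and record the class of each column.
initial <- read.csv(path, nrows = 100)
classes <- sapply(initial, class)
print(classes)

# Read the whole file, telling R the classes up front.
tabAll <- read.csv(path, colClasses = classes)
print(dim(tabAll))
```

One caution: the classes are guessed from the first 100 rows only. If a column looks like whole numbers in the sample but has decimals or text further down, reading the full file with those classes can fail, so it is worth checking the result.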
