DATA ANALYTICS: Introduction to R: Factors, Missing Values, Data frames and Names

In our previous discussion, we learned how to create vectors, matrices and lists in R. Now we will learn about factors, missing values, data frames and names in the R language.

We will first discuss factors. Factors represent categorical data in R. Categorical data can be unordered (nominal), such as gender, color or flavor, or ordered (ordinal), such as 1st, 2nd, 3rd, 4th, 5th. Factors are important in modelling functions such as lm() for linear modelling and glm() for generalized linear modelling. Factors are also self-describing: their values carry labels, such as male or female for a nominal factor, or senior level, mid-level or junior level for an ordered factor, instead of bare integer codes.

A factor can be created in R using the factor() function, whose input is typically a character vector. Say, for example, we want to create a factor with two levels, yes and no, and then build a frequency table of how many no's and yes's are present.
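A minimal sketch of this, assuming a small made-up vector of responses (the five values below are illustrative, not taken from any dataset):

```r
# Hypothetical responses; the values are illustrative only.
x <- factor(c("yes", "no", "yes", "yes", "no"))

# Printing the factor shows the values followed by a line "Levels: no yes".
print(x)

# table() counts how many times each level occurs: 2 no's and 3 yes's here.
counts <- table(x)
print(counts)
```

Note that the levels are listed as no, yes even though the first value in the data is yes.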

The table() function counts how often each level occurs, here the number of yes's and no's. The unclass() function strips the factor class and shows the integer codes that R stores internally, 1 for no and 2 for yes, together with the levels attribute that maps those codes back to their labels.
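The integer codes can be inspected with a short sketch like the one below, again assuming the same made-up responses:

```r
# Hypothetical responses; the values are illustrative only.
x <- factor(c("yes", "no", "yes", "yes", "no"))

# unclass() drops the "factor" class and exposes the stored integer codes.
codes <- unclass(x)
print(codes)

# The codes are 2 1 2 2 1, and attr(,"levels") holds "no" "yes",
# so 1 stands for no and 2 stands for yes.
attr(codes, "levels")
```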

By default R orders the levels of a factor alphabetically and treats the first level as the baseline level, so in the yes and no example the baseline is no. During modelling the choice of baseline is quite important, because the other levels are compared against it. To set the order of the levels explicitly, we pass the levels argument to the factor() function. In our previous example, the attr(,"levels") part of the unclass(x) output lists no before yes, because the letter n comes before y in the alphabet, so the baseline level there is no. But what if we want to have yes as our baseline level?

Passing levels=c("yes", "no") to the factor() function sets the order of the levels. The first entry, "yes", tells R that the baseline level will be "yes", which can be observed in the printed output, where the levels line now starts with yes: Levels: yes no.
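A sketch of setting the baseline this way, using the same made-up responses as an assumption:

```r
# Hypothetical responses; the values are illustrative only.
x <- factor(c("yes", "no", "yes", "yes", "no"),
            levels = c("yes", "no"))

# The printed levels line now reads "Levels: yes no".
print(x)

# The first level is the baseline, and the integer codes follow the new order:
# yes is now stored as 1 and no as 2.
baseline <- levels(x)[1]
codes <- as.integer(x)
```

The data are unchanged; only the mapping between labels and integer codes differs from the default alphabetical ordering.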

You will often encounter data, including factors, with missing values. Missing values are special objects in R: NaN ("not a number") marks the result of an undefined mathematical operation such as 0/0, while NA marks a value that is missing or not available. To test for missing values, the is.na() function can be used, while for undefined mathematical results the is.nan() function is used.

NA values have a class: there is an integer NA, a character NA, a numeric NA, a logical NA and a complex NA. A NaN value also counts as an NA, but an NA value is never a NaN, so is.na() returns TRUE for both while is.nan() returns TRUE only for NaN. Both functions return a logical vector: TRUE where the element is missing (for is.na()) or is an undefined mathematical result (for is.nan()), and FALSE otherwise.
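The difference between the two tests can be sketched as follows, assuming a small made-up numeric vector that contains both kinds of value:

```r
# Hypothetical vector containing one NaN and one NA.
y <- c(1, 2, NaN, NA, 4)

# is.na() flags both the NaN and the NA.
na_y <- is.na(y)    # FALSE FALSE TRUE TRUE FALSE

# is.nan() flags only the NaN.
nan_y <- is.nan(y)  # FALSE FALSE TRUE FALSE FALSE

# An undefined mathematical operation produces NaN.
undefined <- 0 / 0

# NA comes in typed variants; plain NA is logical.
class(NA)             # "logical"
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"
```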

Another key feature of R is the data frame, which is used to store and analyse tabular data. A data frame is represented as a special type of list in which every element has the same length. Each element of the list is a column, and the length of each element is the number of rows. Data frames can store a different class of object in each column, unlike matrices, which require every element to be of the same class. R can read tabular data into a data frame with the read.table() function or the read.csv() function. A data frame can also be converted into a matrix with the data.matrix() function.
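As a rough sketch of reading and converting, the snippet below writes a tiny made-up CSV to a temporary file (the column names id, score and passed are assumptions for illustration), reads it back with read.csv(), and converts it with data.matrix():

```r
# Write a small hypothetical CSV to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("id,score,passed",
             "1,90.5,TRUE",
             "2,72.0,FALSE",
             "3,85.25,TRUE"), path)

# read.csv() returns a data frame; each column keeps its own class.
df <- read.csv(path)
col_classes <- sapply(df, class)  # integer, numeric, logical

# data.matrix() converts every column to numbers and returns a 3 x 3 matrix.
m <- data.matrix(df)
print(m)

# Clean up the temporary file.
unlink(path)
```

In the matrix, the logical column is stored as numbers (TRUE becomes 1 and FALSE becomes 0), which illustrates why a matrix cannot hold mixed classes the way a data frame can.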

Aside from reading tables and text or csv files, data frames can also be created in R using the data.frame() function.

In our example we created a data frame from the number sequence 1 to 3 and the logical sequence False, True, False. The first object, number=1:3, becomes the first column and the second, logic=c(F,T,F), becomes the second column. The nrow() and ncol() functions can be used to determine how many rows and columns the data frame has.
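The example described above can be sketched as follows; FALSE and TRUE are spelled out here in place of the shorthand F and T, which is a stylistic assumption and does not change the result:

```r
# Two columns: an integer sequence and a logical vector of the same length.
x <- data.frame(number = 1:3, logic = c(FALSE, TRUE, FALSE))
print(x)

# Dimensions of the data frame.
n_rows <- nrow(x)  # 3
n_cols <- ncol(x)  # 2
```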

Aside from data frames, we can write readable code and create self-describing objects in R with the names() function. Say, for example, we want to create an integer vector from 1 to 5 and name each integer: cat, dog, ant, bird, tree.
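A short sketch of naming a vector in this way:

```r
# Integer vector from 1 to 5.
x <- 1:5

# Assign one name per element.
names(x) <- c("cat", "dog", "ant", "bird", "tree")

# Printing shows the names above the values.
print(x)

# Elements can now be looked up by name.
ant_value <- x[["ant"]]  # 3
```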

Lists and matrices can also be named.

Naming a list is quite direct: the names are placed inside the list() function itself.
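A minimal sketch of a named list; the element names and values below are made up for illustration:

```r
# Names are given directly as name = value pairs inside list().
x <- list(cat = 1, dog = "two", ant = TRUE)
print(x)

# The names can be retrieved with names(), and elements accessed with $.
list_names <- names(x)  # "cat" "dog" "ant"
dog_value <- x$dog      # "two"
```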

For naming matrices, the dimnames() function is used, and it is assigned a list of character vectors. A matrix is first constructed by specifying its values and the number of rows and columns. The dimnames() assignment then follows: the first vector, c("cat", "dog"), gives the row names, while the second, c("ant", "bird"), gives the column names.
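The steps above can be sketched as follows, assuming a 2 x 2 matrix filled with the values 1 to 4 (the values are an assumption; only the names come from the text):

```r
# Build a 2 x 2 matrix; R fills it column by column.
m <- matrix(1:4, nrow = 2, ncol = 2)

# First vector: row names. Second vector: column names.
dimnames(m) <- list(c("cat", "dog"), c("ant", "bird"))
print(m)

# Cells can be addressed by row and column name.
cell <- m["dog", "ant"]  # 2
```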

Sunday, June 15, 2014
