Monday, September 24, 2018

Data Science & Analytics: Lecture 1

These are condensed notes on Dr. Eugene Rex Jalao's lectures for the Data Science & Analytics subject. Dr. Jalao is a Professor of Industrial Engineering at UP Diliman.

I. What is Business Analytics?

- it is the utilisation of organisational and external data to provide: Timely, Accurate, High-value, Actionable decisions. (TAHA)

- it is an umbrella term that combines: Architecture, Tools, Databases, Analytical Tools, Applications, and Methodologies (ATDAAM).

- the main goal of business analytics is to provide easy access to data and models so that managers can carry out their own analyses.

- it is an entire field encompassing technology, work processes, human and organisational factors.

- business analytics is not a single tool, nor merely a collection of reports, dashboards, and visualisations.

II. History of Business Analytics

1. Early 1970s to mid-1980s
- paper reports stored in early databases.

2. 1980's to 1990's
- Start of early automation, paper reports with decision support systems stored in early data warehouses.

3. Rest of 1990's
- the rise of online analytical processing (OLAP); less paper was used and more data warehousing and data marts.

4. 2000's
- the rise of next generation OLAP integrated with data mining and visualisation.

5. Early 2010's
- The rise of business intelligence and visualisation.

III. Top Technological Strategy Trends for 2018

1. Intelligent:
a. A.I. foundations
b. Intelligent apps & analytics
c. Intelligent things

2. Digital:
a. Digital twins
b. Cloud to the edge
c. Conversational platforms & immersive experiences

3. Mesh:
a. Blockchain
b. Event-driven
c. Continuous adaptive risk & trust

IV. Priority Ranking of I.T. Technology Investment as of 2018
1. B.I. & Big Data
2. Process Automation
3. Cloud Software as a Service
4. Service Management
5. Legacy System Enhancements

VI. Some Challenges Data Professionals Face
1. Dirty data
2. Lack of data science talents
3. Corporate politics
4. Lack of support
5. Access to data

VII. Business Analytics Framework
1. Source Systems
- are the sources of data that drive the analytics.
- examples: OLTP systems, ERP systems, external data, other sources of data.

2. Integration systems
- Systems that extract, transform, and load data into data warehouses.

3. Data Management Systems
- Data warehouses that store and serve cleaned data for analytics.

4. Analytics
- examples: EDA, data mining, optimization, and simulation

VIII. Different Types of Business Analytics
a. Descriptive Analytics
- also called exploratory data analysis
- answers the questions:
1. What happened and why?
2. What is happening now?

- the purpose is to describe and summarize the data using graphs and basic statistical techniques to generate reports, dashboards and visualizations.
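
As a quick illustration of descriptive work in base R (using the built-in airquality data set; the particular summaries and plots are just examples, not from the lecture):

summary(airquality)                         # basic summary statistics for every column
hist(airquality$Temp)                       # distribution of one numeric variable
boxplot(Ozone ~ Month, data = airquality)   # compare a variable across groups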

b. Predictive Analytics
- machine learning and data mining fall under predictive analytics.
- answers the questions:
1. What is likely to happen?
2. Tell me something interesting, without me asking

- it finds patterns and trends based on historical data to provide useful information for decision making.

- there are two types of machine learning:
1. Supervised Machine Learning
- tries to predict a labelled response variable.
- examples are (see the short R sketch after this list):
a. Classification: prediction of a categorical response variable.
b. Regression: prediction of a numerical response variable.
c. Time Series: prediction of a numerical response variable based on time predictors.
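
As a rough illustration in R (the built-in mtcars and iris data sets, the rpart package assumed to be installed, and the particular models are my own choices, not part of the lecture):

# Regression: predict a numerical response (mpg) from a predictor (wt)
reg <- lm(mpg ~ wt, data = mtcars)
predict(reg, newdata = data.frame(wt = 3.0))

# Classification: predict a categorical response (Species) with a decision tree
library(rpart)
clf <- rpart(Species ~ ., data = iris)
predict(clf, newdata = iris[1, ], type = "class")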

2. Unsupervised Machine Learning
- the data in unsupervised learning are unlabelled.
- the method "fishes" for patterns.
- examples of unsupervised machine learning are:
Clustering
- it divides data points into groups called clusters or segments.
- the variance of data points within the same cluster should be as small as possible (coherence), while the variance between clusters should be as large as possible (separation); a short R sketch follows.
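
A minimal k-means sketch in base R (the built-in iris measurements and the choice of k = 3 are purely illustrative):

km <- kmeans(iris[, 1:4], centers = 3)
km$withinss                       # within-cluster variation (coherence, to be minimised)
km$betweenss                      # between-cluster variation (separation, to be maximised)
table(km$cluster, iris$Species)   # compare the clusters with the known species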

Association Rule
- identifies strong rules that associate items with one another through a measure of interestingness.
- measures the probability that one item occurs together with another item.
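
A hedged sketch using the arules package, assuming it is installed (the Groceries transaction data and the support/confidence thresholds are just examples):

library(arules)
data(Groceries)   # market-basket transactions bundled with arules
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 3))   # a few of the strongest rules by lift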

Sequential Pattern Analysis
- given a series of items and their corresponding time sequence, it derives apriori-style rules that measure the probability that a consequent item will occur, given that an antecedent item has occurred earlier.

Text Mining
- uses term frequency-inverse document frequency (TF-IDF) to determine the dominant words in a text.
- other methods use perplexity and Shannon entropy to determine informative words.
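
A toy TF-IDF computation in base R (the two "documents" are invented for illustration):

docs <- c("data science is fun", "data mining finds patterns in data")
words <- strsplit(docs, " ")
vocab <- unique(unlist(words))
# term frequency: how often each vocabulary word appears in each document
tf <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
# inverse document frequency: rarer words get a higher weight
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)   # words that appear in every document get weight 0
round(tfidf, 2)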

Social Media Sentiment Analysis
- from a given set of texts, determines the probability of a sentiment.
- methods used can be naive Bayes classifiers, decision trees, and other classification methods.

c. Prescriptive Analytics
- optimization and simulation fall under prescriptive analytics.
- in optimization, a solution that maximizes or minimizes an objective is found subject to a set of constraints (see the sketch below).
- in simulation, the natural system is imitated artificially so that alternative scenarios can be evaluated and inferences drawn.
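
A small prescriptive sketch using the lpSolve package, assuming it is installed (the product-mix numbers are invented for illustration):

library(lpSolve)
# maximise profit 25x + 20y subject to limited machine time and labour
objective <- c(25, 20)
constraints <- matrix(c(20, 12,    # machine minutes needed per unit of x and y
                         4,  4),   # labour hours needed per unit of x and y
                      nrow = 2, byrow = TRUE)
sol <- lp("max", objective, constraints, c("<=", "<="), c(1800, 480))
sol$solution   # how much of each product to make
sol$objval     # the resulting maximum profit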

IX. APEC Recommended Competencies
a. Business & Organizational skills
- involves business analytics, visualization, data management & governance, domain knowledge.

b. Technical skills
- involves statistical techniques, computing, data analytics & research methods.

c. Workplace skills
- communication, storytelling, ethics, and entrepreneurship.

Introduction to R: Control Structure in R Part 2

"While" Loop Function
A while loop begins by testing a logical condition; if the condition is TRUE, it executes the loop body. After each pass through the body the condition is tested again, and the loop terminates once the condition becomes FALSE.

Example: Create a while loop that will print out a value from 1 to 20 and will terminate at the value of 20.
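
One way to write this loop (the counter name count is just a choice):

count <- 1
while(count <= 20){
    print(count)
    count <- count + 1
}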



In the example, the while loop tests the condition on the counter, prints its value, and then increments it. Once the count passes 20, the condition becomes FALSE and the loop stops.

Although while loops can make code easier to read, they can potentially result in infinite loops when the condition is not written properly, SO USE WITH CARE. It is safer to use a for(){} loop when you have a complex command, since a for loop runs a fixed number of times.


Sunday, August 3, 2014

Introduction to R: Control Structure in R Part 1

In our previous discussion, we talked about other textual formats in R. In this post we will discuss the control structures that are commonly used when writing an R program; these structures control the flow of execution in the R language. There are several control functions that we will discuss here:
a. if and if-else - used to test a condition.
b. for loops - used to execute a loop a fixed number of times.
c. while loops - used to execute a loop "while" a condition is true.
d. break - used to break the execution of a loop.
e. next - used to skip an iteration of a loop.
f. return - used to exit a function.

Control Structure 1: if () {} else{} & if (){}, else if(){} else{}

The if(){} and if(){} else{} constructs test a logical condition. If the condition is TRUE, R executes a command; if the condition is FALSE, the else{} block (when present) executes another set of commands.
The simplest form of the if(){} control structure is:

if(condition){
    command
} else {
}

For two conditions, the if(){} else if(){} else{} structure takes the form:

if(condition 1){
    command 1
} else if(condition 2){
    command 2
} else {
}

Notice that the else{} block is written immediately after the closing brace of the preceding block.

For multiple independent conditions, the control structure is simply a series of if(){} statements:

if(condition 1){command 1}
if(condition 2){command 2}
if(condition 3){command 3}
...
if(condition N){command N}

An else{} block is not necessary when using multiple independent if(){} conditions.
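
A small concrete sketch (the variable x and the printed messages are made up for illustration):

x <- 7
if(x > 10){
    print("x is greater than 10")
} else if(x > 5){
    print("x is greater than 5 but not greater than 10")
} else {
    print("x is 5 or less")
}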

Control Structure 2: for(){} Loops

The for(){} loop is the most common type of looping construct in R. A for loop has a loop index, which is commonly named i; for nested loops the index can take other letters (j, k, l, m, n, etc.). The for(){} loop takes an iterator variable and assigns it successive values from a sequence or vector. for(){} loops are commonly used to iterate over the elements of an object.

The simplest for(){} loop takes the form:
for(variable in sequence){
    command
}

Example 1:
Create a for(){} loop that will loop over numbers 1 to 20.
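
One way to write this loop:

for(i in 1:20){
    print(i)
}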



In this example, the loop takes the variable i and, on each iteration, prints one of the values from 1 to 20. After the last number is printed, the loop exits.

Example 2:
Create a for(){} loop that uses the index i, sequenced from 1 to 5, and prints the ith element of a vector containing the letters "a", "b", "c", "d", "e".
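
A sketch of this loop (the vector name x is just a choice):

x <- c("a", "b", "c", "d", "e")
for(i in 1:5){
    print(x[i])
}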



Example 3:
The seq_along() function takes a vector as input and creates an integer sequence equal in length to that vector. Using the vector from Example 2, which has length 5, seq_along() creates the integer sequence 1 to 5.
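
The same loop written with seq_along(), again assuming the vector x from Example 2:

x <- c("a", "b", "c", "d", "e")
for(i in seq_along(x)){
    print(x[i])
}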



Example 4:
Using the vector from Example 2 again, we will create a different index variable and call it "alphabet". This index variable takes its values directly from the vector, so the index does not need to be an integer (it can take elements from any arbitrary vector). In this example, the for(){} loop goes through the vector created with c("a", "b", "c", "d", "e") and prints the index variable, which is equal to each letter in turn.
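
A sketch of this version; only the index changes, and it now takes the letters themselves:

x <- c("a", "b", "c", "d", "e")
for(alphabet in x){
    print(alphabet)
}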

Nested for(){} Loops

A nested for(){} loop is simply a for(){} loop placed inside another for(){} loop.

Example 5:
In this example we will create a matrix with 2 rows and 3 columns. The outer loop uses the index i: applying seq_len() to the number of rows creates an integer sequence over the rows. The inner loop uses the index j: applying seq_len() to the number of columns creates an integer sequence over the columns. The loop then prints every element of the matrix.
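
A sketch of the nested loop (the matrix values 1 to 6 are arbitrary):

x <- matrix(1:6, nrow = 2, ncol = 3)
for(i in seq_len(nrow(x))){          # outer loop over the rows
    for(j in seq_len(ncol(x))){      # inner loop over the columns
        print(x[i, j])
    }
}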

A word of caution: be careful when using nested for loops, and never go beyond two to three levels of nesting, as this will make the code difficult for others to understand.

Thursday, July 10, 2014

Introduction to R: Other Textual Formats in R

Last time we discussed how to save a .csv file. In this post we will discuss other textual formats in R: functions that can be used for formats aside from .csv or .txt.

There are two functions that can be used to read or write these other textual formats in R: the dump() function and the dput() function. These two functions are important because they make the stored data editable and recoverable. The downside of dump() and dput() is that they are not space efficient.

The dput() Function

This function is another way to pass data into R by deparsing R objects. Output written by dput() can be read back in with dget(). The mechanism of the dput() function is that it takes an R object and creates R code that will essentially reconstruct that object in R.

Example:
We will create a small data frame with two columns. The first column will be named "d" and the second column "f". The value of d will be set to 10, while the value of f is set to "d".
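
The corresponding code, following the expressions quoted in the explanation below:

z <- data.frame(d = 10, f = "d")   # create the small data frame
dput(z)                            # print R code that reconstructs z
dput(z, file = "z.R")              # write that reconstruction code to a file
x <- dget("z.R")                   # read it back into a new object
x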

In the first part of our example we have the expression z<-data.frame(d=10, f="d"). When the data frame is passed to dput(), it is reconstructed as R code that creates a list with two elements, with the class recorded at the end, as you can see in the output: class = "data.frame". Another essential point is that to save this representation we must dput it into a file, as in dput(z, file="z.R"). The file can then be retrieved using the dget() function, as in x<-dget("z.R"). Hence, the dput() function essentially writes R code which can be used to reconstruct an R object.

Dumping Objects in R

If we have multiple objects that we want to deparse in R, we can use the dump() function. Dumping is quite similar to using dput(); the main difference is that dump() handles multiple objects, whose names are passed as a character vector, while dput() works on a single object. A dumped file is read back into R with source().

Example:
We create two objects, x and y. We assign the character string "owl" to x and a data frame (a = 10, b = "a") to y.
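
A sketch of the dump example (the file name data.R is my own choice):

x <- "owl"                           # a character vector
y <- data.frame(a = 10, b = "a")     # a small data frame
dump(c("x", "y"), file = "data.R")   # write R code for both objects to a file
rm(x, y)                             # remove them from the workspace
source("data.R")                     # run the dumped code; x and y are restored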


Sunday, June 29, 2014

Introduction to R: Saving an Excel File as CSV

In the previous post we talked about loading and writing files in R, and we also tackled the read.table() and read.csv() functions. In this post we will learn how to save an Excel file in CSV format.

CSV is short for comma-separated values; this format allows data to be saved in a table-structured form. An Excel file saved in the CSV format becomes a text file in which the values are separated by commas.

The R language, however, can read other file formats:
read.xls() (from the gdata package) can read Excel files, though a Perl runtime must be present on the system. A more involved way of importing Excel files into R is to use loadWorkbook() to load the workbook and readWorksheet() to open the desired worksheet within it (both from the XLConnect package, which requires Java to be installed).

To import Minitab files into R you can use read.mtp(), for SPSS files R has read.spss() (both from the foreign package), and finally for CSV files we have read.csv().

Before you can load your CSV file into R, you need to save your Excel file in the CSV format using the following steps:

a. In this example I have an Excel file which contains the current employee census of Australia.

b. Click the home button, click Save as, click Other formats.

c. Choose the CSV (comma delimited) format. You also have the option of choosing CSV (Macintosh) or CSV (MS-DOS); I prefer the comma delimited format.

d. Click Save.

e. If you have multiple sheets, Excel will warn you that only the currently active worksheet will be saved in the CSV format; other worksheets must be saved individually.

f. Another warning will pop up saying that the file may contain features that are not compatible with the comma delimited format; you can ignore this and click Yes.
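
Once the file has been saved, it can be loaded back into R with read.csv(); the file name below is only a placeholder for whatever you named your file:

census <- read.csv("employee_census.csv")
head(census)   # inspect the first few rows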

Wednesday, June 25, 2014

Introduction to R: Loading Files, Writing Files, and the read.table() and read.csv() Function in R

In our previous discussion we talked about the ways to remove NA values in R. This time we will talk about the different ways of loading and writing files in the R language. The following are the principal functions for loading and writing ready-made data in R.

For loading tabular data, such as data exported from Excel:
read.table()
read.csv()

For loading lines in a text file:
readLines()

For loading a R code file:
source(), which is the inverse of a dump as we will later discuss.
dget(), which is the inverse of a dput which will be discussed later.

For loading files that are saved in the workspace:
load()

For loading single R objects in a binary code:
unserialize()

For writing tables in R:
write.table(), which is most commonly used to write out data in the form of a data frame.

For writing lines in R:
writeLines()

For writing codes in R:
dump()
dput()

Saving R files in the workspace:
save()

For writing a binary in R:
serialize()

read.table()  & read.csv()Function in R

The read.csv() function is identical to read.table() except that read.csv() uses a comma as the separator and expects a header line by default, while read.table() uses whitespace as the default separator. CSV is a common export format for Excel files.

The read.table() function is suitable for small or moderately sized data and has the following common arguments (a short usage sketch follows this list):
file - the name of the file or connection to read from.

header - a logical value indicating whether the file contains a header line.

sep - a string indicating how the values in the table are separated.

colClasses - a character vector indicating the class of each column in the data set.

nrows - the number of rows to read from the data set.

comment.char - a character string indicating the comment character.

skip - the number of lines to skip from the beginning of the file.

stringsAsFactors - a logical value indicating whether character variables should be encoded as factors.
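
A short usage sketch (the file names and the tab separator are assumptions for illustration):

dat  <- read.csv("data.csv")                      # comma-separated, header read by default
dat2 <- read.table("data.txt", header = TRUE,     # explicitly declare the header line
                   sep = "\t",                    # tab-separated values
                   stringsAsFactors = FALSE,      # keep character columns as characters
                   nrows = 100)                   # read only the first 100 rows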

As previously said, the read.table() function is suitable for small or medium-sized data, but when large data sets need to be analyzed the following precautionary measures should be taken:

a. First, calculate the memory required to read and store the large data.
To make a rough calculation of the memory needed by your large data, follow the formula:

Memory requirement = number of rows x number of columns x (8 bytes per numeric value)

Total memory requirement = 2 x memory requirement
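
For instance, a data frame with 1,500,000 rows and 120 numeric columns (figures chosen purely for illustration) needs roughly 1,500,000 x 120 x 8 bytes = 1,440,000,000 bytes, or about 1.34 GB, so plan on roughly twice that, around 2.7 GB of free RAM.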

b. Second, remember that R's operation is dependent on the amount of RAM in your computer.
c. Third, it is better to close other applications on the computer to avoid the delay in the R operation.
d. Fourth, use colClasses to specify the classes of the columns; telling R in advance what class of object each column contains also makes reading large files considerably faster.
e. Fifth, read the data in stages when analyzing:
- take the first 100 rows.
- loop through each column using the sapply() function to determine its class.
- read the full data set, passing the resulting classes to colClasses.

the format is as follows:

initial<-read.csv("name of file here", nrows=100)
classes<-sapply(initial, class)
tabAll<-read.csv("name of file here", colClasses=classes)

Sunday, June 22, 2014

Introduction to R: Removing NA Values

We discussed subsetting objects and elements in our previous topic. Now we will discuss how to remove NA or missing values in R. This is one of the most common operations in data analysis because most raw data have missing values. The principle is to create a logical vector that locates the NA values; these NA values are then removed by subsetting with the negation operator [!].

Let's proceed to the exercise:
Say you have a vector x<-c(1,NA,3,NA,NA,6,NA,7,8) and you want to remove the missing values. In our previous discussion on missing values we met the is.na() function, whose purpose is to locate the missing values in a vector. We create a logical vector z<-is.na(x), which marks the positions of the NA values; to remove them we use the [!] expression, in the form x[!z].
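
In code, this looks like:

x <- c(1, NA, 3, NA, NA, 6, NA, 7, 8)
z <- is.na(x)   # logical vector: TRUE wherever x is missing
x[!z]           # keeps only the non-missing values: 1 3 6 7 8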

REMOVING NA VALUES IN MULTIPLE VECTORS

If you have more than one vector with missing values, you can use the complete.cases() function. For example:

You have two vectors x<-c(2,NA,4,NA,6,NA,8,NA) and y<-c("b",NA,"d",NA,"f",NA,"h",NA). We create a logical vector z<-complete.cases(x,y), which is TRUE only at the positions where both vectors are complete; subsetting with it, as in x[z] and y[z], gives the output with the NA values removed.
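
In code:

x <- c(2, NA, 4, NA, 6, NA, 8, NA)
y <- c("b", NA, "d", NA, "f", NA, "h", NA)
z <- complete.cases(x, y)   # TRUE only where both x and y are non-missing
x[z]   # 2 4 6 8
y[z]   # "b" "d" "f" "h"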

REMOVING NA VALUES FROM A DATA FRAME

In most cases our data are stored in tabular form, such as in .xls or .csv files. For this exercise we will use a .csv file to create a data frame in R. The data in this .csv file come from a quiz used by the Johns Hopkins School of Public Health and Biostatistics titled "air quality"; download the file here.

Step 1: In your R console click File.



Step 2: A menu box will open, click Local Disk (C:)



Step 3: Click the directory Users.


Step 4: Click another directory named Users.



Step 5: The file you have downloaded should be in the Download directory, click this folder then click OK.


Step 6: To determine if the file is in the Download directory type list.files(). The air quality file should be present.



Step 7: To read the data from the file, type airquality<-read.csv("airquality.csv") and then type airquality to display it; the R console will show a data frame consisting of 153 rows and 6 columns (Ozone, Solar.R, Wind, Temp, Month, Day).



Step 8: Say we only want to analyse the first 10 rows; then we type airquality[1:10,]. Recall from our previous topic on matrices that rows are selected with the syntax [row, ] while columns are selected with [ , column].



Step 9: As you may have observed, there are NA values in the data frame. To remove the NA values we type the expression z<-complete.cases(airquality), and to display the output with the NA values removed we type airquality[z,][1:10,].



As you can see, all the rows containing NA values have been dropped and the next complete rows take their place, so the output still shows 10 rows.
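
Steps 7 to 9 can be summarised in a few lines of code (assuming airquality.csv is in the working directory):

airquality <- read.csv("airquality.csv")
airquality[1:10, ]                  # first 10 rows, NA values still present
z <- complete.cases(airquality)     # TRUE for rows with no missing values
airquality[z, ][1:10, ]             # first 10 complete rows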