Introduction to Data Exploration with R Programming

What is Predictive Analytics?

Predictive analytics is the branch of advanced analytics used to make predictions about unknown future events. It draws on techniques from data mining, statistics, modelling, machine learning, and artificial intelligence to analyse current data and make predictions about the future.

This is where the R language comes in.

R was originally written for statisticians to do statistical analysis, including predictive analytics. It’s open-source software, used extensively in academia to teach such disciplines as statistics, bio-informatics, and economics. From its humble beginnings, it has since been extended to do data modelling, data mining, and predictive analysis.

R has a very active community; free code contributions are being made constantly and consistently. One of the benefits of using an open-source tool such as R is that most of the data analysis that you’ll want to do has already been done by someone.

Because R is free to use, it’s the perfect tool to use to build a rapid prototype to show management the benefits of predictive analytics.

Working with data in R involves 3 main areas: Data Exploration, Data Manipulation, and Data Analysis. In this blog, we will focus on Data Exploration.

What is Data Exploration?

Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a dataset. It has 3 main stages: understanding the structure of the data set, looking at the data, and visualizing the data. Each stage has several functions that help you do so.

Understanding the structure

In this stage we use several functions that tell us what kind of data we have, such as:

· class() – View the Class

This is used to verify that the data we are using is in fact a data frame, that is, a 2-dimensional table of rows and columns in which each column holds a single data type (numeric, character, etc.).

· dim() – View Dimensions

This function is useful, because it tells us whether it would be okay to print the entire data frame to the console. It will show you how many rows and columns you have for your dataset, and it will always display the number of rows first then the number of columns.

· names() – Column Names

Each data set has column names, and this function allows you to see the name of each column.

· str() – View Structure

This is one of the most versatile and useful functions in R, because it can be called on any object and normally provides a compact summary of its internal structure. For a data frame, it tells us how many rows and columns we have, reporting rows as observations and columns as variables. Additionally, it shows the name of each column, its data type, and a preview of the data it contains.

· summary() – Descriptive Statistics

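summary() provides descriptive statistics for each numeric column: the min, max, mean, median, and quartiles. When a dataset has character or factor variables, those columns are summarized differently: a factor column shows a count for each level, while a character column shows only its length, class, and mode.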
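The functions above can be tried together on a small data set. The sketch below assumes the built-in iris data frame that ships with R; it is not part of the original discussion, and any data frame of your own would work the same way.

```r
# Assumption: iris is used purely for illustration. It is a built-in
# data frame with 4 numeric columns and 1 factor column (Species).
data(iris)

class(iris)    # confirms the object is a "data.frame"
dim(iris)      # rows first, then columns: 150 5
names(iris)    # the five column names
str(iris)      # observations, variables, column types, and a preview
summary(iris)  # min/quartiles/median/mean/max for numeric columns,
               # counts per level for the factor column Species
```

Checking dim() first is a cheap way to decide whether printing the whole data frame to the console is sensible.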

Looking at the Data

This is the stage where we get to see the contents of our data, and the basic functions are:

· head() – View top 6; head(dataset, n=10) – View top 10

This function allows you to view the first 6 rows of your data set by default. But if you add an additional argument n, you can define how many rows you want to be displayed.

· tail() – View bottom 6; tail(dataset, n=10) – View bottom 10

The tail function works the same way as the head function, but shows you the bottom 6 rows by default. As with head, adding the argument n lets you set how many bottom rows are displayed.

· print() – To View Entire Data Set

This function prints your entire data set to the console. It is not recommended for large data sets, but for a small data set it should be fine.

· View() – Excel Format Display of the Data Set

This function opens a viewer window with vertical and horizontal (if there are enough columns to justify it) scroll bars for you to browse the entire data set. It looks much like an Excel spreadsheet, except that you can’t manipulate any of the data.
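As a rough illustration of these viewing functions, the sketch below again assumes the built-in iris data frame, and the variable name small is only a placeholder. View() needs an interactive session such as RStudio, so the call is guarded here.

```r
# Assumption: iris stands in for your own data set.
data(iris)

head(iris)           # first 6 rows by default
head(iris, n = 10)   # first 10 rows
tail(iris)           # last 6 rows by default
tail(iris, n = 10)   # last 10 rows

# print() shows everything, so it is applied to a small subset here.
small <- head(iris, n = 3)
print(small)

# View() opens a read-only, spreadsheet-style window; it only makes
# sense when R is running interactively.
if (interactive()) View(iris)
```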

Visualizing the Data

The final stage of exploring the data is visualizing it. There are several visualization functions available in R, but the following are the most common:

· hist() – Histogram

A histogram is a plot that breaks the data into bins (or breaks) and shows the frequency distribution across these bins. You can also change the breaks and see how that affects how easy the visualization is to understand.

· barplot() – Bar Graph

Bar plots are suitable for comparing cumulative totals across several groups. They are recommended when you want to plot a categorical variable or a combination of continuous and categorical variables.

· heatmap() – Heat Map

Heat maps enable you to do exploratory data analysis with two dimensions as the axis and the third dimension shown by intensity of colour.

· plot() – Line Plot

Line charts are commonly preferred when we want to analyse a trend over a time period. A line plot is also suitable when we need to compare relative changes in quantities across some variable (like time). Note that plot() draws points by default for numeric vectors; passing type = "l" draws a line instead.
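The four plotting functions can be sketched as follows. The data sets used here (iris and Nile, both built into R) and the choice of 20 breaks are assumptions made for illustration only.

```r
# Assumption: iris and Nile are built-in data sets used as stand-ins.
data(iris)
data(Nile)

# Histogram: distribution of one numeric column; breaks is a suggestion
# for the number of bins, so try other values and compare.
h <- hist(iris$Sepal.Length, breaks = 20,
          main = "Sepal length", xlab = "Length (cm)")

# Bar plot: counts of a categorical variable, one bar per group.
species_counts <- table(iris$Species)
barplot(species_counts, main = "Observations per species")

# Heat map: needs a numeric matrix; rows and columns are the two axes
# and colour intensity shows the (column-scaled) value.
heatmap(as.matrix(iris[, 1:4]), scale = "column")

# Line plot: a trend over time, drawn with type = "l".
years <- as.numeric(time(Nile))
flow  <- as.numeric(Nile)
plot(years, flow, type = "l",
     xlab = "Year", ylab = "Annual flow", main = "Nile flow at Aswan")
```

When run as a script rather than interactively, R writes these plots to a file (Rplots.pdf by default) instead of opening a window.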

In Summary

Predictive analytics is the branch of advanced analytics used to make predictions about unknown future events.

The R language was written for statisticians to do statistical analysis, including predictive analytics. It is open-source software with a very active community, and it is free to use, which makes it the perfect tool for building a rapid prototype to show management the benefits of predictive analytics.

Working with data in R involves 3 major areas: Data Exploration, Data Manipulation, and Data Analysis, but we focused on Data Exploration.

We discussed the 3 stages of Data Exploration and the common functions used in each:

Understanding the structure

class()
dim()
names()
str()
summary()

Looking at the Data

head()
tail()
print()
View()

Visualizing the Data

hist()
barplot()
heatmap()
plot()

VicUni Predictive Analytics
