What is Predictive Analytics?
Predictive analytics is the branch of advanced analytics used to make predictions about unknown future events. It draws on techniques from data mining, statistics, modelling, machine learning, and artificial intelligence to analyse current data and make predictions about the future.
This is where the R language comes in.
It was originally written for
statisticians to do statistical analysis, including predictive analytics. It’s
open-source software, used extensively in academia to teach such disciplines as
statistics, bio-informatics, and economics. From its humble beginnings, it has
since been extended to do data modelling, data mining, and predictive analysis.
R has a very active community;
free code contributions are being made constantly and consistently. One of the
benefits of using an open-source tool such as R is that most of the data
analysis that you’ll want to do has already been done by someone.
Because R is free to use, it is a good tool for building a rapid prototype to show management the benefits of predictive analytics.
Working with data in R involves three main areas: data exploration, data manipulation, and data analysis. This post focuses on data exploration.
What is Data Exploration?
Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a data set. It has three main stages: understanding the structure of the data set, looking at the data, and visualizing the data. Each stage has several functions to help you.
Understanding the structure
At this stage, several functions let us explore the structure of the data we have, such as:
· class() – View the Class
This is used to verify that the data we are using is in fact a data frame: a 2-dimensional table of rows and columns in which each column holds a single data type (numeric, character, etc.).
· dim() – View Dimensions
This function shows how many rows and columns your data set has, always giving the number of rows first and then the number of columns. This is useful for judging whether it would be reasonable to print the entire data frame to the console.
· names() – Column Names
Each data set has column names, and this function lets you see the name of each column.
· str() – View Structure
One of the most versatile and useful functions in R, because it can be called on any object and normally gives a compact summary of its internal structure. For a data frame, it tells us how many rows and columns we have, described as observations (rows) and variables (columns). Additionally, it shows the name of each column, its data type, and a preview of the data it contains.
· summary() – Descriptive Statistics
summary() provides descriptive statistics for each column, including the min, max, mean, median, and quartiles of numeric columns. When a data set has character or factor variables, it summarizes those columns differently: for example, counts per level for a factor.
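As a minimal sketch of the functions above, the following uses the built-in iris data set as a stand-in for your own data. The choice of iris is an assumption made for illustration; it does not come from the original post.

```r
# iris ships with base R (datasets package); it stands in for your own data frame.
data(iris)

class(iris)    # "data.frame"
dim(iris)      # 150 5 (rows first, then columns)
names(iris)    # the five column names
str(iris)      # observations, variables, column types, and a preview of values
summary(iris)  # numeric columns: min, quartiles, median, mean, max;
               # the factor column Species: counts per level
```

Running str() before print() is a cheap way to check the size and column types before deciding how much of the data to display.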
Looking at the Data
This is the stage where we get to
see the contents of our data, and the basic functions are:
· head() – View top 6; head(dataset, n=10) – View top 10
This function shows the first 6 rows of your data set by default. If you add the argument n, you can set how many rows are displayed.
· tail() – View bottom 6; tail(dataset, n=10) – View bottom 10
The tail function works the same way as the head function, but shows the bottom 6 rows by default. As with head, adding the argument n lets you set how many bottom rows are displayed.
· print() – View Entire Data Set
This function prints your entire data set to the console. It is not recommended for large data sets, but for a small data set it should be fine.
· View() – Spreadsheet-Style Display of the Data Set
This opens a window with vertical and horizontal (if there are enough columns to justify it) scroll bars for browsing the entire data set. It looks much like an Excel spreadsheet, except that you cannot edit any of the data.
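The functions in this stage can be tried out as follows. This is a sketch that again assumes the built-in iris data set, which is not part of the original post.

```r
# iris (150 rows) stands in for your own data frame.
data(iris)

top6     <- head(iris)           # first 6 rows (the default)
top10    <- head(iris, n = 10)   # first 10 rows
bottom6  <- tail(iris)           # last 6 rows (the default)
bottom10 <- tail(iris, n = 10)   # last 10 rows

print(top10)   # print a small piece to the console
print(iris)    # prints every row; fine at this size, unwieldy for large data

# View() opens a spreadsheet-style viewer and needs an interactive session
# (for example RStudio), so it is guarded here.
if (interactive()) View(iris)
```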
Visualizing the Data
The final stage of data exploration is visualizing the data. R has several visualization functions, but the following are the common ones:
· hist() – Histogram
A histogram is a plot that divides the data into bins (set by the breaks argument) and shows the frequency of values in each bin. You can change the breaks to see how that affects the readability of the plot. Note that R is case-sensitive, so the function name is written in lowercase.
· barplot() – Bar Graph
Bar plots are suitable for comparing totals across several groups. They are recommended when you want to plot a categorical variable, or a continuous variable summarized by a categorical one.
· heatmap() – Heat Map
Heat maps let you do exploratory data analysis with two dimensions as the axes and a third dimension shown by intensity of colour.
· plot() – Line Plot
Line charts are commonly preferred when analysing a trend over a time period. They also suit cases where we need to compare relative changes in a quantity across some variable (like time). Note that plot() draws points by default when given two numeric vectors; pass type = "l" to get a line.
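The four plotting functions above can be sketched with data sets that ship with R. The choice of iris, mtcars, and pressure is an assumption made for illustration and does not come from the original post.

```r
# Built-in data sets stand in for your own data.
data(iris); data(mtcars); data(pressure)

# Histogram: breaks suggests the number of bins; try other values to compare.
h <- hist(iris$Sepal.Length, breaks = 10,
          main = "Sepal length", xlab = "cm")

# Bar plot of a categorical variable: count the rows in each group first.
counts <- table(iris$Species)
barplot(counts, main = "Rows per species")

# Heat map: rows and columns are the two axes, colour intensity is the value.
# heatmap() needs a numeric matrix; scale = "column" makes columns comparable.
heatmap(as.matrix(mtcars), scale = "column")

# Line plot: type = "l" joins the points with a line.
plot(pressure$temperature, pressure$pressure, type = "l",
     xlab = "Temperature", ylab = "Pressure")
```

When run as a script rather than interactively, R typically writes these plots to a file (Rplots.pdf) instead of opening a window.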
In Summary
Predictive analytics is the branch of advanced analytics used to make predictions about unknown future events.
The R language was written for statisticians to do statistical analysis, including predictive analytics. It is open-source software with a very active community and is free to use, which makes it a good tool for building a rapid prototype to show management the benefits of predictive analytics.
Working with data in R involves three major areas: data exploration, data manipulation, and data analysis. This post focused on data exploration.
We discussed the three stages of data exploration and the common functions used in each:
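· Understanding the structure: class(), dim(), names(), str(), summary()
· Looking at the Data: head(), tail(), print(), View()
· Visualizing the Data: hist(), barplot(), heatmap(), plot()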