Data Science – an expansion of probability, statistics, and programming that applies computational technology to answer more advanced questions.
Analysis Versus Analytics
- Analysis is performed on events that occurred in the past.
- Analytics explores potential future events. There are two types of Analytics:
- Qualitative – Relies on intuition and experience. It is less tangible, concerning subjective characteristics and opinions – things that cannot be expressed as a number.
- Quantitative – Uses formulas and algorithms. Involves looking at the hard data, the actual numbers.
Business Analytics, Data Analytics, and Data Science: An Introduction
This is how Business Analytics, Data Analytics, Data Science and Machine Learning intersect in a business environment:
Data Science Infographic
When Are These Disciplines Used?
- Data – Information stored in a digital format, which can then be used as a base for performing analyses and decision making. Two types:
- Traditional Data can be processed on a single computer.
- Big Data requires multiple computers.
- Data Science – combines statistical, mathematical, programming, problem solving, and data management tools. Three types:
- Business Intelligence – analyzes the past data that you have acquired. It includes all technology-driven tools involved in the process of analyzing, understanding, and reporting available past data.
- Traditional Methods – a set of methods that are derived mainly from statistics and are adapted for business. These are perfect for forecasting future performance with great accuracy. Examples include regression, cluster, and factor analysis.
- Machine Learning – predicts outcomes from data without being explicitly programmed to do so.
Why are these Disciplines Used – The Benefit of Each Discipline
- Traditional Methods are used with Traditional Data.
- Machine Learning is used with Big Data.
What Techniques are Involved?
- Traditional Data
- Start with Raw Data from a database.
- Raw data is also known as Raw Facts, or Primary Data.
- Raw data cannot be analyzed right away. It is untouched data you have accumulated and stored on a server.
- Data can be raw facts, processed data, or information.
- Data Collection – the gathering of raw data, e.g. surveys, cookies.
- Raw Data → Data Pre-processing → Processing → Information
- Data Pre-processing attempts to solve problems encountered in data collection.
- Class Labeling – labeling a data point to the correct data type, or arranging data by category.
- Numerical – can be manipulated mathematically, e.g. age, revenue.
- Categorical – cannot be manipulated mathematically, e.g. gender, region.
- Data Cleansing (or Data Cleaning, Data Scrubbing) – deals with inconsistent data, such as misspelled data.
- Missing Values – find ways to enter this missing data.
- Balancing – Ensuring that different groups have equal representation in the data. For instance, male and female customers.
- Data Shuffling – Randomizes the order of the data, like shuffling cards. Prevents unwanted patterns. Improves predictive performance. Helps avoid misleading results.
- Relational Database Management Systems – use Entity-Relationship Diagrams (ER Diagrams) to illustrate a database's architecture.
- Relational Schema – each box represents a distinct data table, and the connecting lines show which tables are related.
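The pre-processing steps above (class labeling, cleansing, missing values, balancing, and shuffling) can be sketched with pandas. The column names and values below are invented for illustration, and downsampling is just one of several balancing strategies:

```python
import pandas as pd

# Hypothetical raw data exhibiting the problems described above.
raw = pd.DataFrame({
    "gender": ["male", "female", "femle", "male", None, "female"],  # misspelling, missing value
    "age": ["34", "28", "45", "52", "31", "40"],                    # numbers stored as text
})
df = raw.copy()

# Class labeling: give each column its correct data type.
df["age"] = df["age"].astype(int)  # numerical – can be manipulated mathematically

# Data cleansing: correct inconsistent (misspelled) entries.
df["gender"] = df["gender"].replace({"femle": "female"})

# Missing values: fill with the most frequent category (one of several strategies).
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Balancing: downsample so each group has equal representation.
n = df["gender"].value_counts().min()
balanced = df.groupby("gender").sample(n=n, random_state=0)

# Data shuffling: randomize the row order, like shuffling cards.
shuffled = balanced.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```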
- Big Data
- Preprocessing can be more involved as big data not only contains numerical and categorical data, but also digital images, digital video data, digital audio data, etc.
- As a result, there are a wider range of data cleansing methods.
- Big Data is derived from many sources in a complex manner.
- Examples of big data include social media data and financial market data.
- Text Data Mining – the process of deriving valuable information from unstructured text.
- Data Masking – Allows us to analyze the information without compromising private details.
- Conceals the original data with random or false data.
- Conduct analysis.
- Keep all confidential information in a secure place.
- These are called Confidentiality Preserving Data Mining techniques.
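A minimal sketch of the masking idea, assuming hypothetical customer records: the names are concealed with salted hash tokens (the salt stands in for the confidential information kept in a secure place), and the analysis runs on the masked data only.

```python
import hashlib

# Hypothetical customer records containing private details.
customers = [
    {"name": "Alice Smith", "purchase": 120.0},
    {"name": "Bob Jones",   "purchase": 80.0},
    {"name": "Alice Smith", "purchase": 45.0},
]

SECRET_SALT = "store-this-in-a-secure-place"  # confidential, kept separate from the data

def mask(value: str) -> str:
    # Conceal the original value with an irreversible token. The same input
    # always maps to the same token, so grouping and counting still work.
    return hashlib.sha256((SECRET_SALT + value).encode()).hexdigest()[:10]

masked = [{"customer": mask(c["name"]), "purchase": c["purchase"]} for c in customers]

# Conduct the analysis on the masked data: total spend per (anonymous) customer.
totals = {}
for row in masked:
    totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["purchase"]
print(totals)
```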
- Business Intelligence
- Business Intelligence uses data skills, business knowledge, and intuition.
- It explains past performance such as: What happened? When did it happen? How many units did we sell? In which region did we sell the most units?
- Quantification – the process of representing observations as numbers.
- Measure – the accumulation of observations to show some information. Measures relate to simple descriptive statistics of past performance, e.g. the monthly revenue for a specific product.
- Metric – aims at gauging business performance or progress; a metric is a measure plus some business meaning, e.g. quarterly revenue per customer. Metrics are very useful for comparisons.
- Observations → Quantifications → Measures → Metrics → Key Performance Indicators
- Thousands of metrics can result from this process – too many to track usefully. Key Performance Indicators (KPIs) narrow the focus to the metrics that relate directly to your business objectives.
- Quantitative insights in BI must be visualized with dashboards, reports, etc.
- BI can be used in areas such as Price Optimization, and Inventory Management.
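The observations → measures → metrics chain can be illustrated with pandas; the sales figures below are invented for illustration:

```python
import pandas as pd

# Hypothetical quarterly sales records (observations, already quantified).
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "customer": ["A", "B", "A", "A", "C"],
    "revenue":  [100.0, 250.0, 50.0, 300.0, 120.0],
})

# Measure: accumulated revenue per quarter (a simple descriptive statistic).
revenue_per_quarter = sales.groupby("quarter")["revenue"].sum()

# Metric: measure + business meaning – quarterly revenue per customer.
customers_per_quarter = sales.groupby("quarter")["customer"].nunique()
revenue_per_customer = revenue_per_quarter / customers_per_quarter

print(revenue_per_customer)  # useful for quarter-over-quarter comparison
```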
- Traditional Methods
- Now we stop dealing with Analysis, and start dealing with Predictive Analytics.
- Regression – a model used for quantifying causal relationships among the different variables included in your analysis.
- Logistic Regression – a non-linear model whose output falls between 0 and 1 and is interpreted as a binary (0 or 1) outcome. Useful for decision-making processes.
- Cluster Analysis – Clustering is about grouping observations or data points.
- Factor Analysis – grouping explanatory variables together. This reduces the dimensionality of the problem.
- Time Series – following certain values over time such as stock prices or sales volume.
- Used in the real world in customer User Experience (UX), and Sales Forecasting.
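A minimal regression sketch with NumPy, using invented advertising-spend and sales figures, shows how a fitted model quantifies a relationship and forecasts future performance:

```python
import numpy as np

# Hypothetical data: advertising spend (explanatory) vs. sales (dependent).
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales    = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Regression: fit sales = slope * ad_spend + intercept by least squares.
slope, intercept = np.polyfit(ad_spend, sales, deg=1)

# The fitted model quantifies the relationship and can forecast performance.
forecast = slope * 6.0 + intercept  # predicted sales at a spend of 6.0
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```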
- Machine Learning
- ML – creating an algorithm which a computer then uses to find a model that fits the data as well as possible. Accurate predictions can then be made using this model. We do not provide the machine with instructions, but rather, algorithms to solve a problem.
- ML uses a trial and error process. Each consecutive trial is at least as good as the previous trial.
- There are four ingredients to an ML algorithm:
- Data
- Model – is iteratively trained.
- Objective Function – calculates how far predictions are from their target. This is iteratively minimized by use of the Optimization Algorithm.
- Optimization Algorithm – mechanics that will improve the model’s performance.
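The four ingredients can be seen together in a minimal sketch: a linear model trained by gradient descent on invented data, with mean squared error as the objective function:

```python
import numpy as np

# Ingredient 1 – Data: inputs x and targets t (hypothetical, roughly t = 2x + 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

# Ingredient 2 – Model: a linear model with trainable weight w and bias b.
w, b = 0.0, 0.0

lr = 0.02  # learning rate for the optimization algorithm
for _ in range(2000):
    y = w * x + b  # model prediction (a trial)
    # Ingredient 3 – Objective Function: mean squared error,
    # measuring how far predictions are from their targets.
    loss = np.mean((y - t) ** 2)
    # Ingredient 4 – Optimization Algorithm: gradient descent,
    # nudging w and b so the next trial is at least as good.
    grad_w = np.mean(2 * (y - t) * x)
    grad_b = np.mean(2 * (y - t))
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to the underlying 2 and 1
```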
- Three types of ML
- Supervised Learning – uses an Objective Function to measure inaccuracy, and an Optimization Algorithm to improve the accuracy during training. Notable supervised learning examples are:
- Support Vector Machines (SVM)
- Neural Networks
- Deep Learning
- Random Forest
- Bayesian Networks
- Unsupervised Learning – Data is unlabeled.
- K-Means
- Deep Learning
- Reinforcement Learning – a reward system is introduced when the machine performs as desired. Similar to supervised learning, but instead of minimizing loss, one maximizes reward. Note that Deep Learning is now used in all three types of ML.
- ML is used in the real world in areas such as Fraud Detection and Client Retention.
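As an unsupervised-learning sketch, here is a minimal K-Means (k = 2) written from scratch in NumPy; the data points are invented and clearly separated so the grouping is easy to verify:

```python
import numpy as np

# Unlabeled data: two obvious groups (hypothetical customer spend values).
points = np.array([1.0, 1.2, 0.8, 9.0, 9.5, 8.7])

# Minimal K-Means (k = 2): alternate assignment and centroid-update steps.
centroids = np.array([0.0, 10.0])  # initial guesses
for _ in range(10):
    # Assignment step: each point joins the cluster of its nearest centroid.
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Update step: move each centroid to the mean of its cluster.
    centroids = np.array([points[labels == k].mean() for k in range(2)])

print(labels, centroids)  # the algorithm found the two groups on its own
```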
How are these Disciplines Used – Tools Used
- Programming Languages
- R and Python are the most common.
- They are suitable for mathematical and statistical computations.
- They are adaptable.
- They are not able to address problems specific to some domains.
- SQL – created for working with relational database management systems.
- Matlab – ideal for working with mathematical functions or matrix manipulations.
- Java and Scala are used with Big Data.
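A small sketch of SQL against a relational schema, run through Python's built-in sqlite3 module; the two related tables and their contents are invented for illustration:

```python
import sqlite3

# A tiny relational schema (two related tables), kept in memory for the sketch.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 45.0), (3, 2, 80.0);
""")

# SQL joins the related tables through the connecting key (customer_id).
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)
```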
- Software
- Application Software can be made using the programming languages above. These tools provide a smaller scope and are easier for users to learn.
- Excel – able to perform relatively complex computations and good visualizations quickly.
Who Uses These Disciplines
Traditional Data and Big Data
- Data Architect – designs the way data will be retrieved, processed, and consumed.
- Data Engineer – processes the obtained data so that it is ready for analysis.
- Database Administrator – handles the control of data. Mainly works with Traditional Data.
Business Intelligence
- BI Analyst – performs analyses and reporting of past historical data.
- BI Consultant – external BI Analyst.
- BI Developer – Python and SQL. Performs analyses specifically designed for the company.
Traditional Methods and Machine Learning
- Data Scientist – employs traditional statistical methods or unconventional machine learning techniques for making predictions.
- Data Analyst – prepares more advanced types of analyses.
- ML Engineer – applies state of the art computational models.