**Big Data – 5Vs**

**Veracity** – Refers to the biases, noise and abnormality in data.

**Variety** – Represents the diversity of data. Data sets will vary by type (e.g. social networking, media, text) and by how well they’re structured.

**Velocity** – Reflects the speed at which data is generated and used.

**Volume** – Reflects the size of a data set.

**Value** – Finally, having access to data creates value only when you have the right data to glean strategic insights.

**Structured Vs Unstructured Data**

Structured data are organized in a fixed format, typically the rows and columns of traditional databases and spreadsheets (e.g. Excel, Access, SAP, SAS…).

Unstructured data are unorganized text documents, emails, videos, audio and financial transactions.

**Analytical Approaches**

**A/B Testing** – An experiment in which two versions (A and B) are compared. They’re identical except for one variation that might affect a user’s behavior. Version A might be the currently used version (control) while Version B is modified in some respect (treatment).
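One common way to judge an A/B test on conversion rates is a two-proportion z-test. The sketch below is a minimal illustration, not a prescribed method: the function name and the visitor and conversion counts are hypothetical, and it assumes samples large enough for the normal approximation to hold.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: control (A) vs. treatment (B).
z, p = two_proportion_z_test(conv_a=200, n_a=1000, conv_b=250, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two versions is unlikely to be due to chance alone.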

**Data Discovery** – A Business Intelligence (BI) architecture which allows users to explore data for hidden patterns and trends. It focuses on dynamic, easy-to-use reports, whereas traditional BI reports are static.

**Descriptive Analytics** – Describes what happened in a given situation (e.g. number of posts, mentions, sales closed…).
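As a small illustration of descriptive analytics, the sketch below summarizes a week of sales using Python's standard `statistics` module. The `daily_sales` figures are made up for the example.

```python
import statistics

# Hypothetical number of sales closed on each day of one week.
daily_sales = [12, 15, 9, 20, 15, 18, 11]

summary = {
    "total": sum(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "max": max(daily_sales),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```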

**Diagnostic Analytics** – Describes why it happened (e.g. root cause analysis).

**Predictive Analytics** – Describes what could happen in the future.

**Prescriptive Analytics** – Describes what should be done.

**Analytical Techniques**

**Cluster Analysis** – task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar, in some sense or another, to each other than those in other clusters.
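To make the idea of grouping similar objects concrete, here is a minimal sketch of k-means on one-dimensional data. K-means is only one of many clustering methods, chosen here for brevity. The function name, the sample values and the starting centers are all assumptions for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(centers, clusters)
```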

**Comparative Analysis** – step-by-step procedure of comparisons and calculations to detect patterns within very large data sets.

**Decision Tree Analysis** – a decision support tool that uses a tree-like graph of decisions and their possible consequences, including chance event outcomes and resource costs.
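A tree of decisions and chance outcomes is often evaluated by comparing the expected value of each branch. The sketch below assumes a single decision with two hypothetical options. The probabilities and payoffs are invented for the example.

```python
# Each decision maps to its chance outcomes as (probability, payoff) pairs.
# All numbers here are hypothetical.
decisions = {
    "launch": [(0.6, 100_000), (0.4, -50_000)],
    "hold": [(1.0, 0)],
}

def expected_value(outcomes):
    """Probability-weighted average payoff of one branch."""
    return sum(prob * payoff for prob, payoff in outcomes)

# Pick the decision whose branch has the highest expected payoff.
best = max(decisions, key=lambda name: expected_value(decisions[name]))

for name, outcomes in decisions.items():
    print(f"{name}: {expected_value(outcomes):,.0f}")
print("best:", best)
```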

**Factor Analysis** – analyze large numbers of dependent variables to detect certain aspects of independent variables (factors) affecting those dependent variables.

**Machine Learning** – a type of AI which provides computers with the ability to learn without being explicitly programmed.

**Multivariate Analysis** – observation and analysis of more than one statistical outcome variable at a time.

**Regression Analysis** – a statistical process for estimating relationships between a dependent variable and one or more independent variables.
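The simplest case, one independent variable, can be sketched with ordinary least squares. The function name and the data points below are assumptions for illustration. Real analyses would normally use a statistics package rather than hand-rolled code.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```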

**Segmentation Analysis** – divides a broad category into subsets that have, or are perceived to have, common features, needs, interests or priorities.

**Sentiment Analysis** – process of identifying and categorizing opinions expressed in a piece of text to determine whether the writer’s attitude towards a topic or issue is positive, negative or neutral.
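A toy version of this categorization counts words from small positive and negative word lists. This is a deliberately naive sketch. The word lists and function name are hypothetical, and production systems typically rely on trained models instead.

```python
# Tiny hypothetical lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def classify_sentiment(text):
    """Label text as positive, negative or neutral by counting lexicon hits."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this great product!"))
print(classify_sentiment("Terrible service, poor quality."))
print(classify_sentiment("The package arrived on Tuesday."))
```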

**Simulation** – imitation of the operation of a real-world process or system over time. It requires a model that represents the key characteristics or behaviors of the selected physical or abstract system or process.
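A Monte Carlo run is one simple form of simulation. The sketch below models a hypothetical shop whose daily demand is assumed to be uniformly distributed between 80 and 120 units, and estimates how often a given stock level runs out. The model, the numbers and the function name are all illustrative assumptions.

```python
import random

def estimate_stockout_rate(stock, trials=10_000, seed=42):
    """Estimate the share of simulated days on which demand exceeds stock."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    stockouts = 0
    for _ in range(trials):
        # Assumed demand model: uniform integer demand between 80 and 120.
        demand = rng.randint(80, 120)
        if demand > stock:
            stockouts += 1
    return stockouts / trials

rate = estimate_stockout_rate(stock=110)
print(f"Estimated stockout rate: {rate:.1%}")
```

Under this demand model the true rate is 10/41 (about 24%), so the estimate should land close to that value.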

**Time Series Analysis** – Comprises methods for analyzing time series data to extract meaningful statistics and other characteristics of the data.
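One basic time series method is the simple moving average, which smooths short-term fluctuations to expose a trend. The sketch below is a minimal illustration. The function name and the sample series are assumptions.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive observations."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

# Hypothetical monthly readings.
readings = [10, 12, 14, 16, 18]
print(moving_average(readings, window=3))
```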

**How to organize and process data – questions to ask yourself**

- How do we get the data?
- How do we store the data?
- How do we process the data?
- How do we visualize the data?

**How do we get the data? – sources to consider**

- Manual input
- Point-of-sale systems
- Web forms
- Sensors

**How do we store the data? – factors to consider**

- Type of data being collected
- Granularity of the data
- Time the data is captured
- Completeness of the data set

**How do we process the data? – tools to consider**

- Excel
- SAS
- R
- Python

**How do we visualize the data? – tools to consider**

- Power BI
- QlikView
- Tableau

**Why Excel?**

- Easy to use
- Everyone has it
- Extremely versatile
- Dynamic and what-if scenarios
- Automate data processing
- Complex data can be presented in a visually appealing way

**What is SAS?**

- A licensed software system for data analysis, graphs and report writing.

**Why SAS?**

- Handles both simple and complex statistical analyses

**What is R?**

- Open-source software environment and language for statistical computing and graphics.

**Why R?**

- Contains over 7,000 packages in cutting-edge statistics, econometrics, optimization, machine learning and simulation techniques

**What is Python?**

- Open-source general purpose programming language

**Why Python?**

- Shines when data is unstructured, or analytics need to be embedded in other programs
- Has efficient libraries to process all kinds of data and interact with big data tools like Hadoop or Apache Spark.