Big Data – 5Vs
Veracity – Refers to the biases, noise and abnormality in data.
Variety – Represents the diversity of data. Data sets will vary by type (e.g. social networking, media, text) and by how well they’re structured.
Velocity – Reflects the speed at which data is generated and used.
Volume – Reflects the size of a data set.
Value – Finally, having access to data creates value only when you have the right data from which to glean strategic insights.
Structured Vs Unstructured Data
Structured data are data organized into defined fields (rows and columns), typically numeric, as held in traditional databases and tools (e.g. Excel, Access, SAP, SAS…).
Unstructured data are unorganized text documents, emails, videos, audio and financial transactions.
A/B Testing – An experiment in which two versions (A and B) are compared. They’re identical except for one variation that might affect a user’s behavior. Version A might be the currently used version (control) while Version B is modified in some respect (treatment).
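One common way to judge an A/B test is a two-proportion z-test on conversion rates. The sketch below is illustrative only: the function name and the visitor and conversion counts are made up for this example, and it assumes samples large enough for the normal approximation to hold.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a  # conversion rate of the control (A)
    p_b = conv_b / n_b  # conversion rate of the treatment (B)
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 5,000 visitors per version.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between A and B is unlikely to be due to chance alone; the threshold to use (often 0.05) is a choice made before the experiment.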
Data Discovery – A Business Intelligence (BI) architecture which allows users to explore data for hidden patterns and trends. It focuses on dynamic, easy-to-use reports, whereas traditional BI reports are static.
Descriptive Analytics – Describes what happened in a given situation (e.g. number of posts, mentions, sales closed…).
Diagnostic Analytics – Describes why it happened (e.g. root cause analysis).
Predictive Analytics – Describes what could happen in the future.
Prescriptive Analytics – Describes what should be done.
Cluster Analysis – task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar, in some sense or another, to each other than those in other clusters.
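As an illustration of grouping similar objects, here is a minimal k-means sketch in plain Python. K-means is only one of many clustering methods. The sample points, the choice of k = 2, and the use of the first k points as starting centroids are assumptions made for this example.

```python
def k_means(points, k, iterations=10):
    """Group 2-D points into k clusters; returns (centroids, clusters)."""
    # Assumption: start from the first k points so the result is deterministic.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```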
Comparative Analysis – step-by-step procedure of comparisons and calculations to detect patterns within very large data sets.
Decision Tree Analysis – a decision support tool that uses a tree-like graph of decisions and their possible consequences, including chance event outcomes and resource costs.
Factor Analysis – analysis of large numbers of dependent variables to detect certain aspects of the independent variables (factors) affecting those dependent variables.
Machine Learning – a type of AI which provides computers with the ability to learn without being explicitly programmed.
Multivariate Analysis – observation and analysis of more than one statistical outcome variable at a time.
Regression Analysis – a statistical process for estimating relationships between a dependent variable and one or more independent variables.
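For the simplest case, one independent variable, the relationship can be estimated with ordinary least squares. The sketch below is a minimal illustration; the function name and the ad-spend and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (independent) vs. sales (dependent).
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(ad_spend, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend")
```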
Segmentation Analysis – divides a broad category into subsets that have, or are perceived to have, common features, needs, interests or priorities.
Sentiment Analysis – process of identifying and categorizing opinions expressed in a piece of text to determine whether the writer’s attitude towards a topic or issue is positive, negative or neutral.
Simulation – imitation of the operation of a real-world process or system over time. It requires a model that represents the key characteristics or behaviors of the selected physical or abstract system or process.
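As a rough sketch of the idea, the model below imitates a shop's stock level day by day. Everything in it is an assumption made for illustration: daily demand is drawn uniformly from 0 to 10 units, and a reorder arrives instantly once stock falls to the reorder point.

```python
import random

def simulate_inventory(days, start_stock, reorder_point, order_size, seed=0):
    """Simulate daily stock levels and return the number of stockout days."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    stock = start_stock
    stockout_days = 0
    for _ in range(days):
        demand = rng.randint(0, 10)  # hypothetical demand model
        if demand > stock:
            # Not enough stock to cover today's demand.
            stockout_days += 1
            stock = 0
        else:
            stock -= demand
        if stock <= reorder_point:
            stock += order_size  # assumption: delivery is instant
    return stockout_days

stockouts = simulate_inventory(days=365, start_stock=20, reorder_point=5, order_size=20)
print(f"Days with a stockout: {stockouts}")
```

Re-running with different reorder points or order sizes is a simple way to compare policies before trying them in the real system.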
Time Series Analysis – Comprises methods for analyzing time series data to extract meaningful statistics and other characteristics of the data.
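One of the simplest time series methods is a moving average, which smooths short-term fluctuations to expose the underlying trend. The sketch below is illustrative; the function name, the monthly sales figures and the window size are assumptions for this example.

```python
def moving_average(series, window):
    """Return the simple moving average of a series for a given window size."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    # Each output value is the mean of `window` consecutive observations.
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales figures.
monthly_sales = [10, 12, 14, 13, 15, 18]
print(moving_average(monthly_sales, window=3))
```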
How to organize and process data – questions to ask yourself
- How do we get the data?
- How do we store the data?
- How do we process the data?
- How do we visualize the data?
How do we get the data? – factors to consider
- Manual input
- Point of sales systems
- Web forms
How do we store the data? – factors to consider
- Type of data being collected
- Granularity of the data
- Time the data is captured
- Completeness of the data set
How do we process the data? – tools to consider
How do we visualize the data? – factors to consider
- Power BI
- Easy to use
- Everyone has it
- Extremely versatile
- Dynamic and what-if scenarios
- Automate data processing
- Complex data can be presented in a visually appealing way
What is SAS?
- A licensed software system for data analysis, graphs and report writing.
- Computes simple and complex statistical analyses
What is R?
- Open-source software environment and language for statistical computing and graphics.
- Contains over 7,000 packages in cutting-edge statistics, econometrics, optimization, machine learning and simulation techniques
What is Python?
- Open-source general purpose programming language
- Shines when data is unstructured, or analytics need to be embedded in other programs
- Has efficient libraries to process all kinds of data and interact with big data tools like Hadoop or Apache Spark.