CompileArtisan

Big Data Analytics

Table of Contents

1. Introduction

1.1. What is Big Data?

  • Big Data involves datasets that are too large or complex for traditional data processing application software.
  • Data counts as "too large" when it runs into Gigabytes, Terabytes, Petabytes or maybe even Exabytes.
  • The future involves an unofficial data size called Brontobyte (\(10^{27}\) bytes).
  • The term “big data” was first coined by John Mashey in the 90s.

1.2. What is Big Data Analytics?

  • Process of uncovering trends, patterns and correlations in large datasets.
  • This is useful for risk management, and making better decisions.

1.3. Three V’s (Currently there are a lot more V’s)

  • These are core challenges involved with big data.
  • These were coined by Doug Laney in 2001, in a report for Gartner.

1.3.1. Volume

  • The data itself is in larger units.

1.3.2. Velocity

  • This is the rate at which data is being generated.
  • As of 2022, about 2.5 quintillion bytes of data were created every day; for 2026, that number is estimated at 3.5 quintillion. Refer to the common sizes below.
  • This emphasizes how close we are to real-time processing of data.

1.3.3. Variety

  1. Structured
    • Data with a fixed format.
    • For example, spreadsheets and databases are structured data.
    • DBMS operations on structured data are very fast, and this scales very well.
  2. Unstructured
    • Data with no fixed format.
    • For example, images, videos, text.
    • 80% of the data is unstructured.
    • The various ways to deal with unstructured data are:
      • Data mining
      • NLP
      • Text Analytics
      • Noisy Text Analytics
  3. Semi-Structured
    • It might have a fixed format, but might contain unstructured data.
    • XML is a great example - it has a format (tags), but the content can be unstructured. A simpler example would be an email (from, to, CC, BCC form a structure, but the content can be anything).
    • It’s often either data with labels, or data in the form of key-value pairs.
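For instance, an email-like record can be modelled as JSON (a hypothetical record; the header fields form the structured part, while the body is free-form text):

```python
import json

# A made-up email-like record: "from", "to" and "subject" are fixed,
# queryable fields, but "body" is unstructured text.
record = json.loads("""
{
  "from": "alice@example.com",
  "to": ["bob@example.com"],
  "subject": "Quarterly report",
  "body": "Hi Bob, attached are the Q3 numbers we discussed."
}
""")

# The structured fields can be queried directly...
print(record["from"])  # alice@example.com
# ...while the body would need text analytics / NLP to interpret.
print(len(record["body"].split()))
```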

1.3.4. Newer V’s introduced in recent times

  1. Veracity
    • Trustworthiness/credibility of data.
  2. Value
    • The ability to extract meaningful data.
  3. Validity
    • Is the data correct or appropriate for its intended use?
  4. Visualization
    • Putting data in a human-interpretable form.
  5. Variability
    • Changes in the nature of the data.
  6. Vulnerability
    • Compromises in system security.
  7. Viscosity
    • How easy/difficult it is to combine/transfer data.
  8. Volatility
    • How long data is valid for, and how long it should be stored.
    • This is the characteristic of data that deals with its retention.
  9. Virality
    • How far the existing data spreads around.

1.4. Human-Generated Data vs Machine-Generated Data

  • Examples of human-generated data are text messages, emails, documents.
  • Examples of machine-generated data are outputs of sensors or system logs.

1.5. Issues with Terminology related to Structure

  • HTML is often debated as being either unstructured or semi-structured.
  • This is because questions are raised on whether its structure (the tags) actually helps analytics or not.

2. Definition of Big Data

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

3. Types of Data Analytics

3.1. Descriptive Analytics

  • Describe the data i.e. tell you what happened in the past.
  • This is done by summarizing the past data into a human-interpretable form (eg. charts, graphs, etc).

3.2. Diagnostic Analytics

  • Understand why something happened in the past.
  • This is done using techniques like drill-down, data-mining and data-discovery.

3.3. Predictive Analytics

  • Predicts what’s most likely to happen in the future.

3.4. Prescriptive Analytics

  • What actions to take to affect those outcomes.

4. Database Architectures

4.1. OLTP

  • Stands for Online Transaction Processing.
  • This is essentially what a traditional DBMS does.

4.2. OLAP

  • Online Analytical Processing.
  • This is also called data warehousing; these are ways to analyse the results of OLTP.

4.3. RTAP

  • Real-time Analytics Processing.
  • It’s also known as HTAP - Hybrid Transactional/Analytical Processing.

5. Change in Trend

  • Models have changed: previously, few people generated big data while many consumed it.
  • Now, everyone is both generating and consuming big data.

6. Size Matters

6.1. In Bits

Unit      Equivalent
Bit       1 bit
Byte      8 bits
Kibibyte  1024 bytes
Kilobyte  1000 bytes
Mebibyte  1024^2 bytes
Megabyte  1000^2 bytes
Gibibyte  1024^3 bytes
Gigabyte  1000^3 bytes
Tebibyte  1024^4 bytes
Terabyte  1000^4 bytes

6.2. In General

Unit         Equivalent Number
Thousand     10^3
Million      10^6
Billion      10^9
Trillion     10^12
Quadrillion  10^15
Quintillion  10^18

As a function of time:

  • 1 Million Seconds is 12 days
  • 1 Billion Seconds is 31.7 years
  • 1 Trillion Seconds is 31.7 Thousand (31,700) years
  • 1 Quadrillion Seconds is 31.7 Million Years
  • 1 Quintillion Seconds is 31.7 Billion Years

6.3. Traditional Business Intelligence vs Big Data

Traditional Business Intelligence  Big Data
Data size is GB-TB                 Data size is PB to ZB/YB
Centralized file system            Distributed file system
Structured data                    Variable data
Move Data to Code                  Move Code to Data

6.4. What is a Data lake

  • A data lake is a data repository that stores data in its native/original format until needed.
  • Data lakes can store both relational and non-relational data, while data warehouses are purely relational in nature.
  • Data lakes are way easier to scale at a lower cost, compared to data warehouses.

7. Transaction Processing

  • Ensures that DBMS operations happen as one single indivisible unit with the help of ACID properties

7.1. Atomicity

  • Transaction either completely finishes, or doesn’t happen at all. There’s no partially completed state.

7.2. Consistency

  • Changes must be reflected everywhere.
  • All integrity constraints must be satisfied.

7.3. Isolation

  • Two transactions shouldn’t interfere with each other

7.4. Durability

  • Transactions must be permanent.
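Atomicity can be seen in miniature with SQLite (a single-node sketch, not a distributed system; the accounts table and balances are made up):

```python
import sqlite3

# A failed money transfer rolls back as one indivisible unit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # the with-block commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transaction")
        # the matching credit to bob is never reached
except RuntimeError:
    pass

# Atomicity: alice's debit was rolled back, so nothing changed.
print(conn.execute(
    "SELECT balance FROM accounts WHERE name='alice'").fetchone()[0])  # 100
```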

8. Challenges

8.1. Vertical and Horizontal Scaling

  • Vertical Scaling (scaling-up) is adding more CPUs, RAM to a single server.
  • Horizontal Scaling (scaling-out) is adding more servers altogether to distribute the work load.

8.2. Security

  • Lack of proper authentication and authorization mechanisms.

8.3. Schema

  • Has to be dynamic.

8.4. Consistency

8.5. Continuous Availability

8.6. Partition Tolerance

  • This is regarding the partitions in a network.
  • Partition tolerance tells us how we have to take care of hardware and software failures involved in a network.

8.7. Data Quality

8.7.1. 3Cs

  • Completeness
  • Consistency
  • Cleanliness

9. Brewer’s CAP Theorem

9.1. What CAP is

9.1.1. Consistency

  • All reads receive the most recent write or an error.
  • Everyone always sees the same latest data.

9.1.2. Availability

  • All reads contain data, but it might not be the most recent.
  • The system will always provide a response, but that response could be outdated too.

9.1.3. Partition Tolerance

  • The system continues to operate despite network failures.

9.2. CAP Theorem

  • A system that has consistency and partition tolerance, but compromises availability is called a CP system. (note that it’s compromised, not non-existent).
  • A system that has availability and partition tolerance, but compromises consistency is called an AP system.
  • You can make a CP system or an AP system.
  • The theorem says that partition tolerance is unavoidable, because we are dealing with networks.
  • Departing from the CAP theorem, if P were compromised, it would be a CA system. There’s no network involved, and hence we’re talking about your local system.
  • Refer to CAP Twelve Years Later: How the “Rules” Have Changed - InfoQ.

9.3. Examples of Different Systems

9.3.1. CP

  • Availability meant there will always be a response (despite being outdated).
  • Lack of availability simply means during partition, requests could be rejected.
  • It chooses failure over giving wrong data.
  • MongoDB is an example, because one server is primary and everything else is a replica. If it can’t reach the primary server, the write won’t happen.

9.3.2. AP

  • Consistency meant all reads from any server either show the most recent write, or give an error.
  • The lack of consistency means that it chooses to give slightly incorrect/inconsistent data instead of just failing.
  • Apache Cassandra was designed at Facebook, where downtime is unacceptable. So instead of just failing, it’s okay to be a little inconsistent.

9.3.3. CA

  • Like previously mentioned, the lack of partition tolerance means there’s no distributed network to partition.
  • Some examples include MySQL or Oracle Database.

10. BASE Theorem

  • BASE stands for Basically Available, Soft-state, Eventually Consistent.
  • BASE theorem provides a good model for AP systems and how they can deal with the lack of consistency.

10.1. Basically Available

  • If a particular node fails in a network, data on that node will not be available. But the entire data layer is still operational.
  • It conveys the same idea as availability in the CAP theorem.

10.2. Soft-state

  • The design principle in which the state of the system may change over a period of time, even without any input, is called soft-state.
  • Maintaining strong consistency might be costly and could be impractical at times. This is when soft-state is preferred.

10.3. Eventually Consistent

  • It’s a guarantee that the system will have consistent values applied at a certain point of time, if not immediately.

11. NoSQL

  • It stands for Not Only SQL, and it was coined by Carlo Strozzi in 1998.

11.1. Sharding

  • Splitting a large dataset into smaller, more manageable pieces called “shards” is called sharding.
  • These shards are distributed across multiple systems to improve scalability and performance. (because this paves way for parallel processing).
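The idea can be sketched with hash-based sharding, one common scheme (the shard_for helper and NUM_SHARDS are illustrative, not any particular database's implementation):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Hash the key so that placement is deterministic and roughly uniform.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Route each record to one of the shards (each shard would live on
# a different machine in a real deployment).
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ("user-1", "user-2", "user-3", "user-42"):
    shards[shard_for(user_id)].append(user_id)

print(shards)  # each key lands deterministically on exactly one shard
```

Because each lookup touches only one shard, queries for different keys can be served by different machines in parallel.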

11.2. Types of NoSQL

  1. Key-Value
    • Designed for horizontal scaling.
    • Eg. Redis
  2. Document Oriented
    • Stores JSON-like documents.
    • Eg. MongoDB
  3. Column Oriented
    • Works on columns instead of rows; good for faster analytics.
    • Eg. Cassandra
  4. Graph Based
    • Good for representing relationships.
    • Eg. InfiniteGraph
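As a tiny illustration of the key-value model, the simplest of the four: the KVStore class below is a made-up, in-process stand-in for a real store like Redis (values are opaque to the store, which is what makes sharding them so easy):

```python
class KVStore:
    """Toy key-value store: opaque values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KVStore()
# The store never inspects the value; it only routes by key.
store.put("session:42", {"user": "alice", "cart": ["book"]})
print(store.get("session:42")["user"])  # alice
```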

11.3. Difference between SQL, NoSQL and NewSQL

                                          Traditional SQL          NoSQL                         NewSQL
Horizontal Scaling/Distributed Computing  No                       Yes                           Yes
Data Format Flexibility                   No                       Yes                           Maybe
OLTP/OLAP                                 Yes                      No                            Yes
Community Support                         Huge                     Growing                       Slowly Growing
Properties                                Follows ACID properties  Follows Brewer's CAP theorem  Follows both
Hierarchical Data (tree-like structure)   Not preferred            Best (JSON stores this well)  

12. Hadoop

  • It’s an open-source framework written in Java used to process enormous data sets.
  • It was first developed by Doug Cutting and Mike Cafarella in 2005.
  • Hadoop allows distributed processing, across a cluster of commodity hardware (inexpensive, widely available hardware).
  • Hadoop was inspired by Google File System for distributed storage.
  • In 2007, Yahoo started using Hadoop on a 1000-node cluster. In 2008, Hadoop beat supercomputers by winning the terabyte sort benchmark.

12.1. Hadoop Cluster

  • A Hadoop Cluster is a collection of computers (which are broadly of two types) which store and work on data.
  • The two types of computers in a cluster, are master node and slave node.
    • The master node (primary node) is responsible for managing the cluster.
    • The slave nodes (secondary nodes) are the workers which actually store data and perform computation.
  • One cluster has a single namespace.

12.2. Hadoop Versions

12.2.1. Hadoop 1.x

These are the components of a node using Hadoop 1.x.

  1. HDFS
    • This is for reliable storage.
    • HDFS stands for Hadoop Distributed File System.
    • Given any file, you split the file into blocks

      File (1GB)
      ↓
      Block1 (128MB)
      Block2 (128MB)
      Block3 (128MB)
      ...
      

      Then you distribute the blocks across different nodes:

      Node1 → Block1
      Node2 → Block2
      Node3 → Block3
      Node4 → Block4
      
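A quick back-of-the-envelope check of the split above (128 MB is the Hadoop 2.x default block size):

```python
# How many HDFS blocks does a 1 GB file need at a 128 MB block size?
BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB
file_size = 1 * 1024 * 1024 * 1024      # 1 GB

full_blocks, remainder = divmod(file_size, BLOCK_SIZE)
num_blocks = full_blocks + (1 if remainder else 0)  # last block may be partial
print(num_blocks)  # 8
```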
    1. Blocks
      • This is the minimum amount of data that can be read or written (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x).
      • By default there are 3 replicas of data, and they will be placed across at least 2 racks.
        • The first replica is placed on the local node
        • The second and third replicas are placed on a remote rack during the write pipeline.
      • The client reads from the nearest replica.
    2. Rack
      • It’s a collection of 40-50 data nodes using the same network switch.
      Rack1    Rack2    Rack3
      Block1   _        _
      _        Block1   _
      _        _        _
      _        Block1   _
    3. Rack Awareness
      • Choosing data node racks that are closest to each other is called rack awareness.
      • Improves cluster performance by reducing network traffic.
      • This is how you move the computation to where the data is stored.
    4. Master/Slave Architecture - Node
      1. NameNode
        • Master of the system
        • Maintains and manages the blocks present on the Data Node.
        • It doesn’t store the files themselves. It only stores metadata, locates the files and tracks changes.
        • These are up and running all the time.
        • There also exists a Secondary NameNode; despite the name, it is a checkpointing helper rather than a hot standby.
        • There are two components of a NameNode:

          • FSImage (File System Image): a point-in-time snapshot of the file system metadata.
          • Edit Logs: journals recording every change (write/delete) made to the namespace since the last snapshot.
        1. Secondary NameNode
          • The NameNode only stores metadata (FSImage) and edit logs, so in case it restarts, it has to replay the entire log to rebuild the namespace.
          • The Secondary NameNode pulls the current FSImage and Edit Logs from the Active NameNode.
          • It merges them, creating a new, up-to-date FSImage. This means the current snapshot is modified to accommodate the new changes presented in Edit logs.
          • The new FSImage is sent back to the primary NameNode, and the old edit log is cleared.
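The checkpoint merge can be sketched as follows (a toy model: the paths, block names and operations are made up, and real HDFS metadata is far richer):

```python
# Last snapshot of the namespace (path -> block list).
fsimage = {"/data/a.txt": ["blk_1"], "/data/b.txt": ["blk_2"]}

# Changes journaled since that snapshot.
edit_log = [
    ("create", "/data/c.txt", ["blk_3"]),
    ("delete", "/data/a.txt", None),
]

def checkpoint(image, edits):
    """Replay the edit log on top of the snapshot -> fresh FSImage."""
    new_image = dict(image)          # snapshot is copied, not mutated
    for op, path, blocks in edits:
        if op == "create":
            new_image[path] = blocks
        elif op == "delete":
            new_image.pop(path, None)
    return new_image                 # sent back; the edit log can be cleared

print(checkpoint(fsimage, edit_log))
# {'/data/b.txt': ['blk_2'], '/data/c.txt': ['blk_3']}
```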
      2. DataNode
        • These are the machines that actually store data and serve it to the client.
        • They send heartbeat signals to the NameNode about every 3 seconds to indicate health.
        • Responsible for serving read/write requests to the client.
  2. MapReduce
    • MapReduce is a programming model for cluster resource management and data processing.
    • It’s also known as Hadoop’s processing unit.
    • Map is for loading, parsing, transforming and filtering key-value pairs.
    • Reduce is for grouping and aggregating data produced by map.
    • There are various steps for MapReduce:
    1. Record Reader
      • You read documents, where the key is the name of the file and the value is the contents of the file.
      • For example, a <key, value> pair could be <input.txt, "the quick brown fox the">.
    2. Mapper
      • Mapper splits each word from the value, and now each word will be the key and the count of that word will be the value.
      • So the output of Mapper would be:

        <"the", 1>
        <"quick", 1>
        <"brown", 1>
        <"fox", 1>
        <"the", 1>
        

        Each occurrence is emitted separately, so if a word appears twice, there will be duplicate keys, and each key will have value 1.

    3. Combiner
      • This reads each key-value pair and for common keys, it combines the values as a collection.
      • So the output of Combiner would be:

        <"the", [1, 1]>
        <"quick", [1]>
        <"brown", [1]>
        <"fox", [1]>
        
    4. Reducer
      • This reduces a set of intermediate values (the values are now a collection), into a final value.
      • So the output of Reducer would be:

        <"the", 2>
        <"quick", 1>
        <"brown", 1>
        <"fox", 1>
        
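The four steps above can be sketched as a single in-process pipeline (illustrative Python, not the Hadoop Java API; the real thing runs distributed across nodes):

```python
from collections import defaultdict

def record_reader(files):
    for name, contents in files.items():
        yield name, contents                  # <filename, contents>

def mapper(_, contents):
    for word in contents.split():
        yield word, 1                         # <word, 1> per occurrence

def combiner(pairs):
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)           # <word, [1, 1, ...]>
    return grouped

def reducer(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

files = {"input.txt": "the quick brown fox the"}
pairs = [kv for name, text in record_reader(files)
         for kv in mapper(name, text)]
print(reducer(combiner(pairs)))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1}
```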
    5. Some Map-Reduce Terminology
      Job
       ├── Task (Map)
       │    ├── Task Attempt 1
       │    ├── Task Attempt 2
       │    └── Task Attempt 3
       ├── Task (Map)
       │    └── Task Attempt 1
       └── Task (Reduce)
            ├── Task Attempt 1
            └── Task Attempt 2
      
      1. Job
        • The problem statement i.e. what you want to run Map-Reduce on.
      2. Task
        • Smaller piece of a job
        • There’s the map task and the reduce task
      3. Task Attempt
        • It’s an execution attempt of a task.
        • For one task, there may be multiple task attempts, because of node failures or task crashes.
    6. Some applications of MapReduce
      1. Reverse Web-Link Graph
        • You have a bunch of tuples (a,b) where there’s a link from page a to b.
        • The output is that for each page we have to display the list of pages that link to it.

          For example, given (a,b), (c,b) and (d,b), the output will be

          <b, [a,c,d]>
          
        • Map will take (a,b) as input and output (b,a) (it simply swaps, so that we can group by b)

          map(a, b):
              emit(b, a)
          
        • Hadoop automatically does something called shuffle and sort phase, where all entries with the same key are grouped.
        • The reduce phase would simply output the list.

          reduce(b, list_of_a):
              emit(b, list_of_a)
          
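The pseudocode above can be run end-to-end in plain Python (the shuffle function here stands in for Hadoop's shuffle-and-sort phase):

```python
from collections import defaultdict

links = [("a", "b"), ("c", "b"), ("d", "b")]   # (source, target) tuples

def map_phase(edges):
    for src, dst in edges:
        yield dst, src                         # swap so we can group by target

def shuffle(pairs):
    # What Hadoop's shuffle-and-sort does: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {page: sorted(srcs) for page, srcs in grouped.items()}

print(reduce_phase(shuffle(map_phase(links))))  # {'b': ['a', 'c', 'd']}
```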
    7. Some things you CAN’T use Map-Reduce for
      • Real-time analytics because of high latency. MapReduce was designed for massive batch jobs.
      • Iterative algorithms because re-reading and writing data to HDFS is expensive.
      • A large number of small files, because
        • They can overwhelm the NameNode and create metadata overhead.
        • The shuffle and sort phase moves data all over the nodes consuming bandwidth.
      • Tasks which need Caching, because in MapReduce, the same dataset is re-read from the disk for every subsequent job.
    8. Daemons in Hadoop 1.x
      • NameNode (Master Node in HDFS)
      • Secondary NameNode (Helper)
      • DataNode (Slave Node in HDFS)
      • JobTracker (Master Node in MapReduce)
      • TaskTracker (Slave Node in MapReduce)

12.2.2. Hadoop 2.x

These are the components of a node using Hadoop 2.x.

  1. HDFS
    • Reliable Storage.
    • HDFS is still the file system of Hadoop 2.x.
    1. Standby NameNode
      • Along with primary and secondary NameNode, there also exists a Standby NameNode.
      • It acts as a passive backup to the active NameNode.
    2. Data Pipelining
      • It’s a multistep processing of moving large volumes of data across a distributed computing cluster.
  2. MapReduce
    • Data processing.
    • There is no longer any concept of JobTracker or TaskTracker.
  3. Yarn
    • Cluster Resource Management.
    • Yarn stands for Yet Another Resource Negotiator.
    • Yarn is called the operating system of Hadoop 2.x.
    • Yarn defines two resources:
      • v-cores: virtual cores (CPU power allocated to a container on a node)
      • memory: memory allocated to a container on a node
  4. Daemons in Hadoop 2.x
    • NameNode (Master node in HDFS)
    • Secondary NameNode (helper)
    • DataNode (slave node in HDFS)
    • Resource Manager (master node in YARN)
      • This keeps a running total of the cluster’s available resources.
    • NodeManager (slave node in YARN)
    • ApplicationMaster
      • This isn’t a daemon, but it’s a framework-specific library to manage the entire lifecycle of a job.
    • Container
      • These are the bundles of resources (memory, v-cores) allocated to a task on a single node of the cluster.
  5. How Yarn works
    • First the application talks to the ResourceManager for the cluster
    • In response, the ResourceManager first allocates one single container.
    • The ApplicationMaster starts running within that container.
    • The ApplicationMaster requests subsequent containers from ResourceManager.
    • Once all tasks are finished, the ApplicationMaster exits and the container is de-allocated.
  6. Key Features
    1. Scalability
    2. Flexibility
      • Supports way more processing models than just MapReduce, including interactive SQL and real-time streaming.
    3. Improved Utilization

12.3. HBase

  • HBase is an open-source, non-relational distributed database.
  • It’s a column oriented NoSQL database, that runs on top of HDFS.
HDFS                            HBase
Does batch processing           Does real-time processing
Sequential access of data       Random access of data
It's a file system              It's a column-oriented database (a data warehouse)

13. Apache Spark

13.1. What it is

  • Instead of just “map” and “reduce”, you define a large set of operations (transformations and actions).
  • It was started by Matei Zaharia.
  • It’s a unified computing engine and a set of libraries for parallel data processing on computer clusters.

13.2. Resilient Distributed Dataset

  • All of the data is held as an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
  • It’s an in-memory collection of items/data distributed across many compute nodes.
  • Spark core provides many APIs for building and manipulating these collections.
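The RDD idea can be mimicked in plain Python; the ToyRDD class below is an illustrative stand-in (not the real Spark API): transformations only record lineage on a new immutable object, and nothing executes until an action like collect() is called.

```python
class ToyRDD:
    def __init__(self, data, ops=()):
        self._data = tuple(data)     # immutable source (partitioning elided)
        self._ops = ops              # recorded transformations (the lineage)

    def map(self, fn):               # transformation: returns a NEW ToyRDD
        return ToyRDD(self._data, self._ops + (("map", fn),))

    def filter(self, fn):            # transformation: also lazy
        return ToyRDD(self._data, self._ops + (("filter", fn),))

    def collect(self):               # action: lineage is evaluated here
        items = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

Because the lineage is recorded rather than executed eagerly, a lost partition in real Spark can be recomputed from its lineage, which is where the fault tolerance comes from.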