CompileArtisan

Big Data Analytics

Table of Contents

1. Introduction

1.1. What is Big Data?

  • Big Data involves datasets that are too large or complex for traditional data processing application software.
  • Data counts as "too large" when it runs into Gigabytes, Terabytes, Petabytes or maybe even Exabytes.
  • The future involves an unofficial data size called Brontobyte (\(10^{27}\) bytes).
  • The term “big data” was first coined by John Mashey in the 90s.

1.2. What is Big Data Analytics?

  • Process of uncovering trends, patterns and correlations in large datasets.
  • This is useful for risk management, and making better decisions.

1.3. Three V’s (Currently there are a lot more V’s)

  • These are core challenges involved with big data.
  • These were coined by Doug Laney in 2001, in a report for Gartner.

1.3.1. Volume

  • The data itself is in larger units.

1.3.2. Velocity

  • This is the rate at which data is being generated.
  • As of 2022, about 2.5 quintillion bytes of data were created every day; for 2026, that number is estimated at 3.5 quintillion. Refer to the common sizes below.
  • This emphasizes how close we are to real-time processing of data.

1.3.3. Variety

  1. Structured
    • Data with a fixed format.
    • For example, spreadsheets and databases are structured data.
    • DBMS operations on structured data are very fast, and this scales very well.
  2. Unstructured
    • Data with no fixed format.
    • For example, images, videos, text.
    • 80% of the data is unstructured.
    • The various ways to deal with unstructured data are:
      • Data mining
      • NLP
      • Text Analytics
      • Noisy Text Analytics
  3. Semi-Structured
    • It might have a fixed format, but might contain unstructured data.
    • XML is a great example - it has a format (tags), but the content can be unstructured. A simpler example would be an email (from, to, CC, BCC form a structure, but the content can be anything).
    • It’s often either data with labels, or data in the form of key-value pairs.
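For instance, an email-like record can be modelled as JSON (a hypothetical record; the header fields form the structured part, while the body is free-form text):

```python
import json

# A made-up email-like record: "from", "to" and "subject" are fixed,
# queryable fields, but "body" is unstructured text.
record = json.loads("""
{
  "from": "alice@example.com",
  "to": ["bob@example.com"],
  "subject": "Quarterly report",
  "body": "Hi Bob, attached are the Q3 numbers we discussed."
}
""")

# The structured fields can be queried directly...
print(record["from"])  # alice@example.com
# ...while the body would need text analytics / NLP to interpret.
print(len(record["body"].split()))
```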

1.3.4. Newer V’s introduced in recent times

  1. Veracity
    • Trustworthiness/credibility of data.
  2. Value
    • The ability to extract meaningful data.
  3. Validity
    • Is the data correct or appropriate for its intended use?
  4. Visualization
    • Putting data in a human-interpretable form.
  5. Variability
    • Changes in the nature of the data.
  6. Vulnerability
    • Compromises in system security.
  7. Viscosity
    • How easy/difficult it is to combine/transfer data.
  8. Volatility
    • How long data is valid for, and how long it should be stored.
    • This is the characteristic of data that deals with its retention.
  9. Virality
    • How far the existing data spreads around.

1.4. Human-Generated Data vs Machine-Generated Data

  • Examples of human-generated data are text messages, emails, documents.
  • Examples of machine-generated data are outputs of sensors or system logs.

1.5. Issues with Terminology related to Structure

  • HTML is often debated as being either unstructured or semi-structured.
  • This is because questions are raised on whether its structure (the tags) actually helps analytics or not.

2. Definition of Big Data

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

3. Types of Data Analytics

3.1. Descriptive Analytics

  • Describe the data i.e. tell you what happened in the past.
  • This is done by summarizing the past data into a human-interpretable form (eg. charts, graphs, etc).

3.2. Diagnostic Analytics

  • Understand why something happened in the past.
  • This is done using techniques like drill-down, data-mining and data-discovery.

3.3. Predictive Analytics

  • Predicts what’s most likely to happen in the future.

3.4. Prescriptive Analytics

  • What actions to take to affect those outcomes.

4. Database Architectures

4.1. OLTP

  • Stands for Online Transaction Processing.
  • This is essentially what a traditional DBMS does.

4.2. OLAP

  • Online Analytical Processing.
  • This is also called data warehousing; these are ways to analyse the results of OLTP.

4.3. RTAP

  • Real-time Analytics Processing.
  • It’s also known as HTAP - Hybrid Transactional/Analytical Processing.

5. Change in Trend

  • Models have changed: previously, few people generated big data while many consumed it.
  • Now, everyone is both generating and consuming big data.

6. Size Matters

6.1. In Bits

Unit      Equivalent
Bit       1 bit
Byte      8 bits
Kibibyte  1024 bytes
Kilobyte  1000 bytes
Mebibyte  1024^2 bytes
Megabyte  1000^2 bytes
Gibibyte  1024^3 bytes
Gigabyte  1000^3 bytes
Tebibyte  1024^4 bytes
Terabyte  1000^4 bytes

6.2. In General

Unit         Equivalent Number
Thousand     10^3
Million      10^6
Billion      10^9
Trillion     10^12
Quadrillion  10^15
Quintillion  10^18

As a function of time:

  • 1 Million Seconds is 12 days
  • 1 Billion Seconds is 31.7 years
  • 1 Trillion Seconds is 31.7 Thousand (31,700) years
  • 1 Quadrillion Seconds is 31.7 Million Years
  • 1 Quintillion Seconds is 31.7 Billion Years

6.3. Traditional Business Intelligence vs Big Data

Traditional Business Intelligence  Big Data
Data size is GB-TB                 Data size is PB to ZB/YB
Centralized file system            Distributed file system
Structured data                    Variable data
Move Data to Code                  Move Code to Data

6.4. What is a Data lake

  • A data lake is a data repository that stores data in its native/original format until needed.
  • Data lakes can store both relational and non-relational data, while data warehouses are purely relational in nature.
  • Data lakes are way easier to scale at a lower cost, compared to data warehouses.

7. Transaction Processing

  • Ensures that DBMS operations happen as one single indivisible unit with the help of ACID properties

7.1. Atomicity

  • Transaction either completely finishes, or doesn’t happen at all. There’s no partially completed state.

7.2. Consistency

  • Changes must be reflected everywhere.
  • All integrity constraints must be satisfied.

7.3. Isolation

  • Two transactions shouldn’t interfere with each other

7.4. Durability

  • Transactions must be permanent.
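Atomicity can be seen in miniature with SQLite (a single-node sketch, not a distributed system; the accounts table and balances are made up):

```python
import sqlite3

# A failed money transfer rolls back as one indivisible unit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # the with-block commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transaction")
        # the matching credit to bob is never reached
except RuntimeError:
    pass

# Atomicity: alice's debit was rolled back, so nothing changed.
print(conn.execute(
    "SELECT balance FROM accounts WHERE name='alice'").fetchone()[0])  # 100
```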

8. Challenges

8.1. Vertical and Horizontal Scaling

  • Vertical Scaling (scaling-up) is adding more CPUs, RAM to a single server.
  • Horizontal Scaling (scaling-out) is adding more servers altogether to distribute the work load.

8.2. Security

  • Lack of proper authentication and authorization mechanisms.

8.3. Schema

  • Has to be dynamic.

8.4. Consistency

8.5. Continuous Availability

8.6. Partition Tolerance

  • This is regarding the partitions in a network.
  • Partition tolerance tells us how we have to take care of hardware and software failures involved in a network.

8.7. Data Quality

8.7.1. 3Cs

  • Completeness
  • Consistency
  • Cleanliness

9. Brewer’s CAP Theorem

9.1. What CAP is

9.1.1. Consistency

  • All reads receive the most recent write or an error.
  • Everyone always sees the same latest data.

9.1.2. Availability

  • All reads contain data, but it might not be the most recent.
  • The system will always provide a response, but that response could be outdated too.

9.1.3. Partition Tolerance

  • The system continues to operate despite network failures.

9.2. CAP Theorem

  • A system that has consistency and partition tolerance, but compromises availability is called a CP system. (note that it’s compromised, not non-existent).
  • A system that has availability and partition tolerance, but compromises consistency is called an AP system.
  • You can make a CP system or an AP system.
  • The theorem says that partition tolerance is unavoidable, because we are dealing with networks.
  • Departing from the CAP theorem, if P were compromised, it would be a CA system. There’s no network involved, and hence we’re talking about your local system.
  • Refer to CAP Twelve Years Later: How the “Rules” Have Changed - InfoQ.

9.3. Examples of Different Systems

9.3.1. CP

  • Availability meant there will always be a response (despite being outdated).
  • Lack of availability simply means during partition, requests could be rejected.
  • It chooses failure over giving wrong data.
  • MongoDB is an example, because one server is primary and everything else is a replica. If it can’t reach the primary server, the write won’t happen.

9.3.2. AP

  • Consistency meant all reads from any server either show the most recent write, or give an error.
  • The lack of consistency means that it chooses to give slightly incorrect/inconsistent data instead of just failing.
  • Apache Cassandra was designed at Facebook, where downtime is unacceptable. So instead of just failing, it’s okay to be a little inconsistent.

9.3.3. CA

  • Like previously mentioned, the lack of partition tolerance means there’s no distributed network to partition.
  • Some examples include MySQL or Oracle Database.

10. BASE Theorem

  • BASE stands for Basically Available, Soft-state, Eventually Consistent.
  • BASE theorem provides a good model for AP systems and how they can deal with the lack of consistency.

10.1. Basically Available

  • If a particular node fails in a network, data on that node will not be available. But the entire data layer is still operational.
  • It conveys the same idea as availability in the CAP theorem.

10.2. Soft-state

  • The design principle in which the state of the system may change over a period of time, even without any input, is called soft-state.
  • Maintaining strong consistency might be costly and could be impractical at times. This is when soft-state is preferred.

10.3. Eventually Consistent

  • It’s a guarantee that the system will have consistent values applied at a certain point of time, if not immediately.

11. NoSQL

  • It stands for Not Only SQL, and it was coined by Carlo Strozzi in 1998.

11.1. Sharding

  • Splitting a large dataset into smaller, more manageable pieces called “shards” is called sharding.
  • These shards are distributed across multiple systems to improve scalability and performance. (because this paves way for parallel processing).
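The idea can be sketched with hash-based sharding, one common scheme (the shard_for helper and NUM_SHARDS are illustrative, not any particular database's implementation):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Hash the key so that placement is deterministic and roughly uniform.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Route each record to one of the shards (each shard would live on
# a different machine in a real deployment).
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ("user-1", "user-2", "user-3", "user-42"):
    shards[shard_for(user_id)].append(user_id)

print(shards)  # each key lands deterministically on exactly one shard
```

Because each lookup touches only one shard, queries for different keys can be served by different machines in parallel.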

11.2. Types of NoSQL

  1. Key-Value
    • Designed for horizontal scaling.
    • Eg. Redis
  2. Document Oriented
    • Stores JSON-like documents.
    • Eg. MongoDB
  3. Column Oriented
    • Works on columns instead of rows; good for faster analytics.
    • Eg. Cassandra
  4. Graph Based
    • Good for representing relationships.
    • Eg. InfiniteGraph
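As a tiny illustration of the key-value model, the simplest of the four: the KVStore class below is a made-up, in-process stand-in for a real store like Redis (values are opaque to the store, which is what makes sharding them so easy):

```python
class KVStore:
    """Toy key-value store: opaque values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KVStore()
# The store never inspects the value; it only routes by key.
store.put("session:42", {"user": "alice", "cart": ["book"]})
print(store.get("session:42")["user"])  # alice
```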

11.3. Difference between SQL, NoSQL and NewSQL

                                          Traditional SQL          NoSQL                         NewSQL
Horizontal Scaling/Distributed Computing  No                       Yes                           Yes
Data Format Flexibility                   No                       Yes                           Maybe
OLTP/OLAP                                 Yes                      No                            Yes
Community Support                         Huge                     Growing                       Slowly Growing
Properties                                Follows ACID properties  Follows Brewer's CAP theorem  Follows both
Hierarchical Data (tree-like structure)   Not preferred            Best (JSON stores this well)  

12. Hadoop

  • It’s an open-source framework written in Java used to process enormous data sets.
  • It was first developed by Doug Cutting and Mike Cafarella in 2005.
  • Hadoop allows distributed processing, across a cluster of commodity hardware (inexpensive, widely available hardware).
  • Hadoop was inspired by Google File System for distributed storage.
  • In 2007, Yahoo started using Hadoop on a 1000-node cluster. In 2008, Hadoop beat supercomputers by winning the terabyte sort benchmark.

12.1. Hadoop Cluster

  • A Hadoop Cluster is a collection of computers (which are broadly of two types) which store and work on data.
  • The two types of computers in a cluster, are master node and slave node.
    • The master node (primary node) is responsible for managing the cluster.
    • The slave nodes (secondary nodes) are the workers which actually store data and perform computation.
  • One cluster has a single namespace.

12.2. Hadoop Versions

12.2.1. Hadoop 1.x

These are the components of a node using Hadoop 1.x.

  1. HDFS
    • This is for reliable storage.
    • HDFS stands for Hadoop Distributed File System.
    • Given any file, you split the file into blocks

      File (1GB)
      ↓
      Block1 (128MB)
      Block2 (128MB)
      Block3 (128MB)
      ...
      

      Then you distribute the blocks across different nodes:

      Node1 → Block1
      Node2 → Block2
      Node3 → Block3
      Node4 → Block4
      
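A quick back-of-the-envelope check of the split above (128 MB is the Hadoop 2.x default block size):

```python
# How many HDFS blocks does a 1 GB file need at a 128 MB block size?
BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB
file_size = 1 * 1024 * 1024 * 1024      # 1 GB

full_blocks, remainder = divmod(file_size, BLOCK_SIZE)
num_blocks = full_blocks + (1 if remainder else 0)  # last block may be partial
print(num_blocks)  # 8
```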
    1. Blocks
      • This is the minimum amount of data that can be read or written (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x).
      • By default there are 3 replicas of data, and they will be placed across at least 2 racks.
        • The first replica is placed on the local node
        • The second and third replicas are placed on a remote rack during the write pipeline.
      • The client reads from the nearest replica.
    2. Rack
      • It’s a collection of 40-50 data nodes using the same network switch.
      Rack1    Rack2    Rack3
      Block1   _        _
      _        Block1   _
      _        _        _
      _        Block1   _
    3. Rack Awareness
      • Choosing data node racks that are closest to each other is called rack awareness.
      • Improves cluster performance by reducing network traffic.
      • This is how you move the computation to where the data is stored.
    4. Master/Slave Architecture - Node
      1. NameNode
        • Master of the system
        • Maintains and manages the blocks present on the Data Node.
        • It doesn’t store the files themselves. It only stores metadata, locates the files and tracks changes.
        • These are up and running all the time.
        • There also exists a Secondary NameNode; despite the name, it is a checkpointing helper rather than a hot standby.
        • There are two components of a NameNode:

          • FSImage (File System Image): a point-in-time snapshot of the file system metadata.
          • Edit Logs: journals recording every change (write/delete) made to the namespace since the last snapshot.
        1. Secondary NameNode
          • The NameNode only stores metadata (FSImage) and edit logs, so in case it restarts, it has to replay the entire log to rebuild the namespace.
          • The Secondary NameNode pulls the current FSImage and Edit Logs from the Active NameNode.
          • It merges them, creating a new, up-to-date FSImage. This means the current snapshot is modified to accommodate the new changes presented in Edit logs.
          • The new FSImage is sent back to the primary NameNode, and the old edit log is cleared.
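The checkpoint merge can be sketched as follows (a toy model: the paths, block names and operations are made up, and real HDFS metadata is far richer):

```python
# Last snapshot of the namespace (path -> block list).
fsimage = {"/data/a.txt": ["blk_1"], "/data/b.txt": ["blk_2"]}

# Changes journaled since that snapshot.
edit_log = [
    ("create", "/data/c.txt", ["blk_3"]),
    ("delete", "/data/a.txt", None),
]

def checkpoint(image, edits):
    """Replay the edit log on top of the snapshot -> fresh FSImage."""
    new_image = dict(image)          # snapshot is copied, not mutated
    for op, path, blocks in edits:
        if op == "create":
            new_image[path] = blocks
        elif op == "delete":
            new_image.pop(path, None)
    return new_image                 # sent back; the edit log can be cleared

print(checkpoint(fsimage, edit_log))
# {'/data/b.txt': ['blk_2'], '/data/c.txt': ['blk_3']}
```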
      2. DataNode
        • These are the machines that actually store data and serve it to the client.
        • They send heartbeat signals to the NameNode about every 3 seconds to indicate health.
        • Responsible for serving read/write requests to the client.
  2. MapReduce
    • MapReduce is a programming model for cluster resource management and data processing.
    • It’s also known as Hadoop’s processing unit.
    • Map is for loading, parsing, transforming and filtering key-value pairs.
    • Reduce is for grouping and aggregating data produced by map.
    • There are various steps for MapReduce:
    1. Record Reader
      • You read documents, where the key is the name of the file and the value is the contents of the file.
      • For example, a <key, value> pair could be <input.txt, "the quick brown fox the">.
    2. Mapper
      • Mapper splits each word from the value, and now each word will be the key and the count of that word will be the value.
      • So the output of Mapper would be:

        <"the", 1>
        <"quick", 1>
        <"brown", 1>
        <"fox", 1>
        <"the", 1>
        

        Each occurrence is emitted separately, so if a word appears twice, there will be duplicate keys, and each key will have value 1.

    3. Combiner
      • This reads each key-value pair and for common keys, it combines the values as a collection.
      • So the output of Combiner would be:

        <"the", [1, 1]>
        <"quick", [1]>
        <"brown", [1]>
        <"fox", [1]>
        
    4. Reducer
      • This reduces a set of intermediate values (the values are now a collection), into a final value.
      • So the output of Reducer would be:

        <"the", 2>
        <"quick", 1>
        <"brown", 1>
        <"fox", 1>
        
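The four steps above can be sketched as a single in-process pipeline (illustrative Python, not the Hadoop Java API; the real thing runs distributed across nodes):

```python
from collections import defaultdict

def record_reader(files):
    for name, contents in files.items():
        yield name, contents                  # <filename, contents>

def mapper(_, contents):
    for word in contents.split():
        yield word, 1                         # <word, 1> per occurrence

def combiner(pairs):
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)           # <word, [1, 1, ...]>
    return grouped

def reducer(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

files = {"input.txt": "the quick brown fox the"}
pairs = [kv for name, text in record_reader(files)
         for kv in mapper(name, text)]
print(reducer(combiner(pairs)))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1}
```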
    5. Some Map-Reduce Terminology
      Job
       ├── Task (Map)
       │    ├── Task Attempt 1
       │    ├── Task Attempt 2
       │    └── Task Attempt 3
       ├── Task (Map)
       │    └── Task Attempt 1
       └── Task (Reduce)
            ├── Task Attempt 1
            └── Task Attempt 2
      
      1. Job
        • The problem statement i.e. what you want to run Map-Reduce on.
      2. Task
        • Smaller piece of a job
        • There’s the map task and the reduce task
      3. Task Attempt
        • It’s an execution attempt of a task.
        • For one task, there may be multiple task attempts, because of node failures or task crashes.
    6. Some applications of MapReduce
      1. Reverse Web-Link Graph
        • You have a bunch of tuples (a,b) where there’s a link from page a to b.
        • The output is that for each page we have to display the list of pages that link to it.

          For example, given (a,b), (c,b) and (d,b), the output will be

          <b, [a,c,d]>
          
        • Map will take (a,b) as input and output (b,a) (it simply swaps, so that we can group by b)

          map(a, b):
              emit(b, a)
          
        • Hadoop automatically does something called shuffle and sort phase, where all entries with the same key are grouped.
        • The reduce phase would simply output the list.

          reduce(b, list_of_a):
              emit(b, list_of_a)
          
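The pseudocode above can be run end-to-end in plain Python (the shuffle function here stands in for Hadoop's shuffle-and-sort phase):

```python
from collections import defaultdict

links = [("a", "b"), ("c", "b"), ("d", "b")]   # (source, target) tuples

def map_phase(edges):
    for src, dst in edges:
        yield dst, src                         # swap so we can group by target

def shuffle(pairs):
    # What Hadoop's shuffle-and-sort does: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {page: sorted(srcs) for page, srcs in grouped.items()}

print(reduce_phase(shuffle(map_phase(links))))  # {'b': ['a', 'c', 'd']}
```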
    7. Some things you CAN’T use Map-Reduce for
      • Real-time analytics because of high latency. MapReduce was designed for massive batch jobs.
      • Iterative algorithms because re-reading and writing data to HDFS is expensive.
      • A large number of small files, because
        • They can overwhelm the NameNode and create metadata overhead.
        • The shuffle and sort phase moves data all over the nodes consuming bandwidth.
      • Tasks which need Caching, because in MapReduce, the same dataset is re-read from the disk for every subsequent job.
    8. Daemons in Hadoop 1.x
      • NameNode (Master Node in HDFS)
      • Secondary NameNode (Helper)
      • DataNode (Slave Node in HDFS)
      • JobTracker (Master Node in MapReduce)
      • TaskTracker (Slave Node in MapReduce)

12.2.2. Hadoop 2.x

These are the components of a node using Hadoop 2.x.

  1. HDFS
    • Reliable Storage.
    • HDFS is still the file system of Hadoop 2.x.
    1. Standby NameNode
      • Along with primary and secondary NameNode, there also exists a Standby NameNode.
      • It acts as a passive backup to the active NameNode.
    2. Data Pipelining
      • It’s a multistep processing of moving large volumes of data across a distributed computing cluster.
  2. MapReduce
    • Data processing.
    • There is no longer any concept of JobTracker or TaskTracker.
  3. Yarn
    • Cluster Resource Management.
    • Yarn stands for Yet Another Resource Negotiator.
    • Yarn is called the operating system of Hadoop 2.x.
    • Yarn defines two resources:
      • v-cores: virtual cores (CPU power allocated to a container on a node)
      • memory: memory allocated to a container on a node
  4. Daemons in Hadoop 2.x
    • NameNode (Master node in HDFS)
    • Secondary NameNode (helper)
    • DataNode (slave node in HDFS)
    • Resource Manager (master node in YARN)
      • This keeps a running total of the cluster’s available resources.
    • NodeManager (slave node in YARN)
    • ApplicationMaster
      • This isn’t a daemon, but it’s a framework-specific library to manage the entire lifecycle of a job.
    • Container
      • These are the bundles of resources (memory, v-cores) allocated to a task on a single node of the cluster.
  5. How Yarn works
    • First the application talks to the ResourceManager for the cluster
    • In response, the ResourceManager first allocates one single container.
    • The ApplicationMaster starts running within that container.
    • The ApplicationMaster requests subsequent containers from ResourceManager.
    • Once all tasks are finished, the ApplicationMaster exits and the container is de-allocated.
  6. Key Features
    1. Scalability
    2. Flexibility
      • Supports way more processing models than just MapReduce, including interactive SQL and real-time streaming.
    3. Improved Utilization

12.3. HBase

  • HBase is an open-source, non-relational distributed database.
  • It’s a column oriented NoSQL database, that runs on top of HDFS.
HDFS                            HBase
Does batch processing           Does real-time processing
Sequential access of data       Random access of data
It's a file system              It's a column-oriented database (a data warehouse)

13. Apache Spark

13.1. What it is

  • Instead of just “map” and “reduce”, you define a large set of operations (transformations and actions).
  • It was started by Matei Zaharia.
  • It’s a unified computing engine and a set of libraries for parallel data processing on computer clusters.

13.2. Resilient Distributed Dataset

  • All of the data is held as an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
  • It’s an in-memory collection of items/data distributed across many compute nodes.
  • Spark core provides many APIs for building and manipulating these collections.
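The RDD idea can be mimicked in plain Python; the ToyRDD class below is an illustrative stand-in (not the real Spark API): transformations only record lineage on a new immutable object, and nothing executes until an action like collect() is called.

```python
class ToyRDD:
    def __init__(self, data, ops=()):
        self._data = tuple(data)     # immutable source (partitioning elided)
        self._ops = ops              # recorded transformations (the lineage)

    def map(self, fn):               # transformation: returns a NEW ToyRDD
        return ToyRDD(self._data, self._ops + (("map", fn),))

    def filter(self, fn):            # transformation: also lazy
        return ToyRDD(self._data, self._ops + (("filter", fn),))

    def collect(self):               # action: lineage is evaluated here
        items = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

Because the lineage is recorded rather than executed eagerly, a lost partition in real Spark can be recomputed from its lineage, which is where the fault tolerance comes from.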