Big Data
The world is one big data problem
Big Data is one of the most hyped terms in the market. Big Data was originally seen as a problem, but that problem has largely been solved, and today the term is associated with operations such as Big Data analysis and Big Data analytics. Big Data analytics draws on ML, DL and AI.
Why was Big Data considered a problem in the first place? That is what this story explores.
Definitions of Big Data:
- It can be defined as data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency. Characteristics of big data include high volume, high velocity and high variety.
- Big Data is a term used to describe the large amount of data in the networked, digitized, sensor-laden, information-driven world (NIST).
- Big data used to mean data that a single machine was unable to handle. Now big data has become a buzzword to mean anything related to data analytics or visualization.
How much data is Big Data?
As the world’s data has grown, we’re now talking about data in terms of zettabytes. How many zettabytes have been created so far? According to market intelligence company IDC, the ‘Global Datasphere’ in 2018 reached 18 zettabytes. This is the total of all data created, captured or replicated.
In 2020, it was projected that the overall amount of data created worldwide would reach 59 zettabytes, climbing rapidly into the future.
With more and more people and cities connected to the Internet, data sets keep growing in size. One report estimates that the total amount of digital data will reach 175 zettabytes by 2025.

How much data is 175 zettabytes, anyway? A single zettabyte is a trillion gigabytes. A modern smartphone stores about 32 gigabytes. To store 175 zettabytes, we would need 6 trillion smartphones (1000 smartphones for every living person!).
Byte < Kilobyte < Megabyte < Gigabyte < Terabyte < Petabyte < Exabyte < Zettabyte < Yottabyte
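For a rough sense of scale, here is a back-of-the-envelope calculation (the 32 GB phone capacity comes from the paragraph above; the rest is simple unit arithmetic):

```python
# Back-of-the-envelope: how many 32 GB smartphones would hold 175 zettabytes?
GB = 10**9          # bytes in a gigabyte
ZB = 10**21         # bytes in a zettabyte (a trillion gigabytes)

datasphere_2025 = 175 * ZB      # projected total digital data in 2025
phone_capacity = 32 * GB        # assumed storage of one smartphone

phones_needed = datasphere_2025 / phone_capacity
print(f"{phones_needed:.2e} smartphones")   # ~5.5e12, i.e. roughly 6 trillion
```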
That was about the sheer size of Big Data. How large is the largest single storage device?
- At 18 TB, the Seagate IronWolf Pro is the largest internal hard disk drive around.
- At 100 TB, the Nimbus Data is the largest solid-state drive right now.
Why is the largest storage device measured in terabytes rather than petabytes or exabytes?
It is because a single storage device faces the following challenges:
- Volume: Suppose we store Big Data on a single machine or storage device. If that device fails at some point, the entire system that depends on it for queries, analysis and analytics stops working; this is known as a single point of failure. A device can fail for many reasons, such as overheating. In data centers, hardware and disks fail every day, so assuming a storage device will never fail during operation is unrealistic. This is the volume challenge of Big Data.
- Velocity: An SSD takes about 56 minutes to read 10 TB, so a hypothetical petabyte-sized device would need roughly 5,600 minutes (nearly four days) to be read end to end; beyond a petabyte the numbers only get worse. Making a customer wait that long is the worst possible service. (A rough calculation is sketched right after this list.)
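Here is a rough sketch of how single-device read time scales, assuming the 10 TB in 56 minutes figure above and decimal units (1 PB = 1,000 TB); the throughput number is illustrative, not a benchmark:

```python
# Sequential read time on one device, assuming the same throughput
# as a 10 TB SSD that is read in 56 minutes (illustrative assumption).
MINUTES_PER_TB = 56 / 10

for capacity_tb, label in [(10, "10 TB SSD"),
                           (100, "100 TB SSD"),
                           (1_000, "1 PB device"),
                           (1_000_000, "1 EB device")]:
    minutes = capacity_tb * MINUTES_PER_TB
    print(f"{label:>12}: {minutes:,.0f} minutes (~{minutes / 1440:.1f} days)")
# 10 TB -> 56 min, 100 TB -> 560 min, 1 PB -> 5,600 min (~3.9 days), 1 EB -> ~3,889 days
```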
What tools are used to store Big Data and access it?
In modern technology there is no choice but to store Big Data across multiple disk drives, and the largest data stores must necessarily span thousands of them. So a big data store relies on "distributed storage." With distributed storage, instead of storing a large file sequentially, you split it into pieces and scatter those pieces across many disks. The illustration below shows a file split into pieces, sometimes called blocks, with those blocks distributed across multiple disks.
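In code form, the same splitting idea looks roughly like this (a conceptual sketch, not HDFS's actual implementation; the block size and disk names are assumptions for illustration only):

```python
# Sketch: split a file into fixed-size blocks and scatter them across disks.
BLOCK_SIZE = 128 * 1024 * 1024                      # 128 MB per block
DISKS = ["disk-0", "disk-1", "disk-2", "disk-3"]    # hypothetical disks

def split_into_blocks(path):
    """Yield (block_index, bytes) chunks of at most BLOCK_SIZE."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(BLOCK_SIZE):
            yield index, chunk
            index += 1

def place_blocks(path):
    """Assign each block to a disk round-robin; return {block_index: disk}."""
    return {index: DISKS[index % len(DISKS)]
            for index, _chunk in split_into_blocks(path)}

# place_blocks("big_file.bin") -> {0: 'disk-0', 1: 'disk-1', 2: 'disk-2', ...}
```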

The big data platform Apache Hadoop includes a file system called the Hadoop Distributed File System, or HDFS. In HDFS, a single block is usually 128 megabytes, so a one-gigabyte file consists of about 8 blocks and a one-terabyte file of roughly 8,000 blocks. Notice, though, that if one disk fails, then a part of your file is lost. This presents a problem for keeping the file system available. You can usually purchase a disk drive with a tested mean time to failure of 100,000 hours, or just over 11 years.
And this gives you a high degree of confidence if you use just one drive on one computer, making periodic backup copies, for a number of years. But what if you have 1,000 drives ganged together, and each one is needed to keep your files available? The mean time to failure is now about 100 hours, less than a week. With 10,000 drives, you can expect a drive failure roughly every 10 hours. So disk failure must be treated as an expected, common occurrence in big data systems. To keep files available, HDFS stores redundant copies of blocks on different disks. The accepted standard for redundancy in HDFS is 3 copies: if one disk fails, there are still 2 copies of each lost block available.
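That failure arithmetic and the 3-copy rule can be sketched as follows; this is a toy model that simply divides the quoted 100,000-hour MTTF by the number of drives, and picks replica disks at random for illustration:

```python
# Toy model: expected time between drive failures in a fleet,
# plus placing 3 replicas of a block on distinct disks.
import random

DRIVE_MTTF_HOURS = 100_000      # quoted mean time to failure per drive
REPLICATION_FACTOR = 3          # standard redundancy in HDFS

for fleet_size in (1, 1_000, 10_000):
    print(f"{fleet_size:>6} drives -> a failure roughly every "
          f"{DRIVE_MTTF_HOURS / fleet_size:,.0f} hours")

def choose_replica_disks(disks, k=REPLICATION_FACTOR):
    """Pick k distinct disks to hold copies of one block (illustrative)."""
    return random.sample(disks, k)

# choose_replica_disks([f"disk-{i}" for i in range(10)])
# -> e.g. ['disk-4', 'disk-0', 'disk-7']
```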
Velocity: It takes about 56 minutes to read 10 TB of data from an SSD, so reading a single 100 TB SSD would take about 560 minutes. With distributed storage, if we read 100 TB in parallel from ten 10 TB disks, the whole 100 TB can be read in roughly 56 minutes.
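A quick sketch of that speedup, under the simplifying assumption that the data is split evenly and all disks are read in parallel at the same per-disk throughput:

```python
# Idealized read time: total data split evenly across N disks read in parallel.
MINUTES_PER_TB = 56 / 10        # same throughput assumption as above

def read_time_minutes(total_tb, num_disks):
    return (total_tb / num_disks) * MINUTES_PER_TB

print(read_time_minutes(100, 1))    # 560.0 -> one 100 TB disk
print(read_time_minutes(100, 10))   #  56.0 -> ten 10 TB disks in parallel
```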
That's how we can store Big Data and access it using the distributed file storage concept.
Every day, we feed Facebook's data beast with mounds of information. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 status updates go up. That is a LOT of data, and to store it Facebook uses Hadoop.
Here are a few examples that show how Facebook uses its Big Data.
Example 1: The Flashback
Honoring its 10th anniversary, Facebook offered its users the option of viewing and sharing a video tracing their activity on the social network from the date of registration to the present. Called the "Flashback," this video is a collection of the photos and posts that received the most comments and likes, set to nostalgic background music.
Other videos have been created since then, including those you can view and share in celebrating a “Friendversary,” the anniversary of two people becoming friends on Facebook. You’ll also be able to see a special video on your birthday.

Example 2: I Voted
Facebook successfully tied political activity to user engagement with a social experiment: a sticker that let users declare "I Voted" on their profiles.
This experiment ran during the 2010 US midterm elections and appeared to work: users who noticed the button were more likely to vote, and to be vocal about voting, once they saw their friends participating. Out of a total of 61 million users, about 20% of those who saw friends voting also clicked the sticker.
Facebook's data science unit claimed that the stickers directly motivated close to 60,000 people to vote, and that social contagion prompted another 280,000 connected users to vote, for a total of 340,000 additional voters in the midterm elections.
For the 2016 elections, Facebook expanded its involvement in the voting process with reminders and directions to users' polling places.

Conclusion: Data science and data analysis are good careers, because the 21st century is the century of information.