Tuesday, July 5, 2016

Introduction to Hadoop

We live in a Data World!
Data is pouring in from all kinds of sources, like a flood. Take your Facebook photo uploads: in earlier days the images were of lower quality, but now higher resolution and better pixel quality make every image larger. So, day by day, the amount of data keeps growing. This is what we call Big Data - an amount of data that is beyond the storage and processing capabilities of a single machine.
-> Volume: the data referred to as Big Data is of massive size.
-> Variety: the data comes in multiple varieties - structured, semi-structured, or unstructured - and may originate from a variety of sources.
-> Velocity: the rate at which data arrives into the system is really fast.
Example: Facebook uploads/posts across the world

Data Storage and Analysis: Now you need to store this data. Big Data cannot fit on a single disk, so there comes the Distributed File System, where the data is stored across multiple systems. Even so, copying 3 GB of data to a single disk is slow, because writing is limited by the disk's transfer speed.
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers built from commodity hardware, using simple programming models.
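To make the storage side concrete, here is a minimal sketch (not from the original post) of writing a file into HDFS through the Hadoop Java FileSystem API. The NameNode address and file path are assumed values that would normally come from your cluster's configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // The client sees one logical file; HDFS splits it into blocks
        // and spreads those blocks across the machines in the cluster.
        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, distributed storage!");
        }
    }
}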




1st problem -> If you are storing data across a distributed network of systems, there is a chance of data loss due to one issue or another. To keep the data safe, we may need to keep copies of it, e.g., 2 copies on 2 different disks. This is called replication of data.
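As a hedged sketch of how replication is typically configured in HDFS: the dfs.replication property sets the default number of copies kept per block (HDFS itself defaults to 3), and the FileSystem API can change it per file. The path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default number of copies kept for each block written by this client.
        conf.set("dfs.replication", "2");

        FileSystem fs = FileSystem.get(conf);

        // The replication factor can also be changed for an existing file.
        fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
    }
}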
Then how about processing the data?
That is where the analysis logic comes in.
Suppose you give a piece of code to 1 person to develop; he may take 6 months.
If you distribute the same work among 3 people, you can expect it to be done in less time, say 2 months.
In the same way, we can read the data from different disks in parallel, manipulate it, and then integrate the results to produce the final output.
2nd problem -> You are analyzing large datasets, and integrating the partial results again consumes a lot of network bandwidth.
Here comes HADOOP: Hadoop provides a reliable, shared storage and analysis system.
The storage is provided by HDFS, and the analysis/processing is done by MapReduce.

HADOOP -> HDFS (Storage) + MapReduce (Processing Data)


Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
1=> An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data.
2=> MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.
3=> Structured data means data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This can be stored using an RDBMS.
4=> What if the data is unstructured or semi-structured - sometimes an image, sometimes text, maps, etc.? How do we store and process such data?
MapReduce works well on unstructured or semistructured data, since it is designed to interpret the data at processing time.
5=> Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for MapReduce, since it makes reading a record a non-local operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well-suited to analysis with MapReduce, as sketched below.
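As an illustration (the class name is made up, and it assumes the client hostname is the first whitespace-separated field of each log line), a map function for such a log could look like the sketch below; the schema is applied only when the record is read, exactly as described above. A reduce function that sums the emitted 1s would then give the number of requests per client.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Turns each raw log line into (client hostname, 1).
public class LogHostMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text host = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            host.set(fields[0]);      // hostname repeated in full on every line
            context.write(host, ONE); // the "schema" is imposed here, at processing time
        }
    }
}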




Features of HDFS:

  • Highly fault-tolerant and designed to be deployed on low-cost hardware.
  • High-throughput access to application data.
  • Suitable for applications with large datasets.
  • Streaming access to file system data (see the sketch after this list).
  • Low cost.
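To illustrate the streaming-access point, here is a minimal sketch that opens a file in HDFS and streams its bytes to standard output; the path is a made-up example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Open the file and stream its bytes sequentially to stdout.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}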
What is MapReduce?
MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another.
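To make the key-value idea concrete, below is a sketch of the canonical word count example: the map function turns each line of input into (word, 1) pairs, and the reduce function sums the counts for each word. Class names are illustrative, and a small Job driver (not shown) would wire the two together.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (byte offset, line of text) -> (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

Because both functions only ever see key-value pairs, the same code runs unchanged whether the input is a few megabytes on one machine or terabytes spread across a cluster, which is what makes the model linearly scalable.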
