Tuesday, July 5, 2016

History of Hadoop


Hadoop was derived from Google’s MapReduce and Google File System (GFS) papers. Hadoop was created by Doug Cutting and Michael J. Cafarella (Cutting later joined Yahoo!, where much of the early development took place) and was named after Doug’s son’s toy elephant.

Oct 2003 – Google GFS paper published
Dec 2004 – Google MapReduce paper published
July 2005 – Nutch uses MapReduce
Feb 2006 – Starts as a Lucene subproject
Apr 2007 – Yahoo! on 1000-node cluster
Jan 2008 – An Apache Top Level Project
Jul 2008 – A 4000 node test cluster
May 2009 – Hadoop sorts a petabyte in 17 hours



Design of HDFS:
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
1. Very large files: “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
2. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset matters more than the latency in reading the first record (see the read sketch after this list).
3. Commodity hardware: Hadoop doesn’t require expensive, highly reliable hardware to run on. It is designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
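
To make the streaming, whole-dataset access pattern concrete, below is a minimal sketch that reads a file sequentially through Hadoop’s FileSystem API. It assumes fs.defaultFS is already configured for the cluster (e.g. in core-site.xml); the path /data/logs/2016-07-05.log is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/logs/2016-07-05.log"); // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            long lines = 0;
            // Scan the whole file front to back: throughput matters here,
            // not the time to fetch the first record.
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println("Lines read: " + lines);
        }
    }
}

The same pattern is what MapReduce itself relies on: each map task streams through an entire input split rather than seeking around inside it.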

HDFS is not a good fit for the following cases:
1. Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access (see the point-read sketch below).
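
By contrast, the kind of access HBase is built for is a low-latency point read of a single row. A minimal sketch with the HBase client API follows; the table name user_profiles, row key user#42, and column info:email are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PointRead {
    public static void main(String[] args) throws Exception {
        // Reads the HBase/ZooKeeper settings from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profiles"))) {
            Get get = new Get(Bytes.toBytes("user#42"));              // single-row lookup
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"),     // column family
                                           Bytes.toBytes("email"));   // column qualifier
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}

A lookup like this fetches one row by key, typically in milliseconds, whereas HDFS alone would have to stream through file data to find the same record.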
2. Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, one million files, each small enough to fit in a single block, amount to roughly two million objects (one file plus one block each), which at 150 bytes apiece needs at least 300 MB of memory (see the estimate sketched below). While storing millions of files is feasible, billions is beyond the capability of current hardware.
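
The same back-of-the-envelope estimate, written out as a small calculation; the one-million-file figure is the hypothetical example from the text.

public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        // Rule of thumb: each file, directory, and block costs roughly
        // 150 bytes of namenode heap.
        final long BYTES_PER_OBJECT = 150L;

        long files  = 1000000L;   // one million files...
        long blocks = files;      // ...each small enough to occupy a single block
        long dirs   = 0L;         // directories ignored in this rough estimate

        long objects = files + blocks + dirs;        // 2,000,000 namespace objects
        long bytes   = objects * BYTES_PER_OBJECT;   // 300,000,000 bytes
        System.out.printf("~%.0f MB of namenode memory%n", bytes / 1e6);  // ~300 MB
    }
}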
3. Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file (a minimal single-writer sketch follows).
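
A minimal sketch of the single-writer, append-only model through the FileSystem API: the file is created once and the only later change is an append at the end (and appending only works where the cluster supports and enables it). The path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events/events.log"); // hypothetical path

        // Write once: a single writer streams records, always at the end of the file.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("event-1\n");
            out.writeBytes("event-2\n");
        }

        // The only later modification HDFS allows is appending more data at the end;
        // there is no API for rewriting bytes at an arbitrary offset.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("event-3\n");
        }
    }
}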
