hadoop-common-user mailing list archives

From "Bill Habermaas" <b...@habermaas.us>
Subject Re: Fundamental question
Date Sun, 09 May 2010 13:47:11 GMT
These questions are usually answered once you start using the system but 
I'll provide some quick answers.

1. Hadoop uses the local file system at each node to store blocks. The only 
part of the system that needs to be formatted is the namenode, which is where 
Hadoop keeps the logical HDFS filesystem image: the directory structure, the 
files, and the datanodes where their blocks reside. A file in HDFS is a 
sequence of blocks. If a file has a replication factor of 3 (the usual 
default), each block has 3 exact copies that reside on different datanodes. 
This is important to remember for your second question.
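To make the block/replica picture concrete, here is a small Python sketch of 
the bookkeeping described above. The block size, node names, and round-robin 
placement are illustrative assumptions on my part; real HDFS uses rack-aware 
placement, and it is the namenode that holds this block-to-datanode mapping.

```python
# Illustrative sketch of HDFS block splitting and replica placement.
# Not the real HDFS implementation -- sizes and placement policy are
# simplified assumptions.
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the Hadoop 0.20-era default
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin.
    Real HDFS is rack-aware; this is only an illustration."""
    placement = {}
    ring = itertools.cycle(datanodes)
    for block in range(num_blocks):
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file -> 4 blocks
print(blocks)
print(place_replicas(blocks, ["A", "B", "C"]))
```

With 3 datanodes and replication 3, every block ends up with a copy on every 
node, which is the situation the example in answer 2 assumes.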

2. The notion of processing locally is simply that map/reduce will process a 
file at different nodes by reading the blocks stored at each of those nodes. 
So if you have 3 copies of the same block at different nodes, the system can 
pick nodes where it can process those blocks locally. To process the entire 
file, map/reduce runs parallel tasks that process the blocks locally at each 
node. Once you have data in the HDFS cluster it is not necessary to move 
things around; the framework does that transparently. An example might help: 
say a file has blocks 1, 2, 3, and 4, which are replicated across 3 datanodes 
(A, B, C). Due to replication there is a copy of each block residing at each 
node. When the map/reduce job is started by the jobtracker, it begins a task 
at each node: A will process blocks 1 & 2, B will process block 3, and C will 
process block 4. All these tasks run in parallel, so if you are handling a 
terabyte+ file there is a big reduction in processing time. Each task writes 
its map/reduce output to a specific output directory (in this case 3 files), 
which can be used as input to the next map/reduce job.
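The locality idea above can be sketched the same way: given the 
block-to-replica map, assign each block to a node that already holds a copy. 
The least-loaded tie-breaking below is an assumption for illustration, not 
Hadoop's actual jobtracker scheduling logic.

```python
# Sketch of the jobtracker's data-locality idea: run each block's task
# on a node that already holds a replica, spreading tasks across nodes.
# This is an illustration, not Hadoop's real scheduler.
def schedule_local_tasks(replicas):
    """replicas: dict mapping block -> list of nodes holding a copy.
    Returns dict mapping node -> list of blocks it processes locally."""
    load = {}
    assignment = {}
    for block, nodes in sorted(replicas.items()):
        # Pick the least-loaded node among those holding this block.
        node = min(nodes, key=lambda n: load.get(n, 0))
        assignment.setdefault(node, []).append(block)
        load[node] = load.get(node, 0) + 1
    return assignment

# The example from above: blocks 1-4, each replicated on A, B, and C.
replicas = {b: ["A", "B", "C"] for b in [1, 2, 3, 4]}
print(schedule_local_tasks(replicas))
```

Every block is processed on a node where it already resides, so no block data 
has to cross the network for the map phase.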

I hope this brief answer is helpful and provides some insight.


----- Original Message ----- 
From: "Vijay Rao" <raovijay@gmail.com>
To: <common-user@hadoop.apache.org>
Sent: Sunday, May 09, 2010 2:49 AM
Subject: Fundamental question

> Hello,
> I am just reading and understanding Hadoop and all the other components.
> However I have a fundamental question for which I am not getting answers in
> any of the online material that is out there.
> 1) If Hadoop is used then all the slaves and other machines in the cluster
> need to be formatted to have the HDFS file system. If so what happens to the
> terabytes of data that need to be crunched? Or is the data on a different
> machine?
> 2) Everywhere it is mentioned that the main advantage of map/reduce and
> Hadoop is that it runs on data that is available locally. So does this mean
> that once the file system is formatted then I have to move my terabytes of
> data and split them across the cluster?
> Thanks
> VJ
