hadoop-common-commits mailing list archives

From: Apache Wiki <wikidi...@apache.org>
Subject: [Lucene-hadoop Wiki] Update of "FrontPage" by SameerParanjpye
Date: Tue, 22 Aug 2006 20:20:08 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by SameerParanjpye:
http://wiki.apache.org/lucene-hadoop/FrontPage

The comment on the change is:
Added Map/Reduce intro to front page

------------------------------------------------------------------------------
- = Hadoop =
+ = Introduction =
  
  [http://lucene.apache.org/hadoop/ Hadoop] is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named [:HadoopMapReduce: Map/Reduce], where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
  
@@ -10, +10 @@

  
  Hadoop was originally built as infrastructure for the [http://lucene.apache.org/nutch/ Nutch] project, which crawls the web and builds a search engine index for the crawled pages. Both Hadoop and Nutch are part of the [http://lucene.apache.org/java/docs/index.html Lucene] [http://www.apache.org/ Apache] project.
  
+ == Hadoop Map/Reduce ==
+ 
+ === Programming model and execution framework ===
+ 
+ Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user-defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a ''map'' phase and a ''reduce'' phase. The input to the computation is a data set of key/value pairs.
+ 
+ In the map phase, the framework splits the input data set into a large number of fragments and assigns each fragment to a ''map task''. The framework also distributes the many map tasks across the cluster of nodes on which it operates. Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs. For each input key/value pair ''(K,V)'', the map task invokes a user-defined ''map function'' that transforms the input into a different key/value pair ''(K',V')''.
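+ 
+ For example, a word-count map function emits a ''(word, 1)'' pair for every word in its input value. Below is a minimal sketch in the style of the classic WordCount example; it assumes the org.apache.hadoop.mapred interfaces, whose exact signatures have varied between Hadoop releases:
+ 
+ {{{
+ import java.io.IOException;
+ import java.util.StringTokenizer;
+ 
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hadoop.io.LongWritable;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.*;
+ 
+ public class WordCountMapper extends MapReduceBase
+     implements Mapper<LongWritable, Text, Text, IntWritable> {
+   private static final IntWritable ONE = new IntWritable(1);
+ 
+   // For each input (offset, line) pair, emit one (word, 1) pair per word.
+   public void map(LongWritable key, Text value,
+                   OutputCollector<Text, IntWritable> output,
+                   Reporter reporter) throws IOException {
+     StringTokenizer tokens = new StringTokenizer(value.toString());
+     while (tokens.hasMoreTokens()) {
+       output.collect(new Text(tokens.nextToken()), ONE);
+     }
+   }
+ }
+ }}}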
+ 
+ Following the map phase, the framework sorts the intermediate data set by key and produces a set of ''(K',V'*)'' tuples so that all the values associated with a particular key appear together. It also partitions the set of tuples into a number of fragments equal to the number of reduce tasks.
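+ 
+ Which reduce task a given key is sent to is decided by a partition function. As a rough sketch, Hadoop's default hash partitioning amounts to the following (the helper name here is illustrative):
+ 
+ {{{
+ // Keys with the same hash always land in the same fragment, so all
+ // values for a given key reach the same reduce task.
+ int partitionFor(Object key, int numReduceTasks) {
+   return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
+ }
+ }}}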
+ 
+ In the reduce phase, each ''reduce task'' consumes the fragment of ''(K',V'*)'' tuples assigned to it. For each such tuple it invokes a user-defined ''reduce function'' that transforms the tuple into an output key/value pair ''(K,V)''. Once again, the framework distributes the many reduce tasks across the cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each reduce task.
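+ 
+ Continuing the word-count example, the reduce function sums the counts collected for each word. Again a sketch against the org.apache.hadoop.mapred interfaces, with the same caveat that signatures vary between releases:
+ 
+ {{{
+ import java.io.IOException;
+ import java.util.Iterator;
+ 
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.*;
+ 
+ public class WordCountReducer extends MapReduceBase
+     implements Reducer<Text, IntWritable, Text, IntWritable> {
+   // For each (word, [1, 1, ...]) tuple, emit a single (word, total) pair.
+   public void reduce(Text key, Iterator<IntWritable> values,
+                      OutputCollector<Text, IntWritable> output,
+                      Reporter reporter) throws IOException {
+     int sum = 0;
+     while (values.hasNext()) {
+       sum += values.next().get();
+     }
+     output.collect(key, new IntWritable(sum));
+   }
+ }
+ }}}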
+ 
+ Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.
+ 
+ === Architecture ===
+ 
+ The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server or ''jobtracker'' and several slave servers or ''tasktrackers'', one per node in the cluster. The jobtracker is the point of interaction between users and the framework. Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come, first-served basis. The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
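+ 
+ Users typically reach the jobtracker through a job client rather than directly. Here is a sketch of a driver that wires up the mapper and reducer above and submits the job, again assuming the classic org.apache.hadoop.mapred API (the input and output paths are illustrative):
+ 
+ {{{
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.*;
+ 
+ public class WordCount {
+   public static void main(String[] args) throws Exception {
+     JobConf conf = new JobConf(WordCount.class);
+     conf.setJobName("wordcount");
+     conf.setOutputKeyClass(Text.class);
+     conf.setOutputValueClass(IntWritable.class);
+     conf.setMapperClass(WordCountMapper.class);
+     conf.setReducerClass(WordCountReducer.class);
+     FileInputFormat.setInputPaths(conf, new Path(args[0]));
+     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
+     // Submits the job to the jobtracker's queue and waits for completion.
+     JobClient.runJob(conf);
+   }
+ }
+ }}}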
+ 
  == General Information ==
   * [http://lucene.apache.org/hadoop/ Hadoop Website ]
-  * HadoopOverview
+  * [:HadoopOverview: Hadoop Code Overview]
   * Self:FAQ
   * HadoopMapReduce
  
