hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "ImportantConcepts" by TedDunning
Date Fri, 20 Jul 2007 02:52:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by TedDunning:
http://wiki.apache.org/lucene-hadoop/ImportantConcepts

New page:
Most of the documentation on Hadoop assumes that you know some of the basic concepts and procedures
involved in writing and running jobs in Hadoop.  This can be very confusing at first because
many of these concepts won't be things that you will have picked up unless you have used a
system like Hadoop in the past.

Some notable terms that may confuse you:

* Hadoop - Hadoop itself refers to the overall system that runs jobs, distributes tasks (pieces
of these jobs) and stores data in a parallel and distributed fashion.

* Job -  In hadoop, the combination of all of the jars and classes needed to run a map/reduce
program is called a job.  All of these components are themselves collected into a jar which
is usually referred to as a job file.  To execute a job, you normally will use the command:

    hadoop jar your-job-file-goes-here.jar

This assumes that your job file has a main class that is defined as if it were executable
from the command line and that this main class defines a JobConf data structure that is used
to carry all of the configuration information about your program around.  The wordcount example
shows how a typical map/reduce program is written.  Be warned, however, that the wordcount
program is not usually run directly, but instead there is a single example driver program
that provides a main method that then calls the wordcount main method itself.  This added
complexity decreases the number of jars involved in the example structure, but doesn't really
serve any other purpose.

* Task - Whereas a job describes all of the inputs, outputs, classes and libraries used in
a map/reduce program, a task is the program that executes the individual map and reduce steps.

* HDFS - stands for Hadoop Distributed File System.  This is how input and output files of
Hadoop programs are normally stored.  The major advantage of HDFS are that it provides very
high input and output speeds.  This is critical for good performance for highly parallel programs
since as the number of processors involved in working on a problem increases, the overall
demand for input data increases as does the overall rate that output is produced.  HDFS provides
very high bandwidth by storing chunks of files scattered throughout the Hadoop cluster.  By
clever choice of where individual tasks are run and because files are stored in multiple places,
tasks are placed near their input data and output data is largely stored where it is created.
   

Mime
View raw message