hadoop-hdfs-user mailing list archives

From: Sean Bigdatafun <sean.bigdata...@gmail.com>
Subject: Who actually does the split computation?
Date: Wed, 09 Feb 2011 21:09:09 GMT
http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
"Computes the input splits for the job. If the splits cannot be computed,
because the input paths don’t exist, for example, then the job is not
submitted and an error is thrown to the MapReduce program.

Copies the resources needed to run the job, including the job JAR file, the
configuration file and the computed input splits, to the jobtracker’s
filesystem in a directory named after the job ID. The job JAR is copied with
a high replication factor (controlled by the mapred.submit.replication
property,
which defaults to 10) so that there are lots of copies across the cluster
for the tasktrackers to access when they run tasks for the job (step 3)."
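
As an aside, if I read this correctly, that replication factor is just an ordinary job property, so a client should be able to override it before submission. A minimal sketch, assuming the property name from the quote (I have not verified the effect myself):

    import org.apache.hadoop.mapred.JobConf;

    public class SubmitReplication {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Default is 10 so the job JAR is spread widely for the
        // tasktrackers; a small cluster might want fewer copies.
        conf.setInt("mapred.submit.replication", 3);
        System.out.println(conf.getInt("mapred.submit.replication", 10));
      }
    }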

1. My first question: who is responsible for computing the input splits? Is it
the jobclient's work or the jobtracker's work? From the statement above it
sounds like the jobclient's work, but I do not understand how the jobclient is
able to compute this information, because it does not seem to hold enough to do
so. To compute the input splits, the party must at least know how many blocks
the target input spans, AFAIK, but the jobclient does not seem to have that
information.
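
To make the question concrete, here is a minimal sketch of what I imagine the
jobclient does (old mapred API; the class name SplitDemo is mine). If my
reading is right, getSplits() runs entirely client-side and gets the block
information it needs from the NameNode, which serves block locations to any
HDFS client:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SplitDemo {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitDemo.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        TextInputFormat in = new TextInputFormat();
        in.configure(conf);
        // Runs in this client JVM: lists the input files and, as far
        // as I can tell, asks the NameNode for their block locations.
        InputSplit[] splits = in.getSplits(conf, 1);
        for (InputSplit s : splits) {
          System.out.println(s);  // FileSplits print as path:start+length
        }
      }
    }

If that is correct, the jobclient does hold enough information after all,
since block locations are not private to the jobtracker.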

Here is my understanding of splits, using an example: a 256MB file stored in 4
blocks in HDFS can be split into 4 splits if it is the target input for the MR
job. Is the minimal split a block, or can a split be smaller than that? How
exactly is the split size computed?
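
From my reading of FileInputFormat in the new mapreduce API (the old mapred
API works from a "goal size" derived from the requested number of maps instead
of a max), the per-file split size comes down to one line. A sketch of
computeSplitSize() as I understand it:

    // Defaults: minSize = 1, maxSize = Long.MAX_VALUE,
    // so by default splitSize == blockSize.
    static long splitSize(long blockSize, long minSize, long maxSize) {
      return Math.max(minSize, Math.min(maxSize, blockSize));
    }

If that is right, my 256MB file in four 64MB blocks becomes four 64MB splits
by default; lowering the max split size would give splits smaller than a
block, and raising the min would give splits that span blocks (at the cost of
non-local reads). I would appreciate confirmation.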


-- 
--Sean
