hadoop-common-user mailing list archives

From "Mahajan, Neeraj" <nemaha...@ebay.com>
Subject RE: Setting number of Maps
Date Tue, 03 Jul 2007 16:45:41 GMT
I am a novice at Hadoop, so please consider my response with caution :).
The split size seems to be calculated as follows:

    split size = max(mapred.min.split.size,
                     min(total size of all input files / requested number of splits,
                         file system block size))

The value of the file system block size might be the cause of your
situation.
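
As a rough sketch (illustrative names only, not the actual Hadoop
source), that calculation looks something like this:

// Mirrors the split-size formula quoted above; variable names are
// illustrative and the real FileInputFormat code has more going on.
static long computeSplitSize(long minSplitSize, long totalInputSize,
                             int requestedSplits, long blockSize) {
    long goalSize = totalInputSize / requestedSplits;  // what setNumMapTasks() asks for
    return Math.max(minSplitSize, Math.min(goalSize, blockSize));
}

Note that setNumMapTasks() only feeds into the goal size here, so the
minimum split size and the block size can still override the number of
maps you asked for.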

I had a (possibly) similar situation and I ended up extending
FileInputFormat. You can override the getSplits() method (and possibly
isSplitable(), getRecordReader(), etc.). The formula above is from
FileInputFormat.getSplits().
You can then call
conf.setInputFormat(<your class>)
before submitting the job.
In your case, you might want to extend SequenceFileInputFormat, which
itself extends FileInputFormat.
When Hadoop wants to create the splits, your code can then specify them.
If you can predict the exact number of bytes each complex number
occupies, splitting shouldn't be difficult; if the split boundaries can
only be determined while reading data from the file, it will be harder
to produce one split per number.

I know it is good to split a job into as many smaller parts as possible,
but in my opinion too much granularity can result in excessive overhead
in splitting and then combining the results. You might want to
experiment with the granularity to find the best value.

~ Neeraj


-----Original Message-----
From: Oliver Haggarty [mailto:ojh06@doc.ic.ac.uk] 
Sent: Tuesday, July 03, 2007 8:45 AM
To: hadoop-user@lucene.apache.org
Subject: Setting number of Maps

Hi,

I'm writing a MapReduce task that will take a load of complex numbers,
do some processing on each, then return a double. As this processing
will be complex and could take up to 10 minutes, I am using Hadoop to
distribute it amongst many machines.

So ideally I want a new map task for each complex number, to spread the
load most efficiently. A typical run might have as many as 7500 complex
numbers that need processing. I will eventually have access to a cluster
of approximately 500 machines.

So far, the only way I can get one map task per complex number is to
create a new SequenceFile for each number in the input directory. This
takes a while, though, and I was hoping I could just create a single
SequenceFile holding all the complex numbers and then use
JobConf.setNumMapTasks(n) to get one map task per number in the file.
This doesn't work, though, and I end up with approximately 60-70 complex
numbers per map task (depending on the total number of input numbers).

Does anyone have any idea why this second method doesn't work? If it is
not supposed to work in this way are there any suggestions as to how to
get a map per input record without having to put each one in a separate
file?

Thanks in advance for any help,

Ollie
