hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: sub 60 second performance
Date Mon, 11 May 2009 16:08:09 GMT
In addition to Jason's suggestion, you could also see about setting some of
Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
it should be easy to re-load it onto the cluster if it's lost, so even
putting dfs.data.dir in /dev/shm might be worth trying.
You'll probably also want mapred.local.dir in /dev/shm.
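
As a rough sketch of what those two settings could look like, here is a small Python script that prints the property entries for the site configuration file. The subdirectory names under /dev/shm and the helper name hadoop_site_xml are made up for illustration; only the property names dfs.data.dir and mapred.local.dir come from the suggestion above.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(props):
    """Render a dict of property names and values as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumption: these subdirectory names are hypothetical; any subdir of
# /dev/shm that the Hadoop user can write to would do.
RAM_DIRS = {
    "dfs.data.dir": "/dev/shm/hadoop/dfs/data",
    "mapred.local.dir": "/dev/shm/hadoop/mapred/local",
}

xml_text = hadoop_site_xml(RAM_DIRS)
print(xml_text)
```

The printed properties would be merged into the existing site configuration on each node. Remember that anything under /dev/shm is gone after a reboot, so the data has to be re-loaded.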

Note that if in fact you don't have enough RAM to do this, you'll start
swapping and your performance will suck like crazy :)
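
One way to sanity-check this before moving anything is to compare the dataset size against the free space reported for the RAM-backed mount. This is only a sketch: the function name fits_in_ram_disk, the 50% headroom, and the 2 GiB example size are my own assumptions, and free space on a tmpfs mount is not the same thing as free RAM, so treat the answer as a hint rather than a guarantee.

```python
import os
import shutil


def fits_in_ram_disk(dataset_bytes, mount="/dev/shm", headroom=0.5):
    """Return True if dataset_bytes fits within a fraction of the mount's free space.

    headroom is the fraction of free space we allow ourselves to use
    (an arbitrary safety margin, not a Hadoop setting).
    """
    usage = shutil.disk_usage(mount)
    return dataset_bytes <= usage.free * headroom


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    # Hypothetical dataset size of 2 GiB; substitute the real figure,
    # multiplied by the replication factor if dfs.data.dir lives here too.
    print(fits_in_ram_disk(2 * 1024**3))
```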

That said, you may find that even with all storage in RAM your jobs are
still too slow. Hadoop isn't optimized for this kind of small-job
performance quite yet. You may find that task setup time dominates the job.
I think it's entirely reasonable to shoot for sub-60-second jobs down the
road, and I'd find it interesting to hear what the results are now. Hope you
report back!


On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer <mattbowyers@googlemail.com> wrote:

> Hi,
> I am trying to do 'on demand map reduce' - something which will return in
> reasonable time (a few seconds).
> My dataset is relatively small and can fit into my datanode's memory. Is it
> possible to keep a block in the datanode's memory so on the next job the
> response will be much quicker? The majority of the time spent during the
> job
> run appears to be during the 'HDFS_BYTES_READ' part of the job. I have
> tried
> using the setNumTasksToExecutePerJvm but the block still seems to be
> cleared
> from memory after the job.
> thanks!
