From: Todd Lipcon
To: core-user@hadoop.apache.org
Date: Mon, 11 May 2009 09:08:09 -0700
Subject: Re: sub 60 second performance

In addition to Jason's suggestion, you could also see about setting some of Hadoop's
directories to subdirs of /dev/shm. If the dataset is really small, it should be easy to re-load it onto the cluster if it's lost, so even putting dfs.data.dir in /dev/shm might be worth trying. You'll probably also want mapred.local.dir in /dev/shm.

Note that if in fact you don't have enough RAM to do this, you'll start swapping and your performance will suck like crazy :)

That said, you may find that even with all storage in RAM your jobs are still too slow. Hadoop isn't optimized for this kind of small-job performance quite yet. You may find that task setup time dominates the job. I think it's entirely reasonable to shoot for sub-60-second jobs down the road, and I'd find it interesting to hear what the results are now. Hope you report back!

-Todd

On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer wrote:
> Hi,
>
> I am trying to do 'on demand map reduce' - something which will return in
> reasonable time (a few seconds).
>
> My dataset is relatively small and can fit into my datanode's memory. Is it
> possible to keep a block in the datanode's memory so on the next job the
> response will be much quicker? The majority of the time spent during the
> job run appears to be during the 'HDFS_BYTES_READ' part of the job. I have
> tried using the setNumTasksToExecutePerJvm but the block still seems to be
> cleared from memory after the job.
>
> thanks!
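[Archive note: a sketch of what the tmpfs setup described above might look like in hadoop-site.xml for a Hadoop 0.18-0.20 era cluster. The /dev/shm paths are illustrative placeholders, and mapred.job.reuse.jvm.num.tasks is the config property behind JobConf.setNumTasksToExecutePerJvm; treat this as a sketch, not a tested configuration.]

```xml
<!-- hadoop-site.xml: point HDFS block storage and MapReduce scratch
     space at a tmpfs mount. Anything under /dev/shm is lost on reboot,
     so only do this if the dataset is trivial to re-load. -->
<property>
  <name>dfs.data.dir</name>
  <value>/dev/shm/hadoop/dfs/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/dev/shm/hadoop/mapred/local</value>
</property>
<!-- Reuse task JVMs across tasks of a job (-1 = unlimited reuse);
     equivalent to calling JobConf.setNumTasksToExecutePerJvm(-1). -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

Note that JVM reuse only avoids task-JVM startup cost within a single job; it does not pin HDFS blocks in memory between jobs, which matches the behavior Matt reported.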