From: Todd Lipcon
To: core-user@hadoop.apache.org
Date: Mon, 11 May 2009 09:08:09 -0700
Subject: Re: sub 60 second performance

In addition to Jason's suggestion, you could also see about setting some of Hadoop's
directories to subdirs of /dev/shm. If the dataset is really small, it should be easy to re-load it onto the cluster if it's lost, so even putting dfs.data.dir in /dev/shm might be worth trying. You'll probably also want mapred.local.dir in /dev/shm.

Note that if in fact you don't have enough RAM to do this, you'll start swapping and your performance will suck like crazy :)

That said, you may find that even with all storage in RAM your jobs are still too slow. Hadoop isn't optimized for this kind of small-job performance quite yet. You may find that task setup time dominates the job. I think it's entirely reasonable to shoot for sub-60-second jobs down the road, and I'd find it interesting to hear what the results are now. Hope you report back!

-Todd

On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer wrote:
> Hi,
>
> I am trying to do 'on demand map reduce' - something which will return in
> reasonable time (a few seconds).
>
> My dataset is relatively small and can fit into my datanode's memory. Is it
> possible to keep a block in the datanode's memory so on the next job the
> response will be much quicker? The majority of the time spent during the
> job run appears to be during the 'HDFS_BYTES_READ' part of the job. I have
> tried using the setNumTasksToExecutePerJvm but the block still seems to be
> cleared from memory after the job.
>
> thanks!
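[Archive note: a sketch of what the tmpfs setup described above might look like in hadoop-site.xml for a Hadoop 0.18-0.20 era cluster. The /dev/shm paths are illustrative placeholders, and mapred.job.reuse.jvm.num.tasks is the config property behind JobConf.setNumTasksToExecutePerJvm; treat this as a sketch, not a tested configuration.]

```xml
<!-- hadoop-site.xml: point HDFS block storage and MapReduce scratch
     space at a tmpfs mount. Anything under /dev/shm is lost on reboot,
     so only do this if the dataset is trivial to re-load. -->
<property>
  <name>dfs.data.dir</name>
  <value>/dev/shm/hadoop/dfs/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/dev/shm/hadoop/mapred/local</value>
</property>
<!-- Reuse task JVMs across tasks of a job (-1 = unlimited reuse);
     equivalent to calling JobConf.setNumTasksToExecutePerJvm(-1). -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

Note that JVM reuse only avoids task-JVM startup cost within a single job; it does not pin HDFS blocks in memory between jobs, which matches the behavior Matt reported.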