From: jason hadoop <jason.hadoop@gmail.com>
To: core-user@hadoop.apache.org
Date: Wed, 10 Jun 2009 23:13:24 -0700
Subject: Re: Hadoop streaming - No room for reduce task error

The reduce input may spill to disk during the sort: unless the machine/JVM has a huge allowed memory space, any data that does not fit in memory is written to local disk, so the sort needs at least that much free space on the node's partition. If I did my math correctly, you are trying to push ~2TB through the single reduce.

As for the part-XXXXX files: if you have the number of reduces set to zero, you will get N part files, where N is the number of map tasks. If you absolutely must have it all go through one reduce, you will need to increase the free disk space.

I think 0.19.1 supports compressing the map output, so you could try enabling compression there. If you have many nodes, you can set the number of reduces to some reasonable number and then use sort -m on the part files to merge-sort them, assuming your reduce preserves ordering. Try adding these parameters to your job line:

-D mapred.compress.map.output=true
-D mapred.output.compression.type=BLOCK

BTW, /bin/cat works fine as an identity mapper or an identity reducer.
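To make that concrete, the flags would slot into your command line roughly like this (a sketch, untested on 0.19.1; I believe the generic -D options have to come before the streaming-specific ones):

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
    -D mapred.compress.map.output=true \
    -D mapred.output.compression.type=BLOCK \
    -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
    -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
    -input /logs/*.log -output test9

And if one big local file is the real goal, the merge can happen outside Hadoop entirely. Assuming the output landed in the default /user/hadoop/test9 (that path is a guess based on your command line), something like

# pull the part files local, then merge-sort them (each part is already sorted)
./hadoop fs -get test9 ./test9-parts
sort -m ./test9-parts/part-* > merged.out

keeps sorted order across the part files, while

# plain concatenation of all part files into one local file
./hadoop fs -getmerge test9 merged.out

just concatenates them, which is all you need since ordering doesn't matter in your case.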
On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon wrote:

> Hey Scott,
>
> It turns out that Alex's answer was mistaken - your error is actually
> coming from lack of disk space on the TT that has been assigned the
> reduce task. Specifically, there is not enough space in
> mapred.local.dir. You'll need to change your mapred.local.dir to point
> to a partition that has enough space to contain your reduce output.
>
> As for why this is the case, I hope someone will pipe up. It seems to
> me that reduce output can go directly to the target filesystem without
> using space on mapred.local.dir.
>
> Thanks
> -Todd
>
> On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard wrote:
>
> > What is mapred.child.ulimit set to? This configuration option
> > specifies how much memory child processes are allowed to have. You
> > may want to up this limit and see what happens.
> >
> > Let me know if that doesn't get you anywhere.
> >
> > Alex
> >
> > On Wed, Jun 10, 2009 at 9:40 AM, Scott wrote:
> >
> > > Complete newbie map/reduce question here. I am using Hadoop
> > > streaming as I come from a Perl background, and am trying to
> > > prototype/test a process to load and clean up ad server log lines
> > > from multiple input files into one large file on HDFS that can then
> > > be used as the source of a Hive db table. I have a Perl map script
> > > that reads an input line from stdin, does the needed
> > > cleanup/manipulation, and writes back to stdout. I don't really
> > > need a reduce step, as I don't care what order the lines are
> > > written in, and there is no summary data to produce. When I run
> > > the job with -reducer NONE I get valid output, but I get multiple
> > > part-xxxxx files rather than one big file.
> > >
> > > So I wrote a trivial 'reduce' script that reads from stdin, splits
> > > the key/value, and writes the value back to stdout.
> > >
> > > I am executing the code as follows:
> > >
> > > ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper
> > > "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer
> > > "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input
> > > /logs/*.log -output test9
> > >
> > > The code I have works when given a small set of input files.
> > > However, I get the following error when attempting to run the code
> > > on a large set of input files:
> > >
> > > hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09
> > > 15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room
> > > for reduce task. Node
> > > tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has
> > > 2004049920 bytes free; but we expect reduce input to take
> > > 22138478392
> > >
> > > I assume this is because all the map output is being buffered in
> > > memory prior to running the reduce step? If so, what can I change
> > > to stop the buffering? I just need the map output to go directly
> > > to one large file.
> > >
> > > Thanks,
> > > Scott
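P.S. On Todd's mapred.local.dir point: that setting lives in hadoop-site.xml on each tasktracker, and the tasktracker needs a restart to pick up the change. A minimal sketch, with /mnt/bigdisk standing in for whatever large partition is actually available:

<!-- in hadoop-site.xml on each tasktracker -->
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/bigdisk/mapred/local</value>
</property>

The value can also be a comma-separated list of directories, which spreads the map spills and reduce-side merge files across several disks.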
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com: a community for Hadoop professionals