From: jason hadoop <jason.hadoop@gmail.com>
To: core-user@hadoop.apache.org
Date: Wed, 10 Jun 2009 23:13:24 -0700
Subject: Re: Hadoop streaming - No room for reduce task error

The reduce input may spill to disk during the sort: unless the machine/JVM has a huge allowed memory space, any data that does not fit in memory is written to local disk, so the sort needs at least that much free space on the node's partition. If I did my math correctly, you are trying to push ~2TB through the single reduce.

As for the part-XXXXX files: if you have the number of reduces set to zero, you will get N part files, where N is the number of map tasks. If you absolutely must have it all go through one reduce, you will need to increase the free disk space.

I think 0.19.1 supports compressing the map output, so you could try enabling compression there. If you have many nodes, you can set the number of reduces to some reasonable number and then use sort -m on the part files to merge-sort them, assuming your reduce preserves ordering. Try adding these parameters to your job line:

-D mapred.compress.map.output=true
-D mapred.output.compression.type=BLOCK

BTW, /bin/cat works fine as an identity mapper or an identity reducer.
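To make that concrete, the flags would slot into your command line roughly like this (a sketch, untested on 0.19.1; I believe the generic -D options have to come before the streaming-specific ones):

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
    -D mapred.compress.map.output=true \
    -D mapred.output.compression.type=BLOCK \
    -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
    -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
    -input /logs/*.log -output test9

And if one big local file is the real goal, the merge can happen outside Hadoop entirely. Assuming the output landed in the default /user/hadoop/test9 (that path is a guess based on your command line), something like

# pull the part files local, then merge-sort them (each part is already sorted)
./hadoop fs -get test9 ./test9-parts
sort -m ./test9-parts/part-* > merged.out

keeps sorted order across the part files, while

# plain concatenation of all part files into one local file
./hadoop fs -getmerge test9 merged.out

just concatenates them, which is all you need since ordering doesn't matter in your case.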
On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon wrote:

> Hey Scott,
>
> It turns out that Alex's answer was mistaken - your error is actually
> coming from lack of disk space on the TT that has been assigned the
> reduce task. Specifically, there is not enough space in
> mapred.local.dir. You'll need to change your mapred.local.dir to point
> to a partition that has enough space to contain your reduce output.
>
> As for why this is the case, I hope someone will pipe up. It seems to
> me that reduce output can go directly to the target filesystem without
> using space on mapred.local.dir.
>
> Thanks
> -Todd
>
> On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard wrote:
>
> > What is mapred.child.ulimit set to? This configuration option
> > specifies how much memory child processes are allowed to have. You
> > may want to up this limit and see what happens.
> >
> > Let me know if that doesn't get you anywhere.
> >
> > Alex
> >
> > On Wed, Jun 10, 2009 at 9:40 AM, Scott wrote:
> >
> > > Complete newbie map/reduce question here. I am using Hadoop
> > > streaming as I come from a Perl background, and am trying to
> > > prototype/test a process to load and clean up ad server log lines
> > > from multiple input files into one large file on HDFS that can then
> > > be used as the source of a Hive db table. I have a Perl map script
> > > that reads an input line from stdin, does the needed
> > > cleanup/manipulation, and writes back to stdout. I don't really
> > > need a reduce step, as I don't care what order the lines are
> > > written in, and there is no summary data to produce. When I run
> > > the job with -reducer NONE I get valid output, but I get multiple
> > > part-xxxxx files rather than one big file.
> > >
> > > So I wrote a trivial 'reduce' script that reads from stdin, splits
> > > the key/value, and writes the value back to stdout.
> > >
> > > I am executing the code as follows:
> > >
> > > ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper
> > > "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer
> > > "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input
> > > /logs/*.log -output test9
> > >
> > > The code I have works when given a small set of input files.
> > > However, I get the following error when attempting to run the code
> > > on a large set of input files:
> > >
> > > hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09
> > > 15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room
> > > for reduce task. Node
> > > tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has
> > > 2004049920 bytes free; but we expect reduce input to take
> > > 22138478392
> > >
> > > I assume this is because all the map output is being buffered in
> > > memory prior to running the reduce step? If so, what can I change
> > > to stop the buffering? I just need the map output to go directly
> > > to one large file.
> > >
> > > Thanks,
> > > Scott
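P.S. On Todd's mapred.local.dir point: that setting lives in hadoop-site.xml on each tasktracker, and the tasktracker needs a restart to pick up the change. A minimal sketch, with /mnt/bigdisk standing in for whatever large partition is actually available:

<!-- in hadoop-site.xml on each tasktracker -->
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/bigdisk/mapred/local</value>
</property>

The value can also be a comma-separated list of directories, which spreads the map spills and reduce-side merge files across several disks.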
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com: a community for Hadoop professionals