hadoop-common-user mailing list archives

From Scott <skes...@weather.com>
Subject Hadoop streaming - No room for reduce task error
Date Wed, 10 Jun 2009 16:40:18 GMT
Complete newbie map/reduce question here.  I am using Hadoop streaming as 
I come from a Perl background, and am trying to prototype/test a process 
to load/clean-up ad-server log lines from multiple input files into one 
large file on HDFS that can then be used as the source of a Hive database.

I have a Perl map script that reads an input line from stdin, does the 
needed cleanup/manipulation, and writes back to stdout.  I don't really 
need a reduce step, as I don't care what order the lines are written in, 
and there is no summary data to produce.  When I run the job with 
-reducer NONE I get valid output; however, I get multiple part-xxxxx 
files rather than one big file.
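
For context, here is a stripped-down sketch of the kind of mapper I 
mean (the real cleanup logic is more involved; the whitespace 
substitution below is just a stand-in):

#!/usr/bin/perl
use strict;
use warnings;

# Read raw ad-server log lines from stdin, clean each one up,
# and write the result back to stdout for Hadoop streaming.
while (my $line = <STDIN>) {
    chomp $line;
    # Placeholder for the real cleanup/manipulation: collapse
    # runs of whitespace into single tabs.
    $line =~ s/\s+/\t/g;
    print "$line\n";
}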

So I wrote a trivial 'reduce' script that reads from stdin, splits each 
line into key and value, and writes just the value back to stdout.
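
It is essentially this (a sketch; streaming hands the reducer lines of 
the form "key<TAB>value" on stdin):

#!/usr/bin/perl
use strict;
use warnings;

# Trivial pass-through reducer: strip the key that streaming
# prepends and emit only the value.
while (my $line = <STDIN>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    print "$value\n" if defined $value;
}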

I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
    -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
    -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
    -input /logs/*.log \
    -output test9

The code works when given a small set of input files.  However, I get 
the following error when attempting to run it on a large set of input 
files:

15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for 
reduce task. Node 
tracker_testdw0b00:localhost.localdomain/ has 2004049920 
bytes free; but we expect reduce input to take 22138478392

I assume this is because all the map output is being buffered in 
memory prior to running the reduce step?  If so, what can I change to 
stop the buffering?  I just need the map output to go directly to one 
large file.

