hadoop-common-user mailing list archives

From bzheng <bing.zh...@gmail.com>
Subject Re: OutOfMemory error processing large amounts of gz files
Date Tue, 03 Mar 2009 00:29:57 GMT

Thanks for all the info.  Upon further investigation, we are dealing with two
separate issues:

1.  problem processing a lot of gz files 

we have tried the hadoop.native.lib setting and it makes little difference. 
however, this is not a big deal, since we can run multiple jobs, each
processing a small chunk of the files, instead of one big job processing all
of them.
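The chunking workaround above amounts to slicing the full input list into fixed-size batches and submitting one job per batch. A minimal plain-Java sketch (the class and method names are hypothetical, and the real job would pass each batch to FileInputFormat.setInputPaths):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split the full list of input paths into
// fixed-size batches, one batch per job submission.
public class BatchPaths {
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<List<T>>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        // 24k files in batches of 2k gives the 12 consecutive jobs
        // described in the quoted message below.
        List<String> paths = new ArrayList<String>();
        for (int i = 0; i < 24000; i++) paths.add("part-" + i + ".gz");
        List<List<String>> jobs = batches(paths, 2000);
        System.out.println(jobs.size());        // 12
        System.out.println(jobs.get(0).size()); // 2000
    }
}
```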

2.  jobtracker out of memory

by increasing the amount of memory for the jobtracker, we can only delay the
inevitable.  since the jobtracker's memory usage keeps growing as we run
more jobs, we will need to restart the cluster once this error happens.  we
are currently using 0.18.3 and are holding off upgrading to a different
version because we don't want to lose the existing files on HDFS.
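For reference, the jobtracker heap is usually raised in conf/hadoop-env.sh; a sketch of the relevant lines (the values here are illustrative, not recommendations):

```shell
# conf/hadoop-env.sh
export HADOOP_HEAPSIZE=2000             # default heap in MB for all Hadoop daemons
export HADOOP_JOBTRACKER_OPTS="-Xmx4g"  # JVM options applied only to the JobTracker
```

Per-daemon `_OPTS` variables are appended after the default heap setting, so the `-Xmx` here overrides `HADOOP_HEAPSIZE` for the jobtracker alone.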

bzheng wrote:
> I have about 24k gz files (about 550GB total) on HDFS and a really
> simple Java program to convert them into sequence files.  If the job's
> setInputPaths takes a Path[] of all 24k files, it gets an OutOfMemory
> error at about 35% map complete.  If I make the job process 2k files
> per run and run 12 jobs consecutively, it goes through all the files
> fine.  The cluster I'm using has about 67 nodes.  Each node has 16GB
> memory, max 7 map, and max 2 reduce slots.
> The map task is really simple: it takes a LongWritable key and a Text
> value, generates a Text newKey, and calls output.collect(newKey,
> value).  It doesn't have any code that could possibly leak memory.
> There's no stack trace for the vast majority of the OutOfMemory errors;
> there's just a single line in the log like this:
> 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
> java.lang.OutOfMemoryError: Java heap space
> I can't find the stack trace right now, but in rare cases the OutOfMemory
> error originates from some Hadoop config array copy operation.  There's
> no special config for the job.
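The map step quoted above is a pure key rewrite. A plain-Java sketch with the OutputCollector simulated by a List so the logic is testable; the newKey derivation is hypothetical, since the original post doesn't say how the new key is built, and the real job would implement org.apache.hadoop.mapred.Mapper:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the map step described in the quoted message:
// derive a new key from the input and pass the value through unchanged.
public class KeyRewrite {
    // Hypothetical key derivation -- here it just tags the record with
    // its byte offset, standing in for whatever the real job computes.
    static String newKey(long offset, String value) {
        return "key-" + offset;
    }

    // Simulated map(): the List plays the role of the OutputCollector.
    static void map(long offset, String value, List<String[]> collector) {
        collector.add(new String[] { newKey(offset, value), value });
    }
}
```

As written, each call allocates only the new key and a two-element array, which is consistent with the poster's point that the map logic itself has nothing that could leak memory.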

View this message in context: http://www.nabble.com/OutOfMemory-error-processing-large-amounts-of-gz-files-tp22193552p22300192.html
