From Gordon Mohr <goj...@archive.org>
Subject Re: OutOfMemory error processing large amounts of gz files
Date Wed, 25 Feb 2009 05:40:29 GMT
If you're doing a lot of gzip compression/decompression, you *might* be 
hitting this 6+-year-old Sun JVM bug:

"Instantiating Inflater/Deflater causes OutOfMemoryError; finalizers not 
called promptly enough"

A workaround is listed in the issue: ensuring you call close() or end() 
on the Deflater; something similar might apply to Inflater.
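
For illustration, here is a minimal sketch of that workaround using the plain java.util.zip classes directly. The class name ZlibRoundTrip and its helper methods are hypothetical, and this is not the code path Hadoop itself uses to read gz input; it only shows the pattern of calling end() in a finally block so the native zlib memory is released immediately rather than whenever the finalizer happens to run.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, for illustration only.
public class ZlibRoundTrip {

    // Compress a byte array, always releasing the Deflater's native memory.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            // Frees the native zlib buffers now instead of waiting for finalization.
            deflater.end();
        }
    }

    // Decompress a byte array, always releasing the Inflater's native memory.
    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // Guard against looping forever on truncated input.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("incomplete or dictionary-dependent input");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            // Same workaround applied to the Inflater side.
            inflater.end();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] roundTripped = decompress(compress(original));
        System.out.println(Arrays.equals(original, roundTripped));
    }
}
```

The try/finally shape is the point here: end() runs even if compression or decompression throws, so no instance is left waiting on the garbage collector to give its native memory back.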

(This is one of those fun JVM situations where having more heap space 
may make OOMEs more likely: less heap memory pressure leaves more un-GCd 
or un-finalized heap objects around, each of which is holding a bit of 
native memory.)

- Gordon @ IA

bzheng wrote:
> I have about 24k gz files (about 550GB total) on hdfs and have a really simple
> java program to convert them into sequence files.  If the script's
> setInputPaths takes a Path[] of all 24k files, it will get an OutOfMemory
> error at about 35% map complete.  If I make the script process 2k files per
> job and run 12 jobs consecutively, then it goes through all files fine.  The
> cluster I'm using has about 67 nodes.  Each node has 16GB memory, max 7
> map, and max 2 reduce.
> The map task is really simple: it takes LongWritable as key and Text as
> value, generates a Text newKey, and calls output.collect(Text newKey, Text value).
> It doesn't have any code that can possibly leak memory.
> There's no stack trace for the vast majority of the OutOfMemory errors;
> there's just a single line in the log like this:
> 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
> java.lang.OutOfMemoryError: Java heap space
> I can't find the stack trace right now, but on rare occasions the OutOfMemory
> error originates from some hadoop config array copy operation.  There's no
> special config for the script.

