hadoop-common-user mailing list archives

From: Tom White <...@cloudera.com>
Subject: Re: OutOfMemory error processing large amounts of gz files
Date: Wed, 25 Feb 2009 18:32:57 GMT
Do you experience the problem with and without native compression? Set
hadoop.native.lib to false to disable native compression.
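For example, a minimal sketch of setting it programmatically (MyJob is just a
placeholder for your job class):

    // Force the pure-Java codecs by disabling the native library
    JobConf conf = new JobConf(MyJob.class);
    conf.setBoolean("hadoop.native.lib", false);

You can also set the same property in hadoop-site.xml if you want to disable
it everywhere rather than per job.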

Cheers,
Tom

On Tue, Feb 24, 2009 at 9:40 PM, Gordon Mohr <gojomo@archive.org> wrote:
> If you're doing a lot of gzip compression/decompression, you *might* be
> hitting this 6+-year-old Sun JVM bug:
>
> "Instantiating Inflater/Deflater causes OutOfMemoryError; finalizers not
> called promptly enough"
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4797189
>
> A workaround is listed in the issue: ensuring you call close() or end() on
> the Deflater; something similar might apply to Inflater.
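> A minimal sketch of that workaround, if you're creating the objects
> yourself, is to release the native memory explicitly rather than waiting
> on the finalizer:
>
>     Deflater deflater = new Deflater();   // java.util.zip.Deflater
>     try {
>         // ... compress with deflater ...
>     } finally {
>         deflater.end();  // frees the native zlib buffers immediately
>     }
>
> (The same end() pattern applies to Inflater.)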
>
> (This is one of those fun JVM situations where having more heap space may
> make OOMEs more likely: less heap memory pressure leaves more un-GCd or
> un-finalized heap objects around, each of which is holding a bit of native
> memory.)
>
> - Gordon @ IA
>
> bzheng wrote:
>>
>> I have about 24k gz files (about 550GB total) on HDFS and a really simple
>> Java program to convert them into sequence files.  If the script's
>> setInputPaths takes a Path[] of all 24k files, it gets an OutOfMemory
>> error at about 35% map completion.  If I make the script process 2k files
>> per job and run 12 jobs consecutively, then it goes through all the files
>> fine.  The cluster I'm using has about 67 nodes.  Each node has 16GB
>> memory, a max of 7 map slots, and a max of 2 reduce slots.
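>> The per-job batching is roughly this (class and variable names here are
>> placeholders; it uses the old org.apache.hadoop.mapred API):
>>
>>     // run one job per chunk of 2k input files
>>     int batchSize = 2000;
>>     for (int i = 0; i < allPaths.length; i += batchSize) {
>>       Path[] batch = Arrays.copyOfRange(allPaths, i,
>>           Math.min(i + batchSize, allPaths.length));
>>       JobConf job = new JobConf(GzToSeq.class);
>>       FileInputFormat.setInputPaths(job, batch);
>>       FileOutputFormat.setOutputPath(job, new Path(outputDir, "batch-" + i));
>>       JobClient.runJob(job);
>>     }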
>>
>> The map task is really simple: it takes a LongWritable key and a Text
>> value, generates a Text newKey, and calls output.collect(newKey, value).
>> It doesn't have any code that could possibly leak memory.
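>> In sketch form (the class name is a placeholder and I've omitted how
>> newKey is derived), the mapper is just:
>>
>>     public static class ConvertMapper extends MapReduceBase
>>         implements Mapper<LongWritable, Text, Text, Text> {
>>       public void map(LongWritable key, Text value,
>>           OutputCollector<Text, Text> output, Reporter reporter)
>>           throws IOException {
>>         Text newKey = new Text(/* derived from the input; details omitted */);
>>         output.collect(newKey, value);
>>       }
>>     }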
>>
>> There's no stack trace for the vast majority of the OutOfMemory errors;
>> there's just a single line in the log like this:
>>
>> 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
>> java.lang.OutOfMemoryError: Java heap space
>>
>> I can't find the stack trace right now, but on rare occasions the
>> OutOfMemory error originates from some Hadoop config array copy
>> operation.  There's no special config for the script.
>
