Message-ID: <49A4D9CD.4050206@archive.org>
Date: Tue, 24 Feb 2009 21:40:29 -0800
From: Gordon Mohr
To: core-user@hadoop.apache.org
Subject: Re: OutOfMemory error processing large amounts of gz files
In-Reply-To: <22193552.post@talk.nabble.com>

If you're doing a lot of gzip compression/decompression, you *might* be
hitting this 6+-year-old Sun JVM bug:

"Instantiating Inflater/Deflater causes OutOfMemoryError; finalizers not
called promptly enough"
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4797189

A workaround is listed in the issue: make sure you call close() or end() on
the Deflater; something similar may apply to Inflater. (A rough sketch of
that kind of explicit cleanup follows the quoted message below.)

(This is one of those fun JVM situations where having more heap space can
make OOMEs *more* likely: with less heap pressure, more un-GC'd or
un-finalized objects hang around, each holding a bit of native memory.)

- Gordon @ IA

bzheng wrote:
> I have about 24k gz files (about 550GB total) on HDFS and a really simple
> Java program to convert them into sequence files. If the script's
> setInputPaths takes a Path[] of all 24k files, it gets an OutOfMemory
> error at about 35% map completion. If I make the script process 2k files
> per job and run 12 jobs consecutively, it goes through all the files
> fine. The cluster I'm using has about 67 nodes; each node has 16GB
> memory, a max of 7 map tasks, and a max of 2 reduce tasks.
>
> The map task is really simple: it takes a LongWritable key and a Text
> value, generates a Text newKey, and calls output.collect(newKey, value).
> It doesn't have any code that could possibly leak memory.
>
> There's no stack trace for the vast majority of the OutOfMemory errors;
> there's just a single line in the log like this:
>
> 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
> java.lang.OutOfMemoryError: Java heap space
>
> I can't find the stack trace right now, but on rare occasions the
> OutOfMemory error originates from some Hadoop config array copy
> operation. There's no special config for the script.
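
As a rough sketch of the workaround described above (the class and method
names here are illustrative only, not part of Hadoop or this thread): call
end() on a Deflater/Inflater in a finally block so the native zlib memory is
released immediately, instead of waiting for the finalizer. For stream
wrappers such as GZIPInputStream, closing the stream in a finally block has
the same effect.

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative sketch: release zlib's native state with end() in a finally
// block rather than relying on finalization to free it.
public class ZlibCleanupSketch {

    // Compress a byte array, always ending the Deflater when done.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();   // frees the native zlib memory immediately
        }
    }

    // Decompress a zlib-compressed byte array, ending the Inflater the same way.
    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
            return out.toByteArray();
        } finally {
            inflater.end();   // same idea on the decompression side
        }
    }
}

If the map task opens compressed streams directly, the analogous fix is to
close each stream in a finally block as soon as it has been consumed, rather
than letting it fall out of scope and waiting for the finalizer to release
its native buffers.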