hadoop-mapreduce-user mailing list archives

From "Bae, Jae Hyeon" <metac...@gmail.com>
Subject Re: Small Files as input, Heap Size and garbage Collector
Date Wed, 24 Mar 2010 01:06:49 GMT
I don't think there is any way to avoid this limit if you have several
million small files.

As you may know, at least one InputSplit instance is created per input
file, so with several million small files you end up with several
million InputSplit instances, which can consume many gigabytes of
memory.
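A rough back-of-envelope calculation illustrates the point. The per-split overhead used here is an assumption for the sketch (a FileSplit retains a path string, start/length offsets, and a host array), not a measured figure:

```java
// Sketch only: estimate heap consumed by split objects alone.
public class SplitMemoryEstimate {
    static long estimateBytes(long numFiles, long bytesPerSplit) {
        return numFiles * bytesPerSplit;
    }

    public static void main(String[] args) {
        long numFiles = 5_000_000L;  // "several million small files"
        long perSplit = 500L;        // ASSUMED retained size per FileSplit
        System.out.printf("~%.1f GB of split objects%n",
                estimateBytes(numFiles, perSplit) / 1e9);
    }
}
```

With a larger retained size per split (longer paths, more replica hosts), the total grows proportionally, which is consistent with an 8 GB heap being exhausted during split generation.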

You should tar your small files and implement an input format for
those tar files.

One of my colleagues had a similar case with a large number of small
files: he tarred the files, wrote an input format for the archives,
and that solved the problem.
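This is not the colleague's actual code (which was never released), but a minimal self-contained sketch of the core of such an input format: walking a ustar-format tar stream entry by entry, which is what a tar-backed RecordReader would do to hand each small file to a mapper. The demo builds a tiny tar in memory by hand so it needs no external libraries:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TarEntryDemo {

    // Append one tar entry: a 512-byte header (name at offset 0,
    // 11-digit octal size at offset 124) plus data padded to 512 bytes.
    static void addEntry(ByteArrayOutputStream out, String name, byte[] data)
            throws IOException {
        byte[] header = new byte[512];
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(n, 0, header, 0, n.length);
        byte[] size = String.format("%011o", data.length)
                .getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(size, 0, header, 124, size.length);
        out.write(header);
        out.write(data);
        out.write(new byte[(512 - data.length % 512) % 512]); // block padding
    }

    // Read entries sequentially until an all-zero header block.
    // Returns "name=contents" strings; a real RecordReader would emit
    // each pair as a (key, value) record instead.
    static List<String> readEntries(InputStream in) throws IOException {
        List<String> result = new ArrayList<>();
        byte[] header = new byte[512];
        while (in.read(header) == 512 && header[0] != 0) {
            String name = new String(header, 0, 100,
                    StandardCharsets.US_ASCII).trim();
            int size = Integer.parseInt(new String(header, 124, 11,
                    StandardCharsets.US_ASCII).trim(), 8);
            byte[] data = new byte[size];
            in.read(data);
            in.skip((512 - size % 512) % 512); // skip padding to next block
            result.add(name + "=" + new String(data, StandardCharsets.US_ASCII));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream tar = new ByteArrayOutputStream();
        addEntry(tar, "a.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        addEntry(tar, "b.txt", "world".getBytes(StandardCharsets.US_ASCII));
        for (String e : readEntries(new ByteArrayInputStream(tar.toByteArray()))) {
            System.out.println(e);
        }
    }
}
```

The key property is that one large tar file produces one InputSplit regardless of how many small files it packs, so the split-generation memory problem disappears; the trade-off is that a single archive is processed by a single mapper unless you split across multiple archives.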

I would like to suggest that he open-source it, but I am not sure his
manager would permit it :)

2010/3/24 Mohamed Riadh Trad <Mohamed.Trad@inria.fr>:
> Hi,
> I am running Hadoop over a collection of several million small files using the CombineFileInputFormat.
> However, when generating splits, the job fails because of a "GC overhead limit exceeded" exception.
> I disabled the garbage collector overhead limit check with -server -XX:-UseGCOverheadLimit; I then get a java.lang.OutOfMemoryError: Java heap space with -Xmx8192m -server.
> Is there any solution to avoid this limit when splitting input?
> Regards
