hadoop-mapreduce-user mailing list archives

From Artem Yankov <artem.yan...@gmail.com>
Subject Re: Hadoop cluster on EC2: hangs on big chunks of data
Date Tue, 25 Oct 2011 19:43:32 GMT
It looks like the input data is not being split correctly. It always generates
only one map task and hands it to one of the nodes. I tried passing parameters
like -D mapred.max.split.size, but it doesn't seem to have any effect.
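
One thing worth double-checking: -D flags are only applied when the driver
runs through ToolRunner/GenericOptionsParser; if main() builds the Job
directly, the flags are silently dropped. A minimal driver sketch, with
made-up class and job names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D key=value pairs from the command line
    Job job = new Job(getConf(), "my-job");
    job.setJarByClass(MyJobDriver.class);
    // ... set mapper, reducer, input/output formats and paths here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
  }
}

Invoked as, e.g.:
hadoop jar myjob.jar MyJobDriver -D mapred.max.split.size=67108864 <other args>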

So the question would be: how do I specify the maximum number of input records
each mapper can receive?
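
A guess, since the input here comes from MongoDB via the mongo-hadoop
connector rather than from HDFS files: the splits are computed by the
connector's MongoInputFormat, so a FileInputFormat property like
mapred.max.split.size would be ignored no matter how it is passed, which
would explain the single map task. The connector appears to expose its own
split-size setting; the key below is my reading of the mongo-hadoop docs
(please verify it against the connector version in use), shown inside the
run() method of a driver like the one above:

// "mongo.input.split_size" is an assumed mongo-hadoop key (value in MB);
// smaller values should produce more splits and therefore more map tasks.
Configuration conf = getConf();
conf.set("mongo.input.split_size", "8");
Job job = new Job(conf, "mongo-job");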

On Tue, Oct 25, 2011 at 10:56 AM, Artem Yankov <artem.yankov@gmail.com> wrote:

> Hey,
>
> I set up a hadoop cluster on EC2 using this documentation:
> http://wiki.apache.org/hadoop/AmazonEC2
>
> OS: Linux Fedora 8
> Hadoop version is 0.20.203.0
> java version "1.7.0_01"
> heap size: 1 GB (stats always show that only 4% of it is used)
> I use the mongo-hadoop plugin to get data from MongoDB.
>
> Everything seems to work perfectly with small chunks of data: calculations
> are fast, I'm getting the results, and the tasks seem to be distributed
> normally among the slaves.
>
> Then I try to load a huge amount of data (22 million records) and
> everything hangs. The first slave receives a map task, but the other slaves
> do not. In the logs I constantly see this:
>
> INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport:
> from x.x.x.x:50010, blocks: 2, processing time: 0 msecs
> I tried using different numbers of slaves (at most I ran 25 nodes), but it
> doesn't help, because it seems that once the first slave receives a job it
> blocks everything else. (Again, everything works fine with small chunks of
> data.)
>
> There is no significant CPU or memory load on the master.
>
> Any ideas on what could be causing this?
>
> Artem.
>
>
