hadoop-mapreduce-user mailing list archives

From Friso van Vollenhoven <fvanvollenho...@xebia.com>
Subject Re: Hadoop cluster on EC2: hangs on big chunks of data
Date Thu, 27 Oct 2011 08:41:44 GMT
What is your input data? Some file formats are not splittable because they use a non-splittable
compression codec (like gzip). Could that be the case for you?
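(For illustration: gzip streams have no synchronization points, so a reader cannot start decompressing at an arbitrary byte offset. That is why Hadoop assigns a whole .gz file to a single mapper instead of splitting it. A minimal standalone sketch of this, using only `java.util.zip` — not Hadoop code itself:)

```java
import java.io.*;
import java.util.zip.*;

public class GzipSplitDemo {
    public static void main(String[] args) throws IOException {
        // Build a gzip blob the way a .gz input file would be stored.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (int i = 0; i < 1000; i++) {
                gz.write(("record-" + i + "\n").getBytes("UTF-8"));
            }
        }
        byte[] blob = buf.toByteArray();

        // Reading from the start of the stream works fine.
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(blob))) {
            while (in.read() != -1) { /* drain */ }
            System.out.println("from offset 0: OK");
        }

        // A "split" starting mid-stream cannot be decoded: the reader
        // handed bytes [100:] finds no gzip header there and fails.
        // A mapper given the second half of a .gz file would be in the
        // same position, which is why the file is never split.
        try (GZIPInputStream in = new GZIPInputStream(
                 new ByteArrayInputStream(blob, 100, blob.length - 100))) {
            while (in.read() != -1) { /* drain */ }
            System.out.println("from offset 100: OK");
        } catch (IOException e) {
            System.out.println("from offset 100: failed as expected");
        }
    }
}
```

By contrast, block-oriented codecs (bzip2, or LZO with an index) mark resynchronization points, which is what lets Hadoop start a split in the middle of such a file.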


On 25 Oct 2011, at 21:43, Artem Yankov wrote:

It looks like the input data is not being split correctly. It always generates only one map task
and assigns it to one of the nodes. I tried passing parameters like -D mapred.max.split.size,
but they don't seem to have any effect.

So the question would be: how do I specify the maximum number of input records each mapper can process?
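(A note on that flag: mapred.max.split.size is consulted by FileInputFormat-based input formats when they compute splits over HDFS files. If the job reads through mongo-hadoop's MongoInputFormat, the splits come from the plugin's own logic and this property is likely ignored. A hedged sketch for the HDFS-file case — the job jar, class, and paths here are hypothetical:)

```shell
# Hypothetical invocation: cap each split at 64 MB (67108864 bytes) so
# more map tasks are generated. Only effective for FileInputFormat-based
# inputs that are actually splittable (i.e. not gzip-compressed).
hadoop jar my-job.jar com.example.MyJob \
  -D mapred.max.split.size=67108864 \
  /user/me/input /user/me/output
```

(For -D flags to be picked up this way, the job's driver class must go through ToolRunner / implement the Tool interface.)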

On Tue, Oct 25, 2011 at 10:56 AM, Artem Yankov <artem.yankov@gmail.com> wrote:

I set up a hadoop cluster on EC2 using this documentation: http://wiki.apache.org/hadoop/AmazonEC2

OS: Linux Fedora 8
Hadoop version is
java version "1.7.0_01"
heap size: 1 GB (stats always show that only 4% of it is used)
I use the mongo-hadoop plugin to get data from MongoDB.

Everything seems to work perfectly with small chunks of data: calculations are fast, I get the
results, and tasks seem to be distributed normally among the slaves.

Then I tried to load a huge amount of data (22 million records) and everything hangs. The first
slave receives a map task and the other slaves do not. In the logs I constantly see this:

INFO org.apache.hadoop.hdfs.StateChange: *BLOCK* NameSystem.processReport: from x.x.x.x:50010,
blocks: 2, processing time: 0 m

I tried using a different number of slaves (at most I ran 25 nodes), but it doesn't help, because
it seems that once the first slave receives a job, it blocks everything else. (Again, everything
works fine with small chunks of data.)

There is no significant CPU or memory load on the master.

Any ideas on what the reason for this could be?

