pig-user mailing list archives

From Craig Macdonald <cra...@dcs.gla.ac.uk>
Subject Re: OutOfMemory on DISTINCT
Date Thu, 20 Dec 2007 14:09:22 GMT
Utkarsh,

[I've split this thread into two issues, as I have questions on 
compressed files]
> I could run the job without problems.
Hmmm.
> <snip>
> As regards the memory problem, it's most likely something wrong with 
> your hadoop cluster (perhaps a missing 0 in the default memory you 
> give to your tasks).
Sorry to be a pain - just trying to get to the bottom of this.
My cluster is set up with all defaults - i.e. only three properties are 
specified in my hadoop-site.xml (fs.default.name, mapred.job.tracker, 
dfs.replication) - so I'd be surprised if my setup were wrong; it's 
mostly just the hadoop defaults.

I assume the memory given to tasks is defined by
    mapred.child.java.opts
whose default value is -Xmx200m (see hadoop-default.xml).

Does this seem too low for this kind of job?
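
If it does need raising, I believe the per-task heap can be overridden 
in hadoop-site.xml - a sketch along these lines (the 512m figure is 
purely illustrative, not a measured requirement):

    <property>
      <name>mapred.child.java.opts</name>
      <!-- illustrative value; hadoop-default.xml ships -Xmx200m -->
      <value>-Xmx512m</value>
    </property>

Though I'd rather first understand why 200m isn't enough for a 20MB input.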

C
>
> Utkarsh
>
>
> On Dec 18, 2007, at 11:33 AM, Craig Macdonald wrote:
>
>> Hi Utkarsh,
>>
>> I retried on a larger cluster with more nodes. Note that I setup 
>> these hadoop clusters myself, so perhaps I'm doing something wrong 
>> there. I also reran again on the larger cluster using Java 6, as this 
>> gives a stack trace on OOM.
>>
>> Here are the job tracker statistics:
>>
>>
>> Counter                                          Map         Reduce  Total
>> Job Counters          Failed map tasks           0           0       3
>>                       Launched map tasks         0           0       9
>>                       Launched reduce tasks      0           0       1
>> Map-Reduce Framework  Map input records          852,940     0       852,940
>>                       Map output records         852,940     0       852,940
>>                       Map input bytes            28,869,165  0       28,869,165
>>                       Map output bytes           65,414,790  0       65,414,790
>>
>> All errors were at:
>>
>> java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOf(Arrays.java:2786)
>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>     at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
>>     at org.apache.pig.data.DataAtom.write(DataAtom.java:138)
>>     at org.apache.pig.data.Tuple.write(Tuple.java:282)
>>     at org.apache.pig.data.Tuple.write(Tuple.java:282)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
>>     at org.apache.pig.impl.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:309)
>>     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
>>     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:242)
>>     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
>>     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
>>     at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
>>     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
>>     at org.apache.pig.impl.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>
>>
>> I'll send you a URL for the data off list.
>>
>> Many thanks for your persistent help.
>>
>> Craig
>>
>>
>> Utkarsh Srivastava wrote:
>>>
>>> Hmm... I can't think of any reason why this might be happening. Can 
>>> you retry, or make the data available?
>>>
>>> Utkarsh
>>>
>>> On Dec 18, 2007, at 2:50 AM, Craig Macdonald wrote:
>>>
>>> > Hello,
>>> >
>>> > wc -l gives
>>> > 3014571
>>> >
>>> > - so it shouldn't be loaded as a single tuple by Pig.
>>> >
>>> > C
>>> >
>>> > Utkarsh Srivastava wrote:
>>> >> This is really strange, since your job is running out of memory on 
>>> >> the map side. This could happen if the input file had no newlines 
>>> >> (so that Pig tries to load the whole data set as a tuple). But 
>>> >> even then, your data is only 20M.
>>> >>
>>> >> Utkarsh
>>> >>
>>> >> On Dec 14, 2007, at 5:07 AM, Craig Macdonald wrote:
>>> >>
>>> >>> Hi All,
>>> >>>
>>> >>> I have been trying a really simple DISTINCT operator on a 20MB 
>>> >>> set of URLs (hadoop cluster of 6 nodes - Java VM heap is 1000MB 
>>> >>> each). Any idea what's going wrong here?
>>> >>>
>>> >>> I can't see this being a problem with the ongoing spill stuff, 
>>> >>> because the dataset is pretty small!
>>> >>>
>>> >>> The node logs don't give much other information either!
>>> >>>
>>> >>> Thanks in advance.
>>> >>>
>>> >>> Craig
>>> >>>
>>> >>>
>>> >>> urls = LOAD 'file:/users/tr.craigm/Blogs08/sourceBlogs/blogger.com/recent-updates/all_13122007.txt';
>>> >>> Y = DISTINCT urls;
>>> >>> STORE Y INTO 'distincUrls';
>>> >>>
>>> >>> <snip>
>>> >>>
>>> >>> 2007-12-14 12:55:38,999 [main] INFO  org.apache.pig - Pig progress = 28%
>>> >>> 2007-12-14 12:55:43,030 [main] INFO  org.apache.pig - Pig progress = 29%
>>> >>> 2007-12-14 13:00:25,230 [main] ERROR org.apache.pig - Error message from task (map) tip_200712070754_0025_m_000000
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>>
>>> >>> 2007-12-14 13:00:25,288 [main] ERROR org.apache.pig - Error message from task (map) tip_200712070754_0025_m_000001
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>>
>>> >>> 2007-12-14 13:00:25,295 [main] ERROR org.apache.pig - Error message from task (reduce) tip_200712070754_0025_r_000000
>>> >>> Job failed
>>> >>> grunt>
>>> >>
>>> >
>>>
>>
>

