pig-user mailing list archives

From: Craig Macdonald <cra...@dcs.gla.ac.uk>
Subject: Re: OutOfMemory on DISTINCT
Date: Tue, 18 Dec 2007 19:33:06 GMT

Hi Utkarsh,

I retried on a larger cluster with more nodes. Note that I set up these 
Hadoop clusters myself, so perhaps I'm doing something wrong there. I 
also reran on the larger cluster using Java 6, as this gives a 
stack trace on OOM.

Here are the job tracker statistics:


Group                 Counter                Map         Reduce  Total
Job Counters          Failed map tasks       0           0       3
                      Launched map tasks     0           0       9
                      Launched reduce tasks  0           0       1
Map-Reduce Framework  Map input records      852,940     0       852,940
                      Map output records     852,940     0       852,940
                      Map input bytes        28,869,165  0       28,869,165
                      Map output bytes       65,414,790  0       65,414,790
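
As a rough sanity check on these counters, the sketch below (plain Python, not part of the Pig job) derives the average record size on each side of the map and the ratio of output bytes to input bytes. The variable names are illustrative. The last figure assumes that the counters refer to the same input file as the wc -l count quoted further down the thread, and that failed task attempts are not double-counted; neither is confirmed by the job tracker output.

```python
# Counter values copied from the job tracker table above.
map_input_records = 852_940
map_input_bytes = 28_869_165
map_output_bytes = 65_414_790

# Line count reported by `wc -l` in the quoted message below.
# Assumption: it describes the same file these counters were read from.
total_lines = 3_014_571

# Average size of one record before and after the map phase.
avg_in_bytes = map_input_bytes / map_input_records
avg_out_bytes = map_output_bytes / map_input_records

# How much larger the serialized map output is than the raw input.
expansion = map_output_bytes / map_input_bytes

# Share of the file's lines that show up as map input records.
fraction_read = map_input_records / total_lines

print(f"avg input bytes per record:  {avg_in_bytes:.1f}")
print(f"avg output bytes per record: {avg_out_bytes:.1f}")
print(f"output/input byte ratio:     {expansion:.2f}")
print(f"fraction of lines read:      {fraction_read:.1%}")
```

Records of a few dozen bytes each would be consistent with a file of URLs, one per line, rather than with one oversized tuple, though the counters alone do not prove it.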



All errors were at:

java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2786)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
	at org.apache.pig.data.DataAtom.write(DataAtom.java:138)
	at org.apache.pig.data.Tuple.write(Tuple.java:282)
	at org.apache.pig.data.Tuple.write(Tuple.java:282)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
	at org.apache.pig.impl.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:309)
	at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
	at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:242)
	at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
	at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
	at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
	at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
	at org.apache.pig.impl.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
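
The top frames show the error being raised inside Arrays.copyOf, called from ByteArrayOutputStream.write, that is, while a byte buffer was being enlarged. The sketch below is a simplified Python model of such a growing buffer, not the JDK or Hadoop code. The doubling policy, the function name grow_until, and its parameters are assumptions made for illustration. It shows why an enlarge step needs room for the old and the new array at the same time.

```python
def grow_until(total_bytes, initial_capacity=32, chunk=64):
    """Model a byte buffer that at least doubles whenever it is full.

    Returns (final_capacity, number_of_copies, peak_transient_bytes).
    The growth policy is an assumption, not taken from the trace.
    """
    capacity = initial_capacity
    used = 0
    copies = 0
    peak_transient = capacity
    while used < total_bytes:
        n = min(chunk, total_bytes - used)
        needed = used + n
        if needed > capacity:
            new_capacity = max(capacity * 2, needed)
            # While copying, the old and the new array are both live.
            peak_transient = max(peak_transient, capacity + new_capacity)
            capacity = new_capacity
            copies += 1
        used = needed
    return capacity, copies, peak_transient


# Example: model a buffer that receives all of the reported map output bytes.
capacity, copies, peak = grow_until(65_414_790)
print(capacity, copies, peak)
```

Under this model the transient peak can reach about three times the buffer size just before the copy, so a buffer that is allowed to grow without being flushed can exhaust the heap well before its contents alone would.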


I'll send you a URL for the data off list.

Many thanks for your persistent help.

Craig


Utkarsh Srivastava wrote:
>
> Hmm, I can't think of any reason why this might be happening. Can 
> you retry, or make the data available?
>
> Utkarsh
>
> On Dec 18, 2007, at 2:50 AM, Craig Macdonald wrote:
>
> > Hello,
> >
> > wc -l gives
> > 3014571
> >
> > - so shouldn't be loaded as a single tuple by Pig.
> >
> > C
> >
> > Utkarsh Srivastava wrote:
> >> This is really strange since your job is running out of memory on 
> >> the map side. This could happen if the input file had no newlines 
> >> (so that Pig tries to load the whole data set as a tuple). But 
> >> even then, your data is only 20M.
> >>
> >> Utkarsh
> >>
> >> On Dec 14, 2007, at 5:07 AM, Craig Macdonald wrote:
> >>
> >>> Hi All,
> >>>
> >>> I have been trying a really simple DISTINCT operator on a 20MB 
> >>> set of URLs (Hadoop cluster of 6 nodes, Java VM heap is 1000MB 
> >>> each). Any idea what's going wrong here?
> >>>
> >>> I can't see this being a problem with the ongoing spill stuff, 
> >>> because the dataset is pretty small!
> >>>
> >>> The node logs don't give much other information either!
> >>>
> >>> Thanks in advance.
> >>>
> >>> Craig
> >>>
> >>>
> >>> urls = LOAD 'file:/users/tr.craigm/Blogs08/sourceBlogs/blogger.com/recent-updates/all_13122007.txt';
> >>> Y = DISTINCT urls;
> >>> store Y into 'distincUrls';
> >>>
> >>> <snip>
> >>>
> >>> 2007-12-14 12:55:38,999 [main] INFO  org.apache.pig - Pig 
> >>> progress = 28%
> >>> 2007-12-14 12:55:43,030 [main] INFO  org.apache.pig - Pig 
> >>> progress = 29%
> >>> 2007-12-14 13:00:25,230 [main] ERROR org.apache.pig - Error 
> >>> message from task (map) tip_200712070754_0025_m_000000 
> >>> java.lang.OutOfMemoryError: Java heap space
> >>> java.lang.OutOfMemoryError: Java heap space
> >>> java.lang.OutOfMemoryError: Java heap space
> >>> java.lang.OutOfMemoryError: Java heap space
> >>>
> >>> 2007-12-14 13:00:25,288 [main] ERROR org.apache.pig - Error 
> >>> message from task (map) tip_200712070754_0025_m_000001 
> >>> java.lang.OutOfMemoryError: Java heap space
> >>> java.lang.OutOfMemoryError: Java heap space
> >>> java.lang.OutOfMemoryError: Java heap space
> >>>
> >>> 2007-12-14 13:00:25,295 [main] ERROR org.apache.pig - Error 
> >>> message from task (reduce) tip_200712070754_0025_r_000000
> >>> Job failed
> >>> grunt>
> >>
> >
>

