Message-ID: <476A7792.1060106@dcs.gla.ac.uk>
Date: Thu, 20 Dec 2007 14:09:22 +0000
From: Craig Macdonald
To: pig-user@incubator.apache.org
Subject: Re: OutOfMemory on DISTINCT
In-Reply-To: <1C316160-F366-46FE-A84A-A366F204F9EC@yahoo-inc.com>

Utkarsh,

[I've split this thread into two issues, as I have questions on compressed
files.]

> I could run the job without problems.

Hmmm.

> As regards the memory problem, it's most likely something wrong with
> your hadoop cluster (perhaps a missing 0 in the default memory you
> give to your tasks).

Sorry to be a pain - just trying to get to the bottom of this.

My cluster is set up with all defaults - i.e. only three properties are
specified in my hadoop-site.xml (fs.default.name, mapred.job.tracker,
dfs.replication) - so I'd be surprised if my setup were wrong; it's
mostly just hadoop defaults.

I assume the memory given to tasks is defined by mapred.child.java.opts,
whose default value is -Xmx200m (see hadoop-default.xml). Does this seem
too low for this kind of job?
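If it is, I guess the fix would be to override it in hadoop-site.xml -
something like the sketch below (untested on my side; the property name
and its -Xmx200m default come from hadoop-default.xml, and 512m is just
an arbitrary first guess, not a recommended value):

    <!-- hadoop-site.xml: raise the heap given to map/reduce child JVMs.
         512m is an assumed value for illustration only. -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>

Still, I'd like to understand why a DISTINCT over a 20MB input would need
more than 200MB of heap in the first place.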
C

> Utkarsh
>
> On Dec 18, 2007, at 11:33 AM, Craig Macdonald wrote:
>
>> Hi Utkarsh,
>>
>> I retried on a larger cluster with more nodes. Note that I set up
>> these hadoop clusters myself, so perhaps I'm doing something wrong
>> there. I also reran on the larger cluster using Java 6, as this
>> gives a stack trace on OOM.
>>
>> Here are the job tracker statistics:
>>
>> Counter                                  Map         Reduce  Total
>> Job Counters
>>   Failed map tasks                       0           0       3
>>   Launched map tasks                     0           0       9
>>   Launched reduce tasks                  0           0       1
>> Map-Reduce Framework
>>   Map input records                      852,940     0       852,940
>>   Map output records                     852,940     0       852,940
>>   Map input bytes                        28,869,165  0       28,869,165
>>   Map output bytes                       65,414,790  0       65,414,790
>>
>> All errors were at:
>>
>> java.lang.OutOfMemoryError: Java heap space
>>   at java.util.Arrays.copyOf(Arrays.java:2786)
>>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>>   at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>   at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
>>   at org.apache.pig.data.DataAtom.write(DataAtom.java:138)
>>   at org.apache.pig.data.Tuple.write(Tuple.java:282)
>>   at org.apache.pig.data.Tuple.write(Tuple.java:282)
>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
>>   at org.apache.pig.impl.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:309)
>>   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
>>   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:242)
>>   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
>>   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
>>   at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
>>   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
>>   at org.apache.pig.impl.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>
>> I'll send you a URL for the data off list.
>>
>> Many thanks for your persistent help.
>>
>> Craig
>>
>> Utkarsh Srivastava wrote:
>>>
>>> Hmm... I can't think of any reason why this might be happening. Can
>>> you retry, or make the data available?
>>>
>>> Utkarsh
>>>
>>> On Dec 18, 2007, at 2:50 AM, Craig Macdonald wrote:
>>>
>>> > Hello,
>>> >
>>> > wc -l gives
>>> > 3014571
>>> >
>>> > - so it shouldn't be loaded as a single tuple by Pig.
>>> >
>>> > C
>>> >
>>> > Utkarsh Srivastava wrote:
>>> >> This is really strange, since your job is running out of memory on
>>> >> the map side. This could happen if the input file had no newlines
>>> >> (so that Pig tries to load the whole data set as a single tuple). But
>>> >> even then, your data is only 20M.
>>> >>
>>> >> Utkarsh
>>> >>
>>> >> On Dec 14, 2007, at 5:07 AM, Craig Macdonald wrote:
>>> >>
>>> >>> Hi All,
>>> >>>
>>> >>> I have been trying a really simple DISTINCT operator on a 20MB
>>> >>> set of URLs (hadoop cluster of 6 nodes - Java VM heap is 1000MB
>>> >>> each). Any idea what's going wrong here?
>>> >>>
>>> >>> I can't see this being a problem with the ongoing spill stuff,
>>> >>> because the dataset is pretty small!
>>> >>>
>>> >>> The node logs don't give much other information either!
>>> >>>
>>> >>> Thanks in advance.
>>> >>> Craig
>>> >>>
>>> >>> urls = LOAD 'file:/users/tr.craigm/Blogs08/sourceBlogs/blogger.com/recent-updates/all_13122007.txt';
>>> >>> Y = DISTINCT urls;
>>> >>> STORE Y INTO 'distincUrls';
>>> >>>
>>> >>> 2007-12-14 12:55:38,999 [main] INFO org.apache.pig - Pig progress = 28%
>>> >>> 2007-12-14 12:55:43,030 [main] INFO org.apache.pig - Pig progress = 29%
>>> >>> 2007-12-14 13:00:25,230 [main] ERROR org.apache.pig - Error message from task (map) tip_200712070754_0025_m_000000
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>>
>>> >>> 2007-12-14 13:00:25,288 [main] ERROR org.apache.pig - Error message from task (map) tip_200712070754_0025_m_000001
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>> java.lang.OutOfMemoryError: Java heap space
>>> >>>
>>> >>> 2007-12-14 13:00:25,295 [main] ERROR org.apache.pig - Error message from task (reduce) tip_200712070754_0025_r_000000
>>> >>> Job failed
>>> >>> grunt>