From: Benjamin Reed
To: pig-user@incubator.apache.org
Subject: Re: error with pig job
Date: Thu, 6 Dec 2007 12:50:22 -0800
Cc: Utkarsh Srivastava, marty.springer@gmail.com, sam990912@gmail.com
Message-Id: <200712061250.23062.breed@yahoo-inc.com>

The simple test case would be to add more records than will fit in memory to a BigDataBag after calling distinct. Right?

ben

On Thursday 06 December 2007 10:44:13 Utkarsh Srivastava wrote:
> There doesn't seem to be a simple test case to reproduce this,
> because the problem happens only when we spill to disk.
>
> Utkarsh
>
> On Dec 6, 2007, at 9:05 AM, Alan Gates wrote:
> > Utkarsh,
> >
> > I can submit a patch for this today. Do you know of a simple test
> > case that reproduces the error?
> >
> > Alan.
> >
> > Utkarsh Srivastava wrote:
> >> Alan, this is a problem with the combiner part (the problem of
> >> putting an indexed tuple directly into the bag, the first point in
> >> my comment about the combiner patch that was committed). Some of
> >> the mappers that spill their bags to disk have a problem reading
> >> them back, because what was written out was an indexed tuple,
> >> while what is expected to be read is a regular Tuple.
> >>
> >> Utkarsh
> >>
> >> On Dec 5, 2007, at 3:50 PM, Andrew Hitchcock wrote:
> >>> Hi folks,
> >>>
> >>> I'm having a problem with a Pig job I wrote; it is throwing
> >>> exceptions in the map phase. I'm using the latest SVN of Pig,
> >>> compiled against the Hadoop15 jar included in SVN. My cluster is
> >>> running Hadoop 0.15.1 on Java 1.6.0_03.
> >>> Here's the pig job (which I ran through grunt):
> >>>
> >>> A = LOAD 'netflix/netflix.csv' USING PigStorage(',') AS
> >>>     (movie,user,rating,date);
> >>> B = GROUP A BY movie;
> >>> C = FOREACH B GENERATE group, COUNT(A.user) as ratingcount,
> >>>     AVG(A.rating) as averagerating;
> >>> D = ORDER C BY averagerating;
> >>> STORE D INTO 'output/output.tsv';
> >>>
> >>> A large number of jobs fail (but not all, some succeed) with the
> >>> following exception:
> >>>
> >>> error: Error message from task (map) tip_200712051644_0002_m_000003
> >>> java.lang.RuntimeException: Unexpected data while reading tuple from binary file
> >>>   at org.apache.pig.impl.io.DataBagFileReader$myIterator.next(DataBagFileReader.java:81)
> >>>   at org.apache.pig.impl.io.DataBagFileReader$myIterator.next(DataBagFileReader.java:41)
> >>>   at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:89)
> >>>   at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
> >>>   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
> >>>   at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
> >>>   at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:216)
> >>>   at org.apache.pig.impl.eval.FuncEvalSpec$1.add(FuncEvalSpec.java:105)
> >>>   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.<init>(GenerateSpec.java:165)
> >>>   at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:77)
> >>>   at org.apache.pig.impl.mapreduceExec.PigCombine.reduce(PigCombine.java:101)
> >>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439)
> >>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418)
> >>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:364)
> >>>   at org.apache.pig.impl.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:309)
> >>>   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
> >>>   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:242)
> >>>   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
> >>>   at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
> >>>   at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
> >>>   at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
> >>>   at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
> >>>   at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
> >>>   at org.apache.pig.impl.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
> >>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >>>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> >>>
> >>> As a comparison, the following job runs successfully:
> >>>
> >>> A = LOAD 'netflix/netflix.csv' USING PigStorage(',') AS
> >>>     (movie,user,rating,date);
> >>> B = FILTER A BY movie == '8';
> >>> C = GROUP B BY movie;
> >>> D = FOREACH C GENERATE group, COUNT(B.user) as ratingcount,
> >>>     AVG(B.rating) as averagerating;
> >>> DUMP D;
> >>>
> >>> Any help in tracking this down would be greatly appreciated. So far,
> >>> Pig is looking really slick and I'd love to write more advanced
> >>> programs with it.
> >>>
> >>> Thanks,
> >>> Andrew Hitchcock
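[Archive note: the write/read mismatch Utkarsh describes — the combiner spilling records in an indexed-tuple layout while the spill reader expects a plain Tuple — can be sketched outside of Pig. The classes and helper methods below are hypothetical stand-ins, not Pig's actual DataBagFileReader or tuple classes; they only illustrate why a layout mismatch surfaces as "unexpected data" on the read path.]

```java
import java.io.*;

// Minimal, self-contained sketch (NOT Pig's real classes) of the spill
// mismatch: the writer prefixes each record with an index field, but the
// reader expects a plain tuple layout (field count, then fields).
public class SpillMismatchSketch {

    // Hypothetical stand-in for the spilling combiner's writer: it emits
    // a leading index before the tuple payload.
    static void writeIndexedTuple(DataOutput out, int index, String[] fields)
            throws IOException {
        out.writeInt(index);          // extra field the reader does not expect
        out.writeInt(fields.length);
        for (String f : fields) out.writeUTF(f);
    }

    // Hypothetical stand-in for the spill-file reader: it assumes the
    // first int is the field count.
    static String[] readPlainTuple(DataInput in) throws IOException {
        int n = in.readInt();
        String[] fields = new String[n];
        for (int i = 0; i < n; i++) fields[i] = in.readUTF();
        return fields;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream spill = new ByteArrayOutputStream();
        writeIndexedTuple(new DataOutputStream(spill), 7,
                          new String[] {"8", "user1", "4"});

        // Reading the spill back as a plain tuple misinterprets the leading
        // index as the field count, so the reader walks into bytes it cannot
        // make sense of -- the "unexpected data" in the stack trace above.
        DataInputStream bad = new DataInputStream(
                new ByteArrayInputStream(spill.toByteArray()));
        System.out.println("plain reader sees field count "
                           + bad.readInt() + " (actual: 3)");

        // Reading what was actually written (index first, then the tuple)
        // recovers the record intact.
        DataInputStream good = new DataInputStream(
                new ByteArrayInputStream(spill.toByteArray()));
        good.readInt();               // consume the index
        String[] fields = readPlainTuple(good);
        System.out.println("after skipping index: " + fields.length + " fields");
    }
}
```

The fix direction implied by the thread follows the same shape: either never put the indexed form into the bag, or make the reader aware of the layout that was actually spilled.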