pig-user mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: error with pig job
Date Thu, 06 Dec 2007 20:50:22 GMT
The simple test case would be to add more records than fit in max memory to a 
BigDataBag after calling distinct. Right?
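As a sketch of the test Ben describes (toy classes with hypothetical names, not Pig's actual BigDataBag API): set a tiny in-memory limit, add more records than the limit so the bag spills to disk, then check that every record can still be read back.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Toy "spillable bag" (hypothetical; not Pig's BigDataBag) showing the
// shape of the test: records beyond the memory limit spill to a file,
// and readAll() must return the in-memory and spilled records together.
public class SpillBagSketch {
    private final int memLimit;
    private final List<String> inMemory = new ArrayList<>();
    private Path spillFile; // holds records added after memLimit was hit

    SpillBagSketch(int memLimit) { this.memLimit = memLimit; }

    void add(String record) throws IOException {
        if (inMemory.size() < memLimit) {
            inMemory.add(record);
        } else {
            if (spillFile == null) {
                spillFile = Files.createTempFile("bag", ".spill");
            }
            Files.writeString(spillFile, record + "\n", StandardOpenOption.APPEND);
        }
    }

    List<String> readAll() throws IOException {
        List<String> all = new ArrayList<>(inMemory);
        if (spillFile != null) {
            all.addAll(Files.readAllLines(spillFile)); // read the spilled tail back
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        SpillBagSketch bag = new SpillBagSketch(2); // tiny limit forces a spill
        for (int i = 0; i < 5; i++) {
            bag.add("record-" + i);
        }
        System.out.println(bag.readAll().size()); // all 5 records survive the spill
    }
}
```

The bug under discussion lives in the read-back step: a real test would assert not just the count but that each record deserializes as the expected type after the spill.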

ben

On Thursday 06 December 2007 10:44:13 Utkarsh Srivastava wrote:
> There doesn't seem to be a simple test case to reproduce this,
> because the problem happens only when we spill to disk.
>
> Utkarsh
>
> On Dec 6, 2007, at 9:05 AM, Alan Gates wrote:
> > Utkarsh,
> >
> > I can submit a patch for this today.  Do you know of a simple test
> > case that reproduces the error?
> >
> > Alan.
> >
> > Utkarsh Srivastava wrote:
> >> Alan, this is a problem with the combiner part (the problem of
> >> putting an indexed tuple directly into the bag; the first point in
> >> my comment about the combiner patch that was committed). Some of
> >> the mappers that spill their bags to disk have a problem reading
> >> them back, because what was written out was an indexed tuple,
> >> while what is read back is expected to be a regular Tuple.
> >>
> >>
> >> Utkarsh
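The failure mode Utkarsh describes is a write/read type mismatch across the spill file. A minimal illustration of that pattern (hypothetical tags and record layout, not Pig's actual serialization format): the writer emits a record with an extra index field, while the reader only accepts the plain layout and rejects everything else.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal illustration (not the actual Pig classes) of the spill bug:
// the spill path serializes an "indexed" record with an extra field,
// but the reader expects a plain record and rejects the stream.
public class SpillMismatchDemo {
    static final byte PLAIN = 1;   // hypothetical tag for a plain tuple
    static final byte INDEXED = 2; // hypothetical tag for an indexed tuple

    // Writer side: spills an indexed record (tag + index + value).
    static byte[] spillIndexed(int index, String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(INDEXED);
        out.writeInt(index);
        out.writeUTF(value);
        return bytes.toByteArray();
    }

    // Reader side: only understands plain records, so an indexed record
    // on disk looks like corruption, mirroring the reported exception.
    static String readPlain(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte tag = in.readByte();
        if (tag != PLAIN) {
            throw new RuntimeException(
                "Unexpected data while reading tuple from binary file");
        }
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        byte[] spilled = spillIndexed(0, "movie,user,rating,date");
        try {
            readPlain(spilled);
        } catch (RuntimeException e) {
            System.out.println("read failed: " + e.getMessage());
        }
    }
}
```

The fix direction follows from the sketch: either unwrap the indexed tuple before it goes into the bag, or teach the reader about both layouts, so the bytes written during the spill match what the reader expects.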
> >>
> >> On Dec 5, 2007, at 3:50 PM, Andrew Hitchcock wrote:
> >>> Hi folks,
> >>>
> >>> I'm having a problem with a Pig job I wrote: it throws exceptions
> >>> in the map phase. I'm using the latest SVN of Pig, compiled against
> >>> the Hadoop15 jar included in SVN. My cluster is running Hadoop 0.15.1
> >>> on Java 1.6.0_03. Here's the Pig job (which I ran through grunt):
> >>>
> >>> A = LOAD 'netflix/netflix.csv' USING PigStorage(',') AS (movie,user,rating,date);
> >>> B = GROUP A BY movie;
> >>> C = FOREACH B GENERATE group, COUNT(A.user) as ratingcount, AVG(A.rating) as averagerating;
> >>> D = ORDER C BY averagerating;
> >>> STORE D INTO 'output/output.tsv';
> >>>
> >>> A large number of jobs fail (but not all; some succeed) with the
> >>> following exception:
> >>>
> >>> error: Error message from task (map) tip_200712051644_0002_m_000003
> >>> java.lang.RuntimeException: Unexpected data while reading tuple from binary file
> >>>     at org.apache.pig.impl.io.DataBagFileReader$myIterator.next(DataBagFileReader.java:81)
> >>>     at org.apache.pig.impl.io.DataBagFileReader$myIterator.next(DataBagFileReader.java:41)
> >>>     at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:89)
> >>>     at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
> >>>     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
> >>>     at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
> >>>     at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:216)
> >>>     at org.apache.pig.impl.eval.FuncEvalSpec$1.add(FuncEvalSpec.java:105)
> >>>     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.<init>(GenerateSpec.java:165)
> >>>     at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:77)
> >>>     at org.apache.pig.impl.mapreduceExec.PigCombine.reduce(PigCombine.java:101)
> >>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439)
> >>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418)
> >>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:364)
> >>>     at org.apache.pig.impl.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:309)
> >>>     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
> >>>     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:242)
> >>>     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
> >>>     at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
> >>>     at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
> >>>     at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
> >>>     at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
> >>>     at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
> >>>     at org.apache.pig.impl.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
> >>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> >>>
> >>> As a comparison, the following job runs successfully:
> >>>
> >>> A = LOAD 'netflix/netflix.csv' USING PigStorage(',') AS (movie,user,rating,date);
> >>> B = FILTER A BY movie == '8';
> >>> C = GROUP B BY movie;
> >>> D = FOREACH C GENERATE group, COUNT(B.user) as ratingcount, AVG(B.rating) as averagerating;
> >>> DUMP D;
> >>>
> >>> Any help in tracking this down would be greatly appreciated. So far,
> >>> Pig is looking really slick and I'd love to write more advanced
> >>> programs with it.
> >>>
> >>> Thanks,
> >>> Andrew Hitchcock


