hadoop-pig-dev mailing list archives

From Sam Rash <s...@ning.com>
Subject Re: Is Pig dropping records?
Date Fri, 20 Nov 2009 18:28:41 GMT
Hi,

This reminds me of something else, though: I took the latest
patch for PIG-911 (sequence file reader) and found it skipped records

https://issues.apache.org/jira/browse/PIG-911

What I found is that the condition in getNext() would miss records:

if (reader != null
    && (reader.getPosition() < end || !reader.syncSeen())
    && reader.next(key, value)) {
...
}

I had to change it to:

if (reader != null
    && reader.next(key, value)
    && (reader.getPosition() < end || !reader.syncSeen())) {
...
}

(I also ended up breaking this out into read(key) and the get below, to
support reading types other than Writable)

This only happened when the files pig read were more than one
block; i.e., the records dropped were around block boundaries.
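
To see why the clause order matters: && short-circuits, so with the original
condition next() is never called once the position/sync test fails, and that
test is evaluated against state left over from the previous read. Below is a
toy sketch of just that evaluation-order effect; it is not a model of
SequenceFile split semantics, and staleStop is a hypothetical stand-in for a
stop test computed from pre-read state:

```java
// Hypothetical illustration of short-circuit && ordering; not Pig/Hadoop code.
public class GuardOrderDemo {

    /** Returns how many times next() runs under each clause order. */
    static int[] evaluate(boolean staleStop) {
        final int[] reads = {0};
        // next() has a side effect: it consumes a record.
        java.util.function.BooleanSupplier next = () -> { reads[0]++; return true; };

        boolean guardFirst = !staleStop && next.getAsBoolean(); // original order
        int guardFirstReads = reads[0];

        reads[0] = 0;
        boolean readFirst = next.getAsBoolean() && !staleStop;  // reordered fix
        int readFirstReads = reads[0];

        return new int[] { guardFirstReads, readFirstReads };
    }

    public static void main(String[] args) {
        int[] r = evaluate(true); // stop test already true from stale state
        // guard-first never consumes the record; read-first still does
        System.out.println(r[0] + " " + r[1]); // prints "0 1"
    }
}
```

With the reordered condition the record is always read first, so the boundary
test is applied to the reader's post-read state rather than to stale state.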

has anyone noticed this?

thx,
-sr

Sam Rash
samr@ning.com



On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:

> Zaki,
> Glad to hear it wasn't Pig's fault!
> Can you post a description of what was going on with S3, or at least
> how you fixed it?
>
> -D
>
> On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <zaki.rahaman@gmail.com>
> wrote:
> > Okay fixed some problem with corrupted file transfers from S3... now
> > wc -l produces the same 143710 records... so yea its not a problem...
> > and now I am getting the correct result from both methods... not sure
> > what went wrong... thanks for the help though guys.
> >
> > On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <tejas@yahoo-inc.com>
> > wrote:
> >
> >> Another thing to verify is that clickurl's position in the schema is
> >> correct.
> >> -Thejas
> >>
> >>
> >>
> >> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <ashutosh.chauhan@gmail.com>
> >> wrote:
> >>
> >> > Hmm... Are you sure that your records are separated by \n (newline)
> >> > and fields by \t (tab)?  If so, would it be possible for you to
> >> > upload your dataset (possibly smaller) somewhere so that someone can
> >> > take a look at it.
> >> >
> >> > Ashutosh
> >> >
> >> >> On Thu, Nov 19, 2009 at 14:35, zaki rahaman
> >> >> <zaki.rahaman@gmail.com> wrote:
> >> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan
> >> >> <ashutosh.chauhan@gmail.com> wrote:
> >> >>
> >> >>> Hi Zaki,
> >> >>>
> >> >>> Just to narrow down the problem, can you do:
> >> >>>
> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> >> >>> dump A;
> >> >>>
> >> >>
> >> >> This produced 143710 records;
> >> >>
> >> >>
> >> >>> and
> >> >>>
> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
> >> >>> timestamp:chararray,
> >> >>> ip:chararray,
> >> >>> userid:chararray,
> >> >>> dist:chararray,
> >> >>> clickid:chararray,
> >> >>> usra:chararray,
> >> >>> campaign:chararray,
> >> >>> clickurl:chararray,
> >> >>> plugin:chararray,
> >> >>> tab:chararray,
> >> >>> feature:chararray);
> >> >>> dump A;
> >> >>>
> >> >>
> >> >>
> >> >> This produced 143710 records (so no problem there);
> >> >>
> >> >>
> >> >>> and
> >> >>>
> >> >>> cut -f8 *week.46*clickLog.2009* | wc -l
> >> >>>
> >> >>
> >> >>
> >> >> This produced...
> >> >> 175572
> >> >>
> >> >> Clearly, something is wrong...
> >> >>
> >> >>
> >> >>> Thanks,
> >> >>> Ashutosh
> >> >>>
> >> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman
> >> >>> <zaki.rahaman@gmail.com> wrote:
> >> >>>> Hi All,
> >> >>>>
> >> >>>> I have the following mini-script running as part of a larger set
> >> >>>> of scripts/workflow... however it seems like pig is dropping
> >> >>>> records, as when I tried running the same thing as a simple
> >> >>>> grep | wc -l I get a completely different result (2500 with Pig
> >> >>>> vs. 3300). The Pig script is as follows:
> >> >>>>
> >> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> >> >>>> (timestamp:chararray,
> >> >>>> ip:chararray,
> >> >>>> userid:chararray,
> >> >>>> dist:chararray,
> >> >>>> clickid:chararray,
> >> >>>> usra:chararray,
> >> >>>> campaign:chararray,
> >> >>>> clickurl:chararray,
> >> >>>> plugin:chararray,
> >> >>>> tab:chararray,
> >> >>>> feature:chararray);
> >> >>>>
> >> >>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
> >> >>>>
> >> >>>> dump B produces the following output:
> >> >>>> 2009-11-19 18:50:46,013 [main] INFO
> >> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> >>>> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> >>>> - Records written : 2502
> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> >>>> - Bytes written : 0
> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> >>>> - Success!
> >> >>>>
> >> >>>>
> >> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
> >> >>>> http://www.amazon | wc -l
> >> >>>>
> >> >>>> Both sets of inputs are the same files... and I'm not sure where
> >> >>>> the discrepancy is coming from. Any help would be greatly
> >> >>>> appreciated.
> >> >>>>
> >> >>>> --
> >> >>>> Zaki Rahaman
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Zaki Rahaman
> >> >>
> >>
> >>
> >
> >
> > --
> > Zaki Rahaman
> >
>

