hadoop-pig-dev mailing list archives

From Thejas Nair <te...@yahoo-inc.com>
Subject Re: LoadFunc.skipNext() function for faster sampling ?
Date Wed, 04 Nov 2009 00:49:38 GMT
Yes, that should work. I will use InputFormat.getNext from the SampleLoader
to skip the records.
Thanks,
Thejas


On 11/3/09 6:39 PM, "Alan Gates" <gates@yahoo-inc.com> wrote:

> We definitely want to avoid parsing every tuple when sampling.  But do
> we need to implement a special function for it?  Pig will have access
> to the InputFormat instance, correct?  Can it not call
> InputFormat.getNext the desired number of times (which will not parse
> the tuple) and then call LoadFunc.getNext to get the next parsed tuple?
> 
> Alan.
> 
> On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote:
> 
>> In the new implementation of SampleLoader subclasses (used by order-by,
>> skew-join ..) as part of the loader redesign, we are not only reading all
>> the records input but also parsing them as pig tuples.
>> 
>> This is because the SampleLoaders are wrappers around the actual input
>> loaders specified in the query. We can make things much faster by having a
>> skipNext() function (or skipNext(int numSkip) ) which will avoid parsing the
>> record into a pig tuple.
>> LoadFunc could optionally implement this (easy to implement) function (which
>> will be part of an interface) for improving speed of queries such as
>> order-by.
>> 
>> -Thejas
>> 
> 
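
The approach agreed on in this thread (advance the underlying reader the desired number of times without parsing, then parse only the record being sampled) can be sketched roughly as follows. This is a minimal illustration and not Pig or Hadoop code: RawReader, ListReader, parse and sampleNext are hypothetical names invented for the example. In Hadoop's mapreduce API the call that advances to the next raw record lives on RecordReader (nextKeyValue()), so RawReader stands in for that role here, and parse stands in for the tuple construction a LoadFunc would do.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SkipSampler {
    /** Hypothetical stand-in for a raw record reader: advancing does not parse. */
    interface RawReader {
        boolean next();     // move to the next raw record; false at end of input
        String current();   // raw text of the current record
    }

    /** Simple in-memory reader used only for this illustration. */
    static class ListReader implements RawReader {
        private final Iterator<String> it;
        private String cur;

        ListReader(List<String> records) { this.it = records.iterator(); }

        public boolean next() {
            if (!it.hasNext()) return false;
            cur = it.next();
            return true;
        }

        public String current() { return cur; }
    }

    /** Counts how many records were actually parsed. */
    static int parseCount = 0;

    /** Stand-in for the (comparatively expensive) tuple parsing step. */
    static String[] parse(String raw) {
        parseCount++;
        return raw.split(",", -1);
    }

    /**
     * Skips numSkip raw records without parsing them, then parses and
     * returns the following record. Returns null if the input runs out.
     */
    static String[] sampleNext(RawReader reader, int numSkip) {
        for (int i = 0; i < numSkip; i++) {
            if (!reader.next()) return null;   // advance only, no parsing
        }
        if (!reader.next()) return null;
        return parse(reader.current());        // parse only the sampled record
    }

    public static void main(String[] args) {
        RawReader reader = new ListReader(
            Arrays.asList("a,1", "b,2", "c,3", "d,4", "e,5"));
        String[] tuple;
        while ((tuple = sampleNext(reader, 1)) != null) {
            System.out.println(Arrays.toString(tuple));
        }
        System.out.println("records parsed: " + parseCount);
    }
}
```

The point of the sketch is that the cost of parse is paid only for sampled records, while skipped records cost just a reader advance, which is the saving a skipNext(int numSkip) hook was proposed to provide.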

