hadoop-pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
Date Tue, 23 Jun 2009 21:43:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723316#action_12723316 ]

Ashutosh Chauhan commented on PIG-820:
--------------------------------------

Thanks Alan and Pradeep for the review.

I will incorporate the SampleOptimizer changes.
The constructor of RandomSampleLoader can only take String arguments, since it is instantiated
from a FuncSpec on the backend. So I can't change the types of the RandomSampleLoader
constructor arguments. However, instead of a String holding the class name of the loader, the
String form of the loader's FuncSpec can be passed, so that the loader is instantiated with
the correct constructor.
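
To illustrate the idea of passing a spec string instead of a bare class name, here is a
minimal, self-contained sketch. It does not use Pig's FuncSpec class; SpecStringDemo,
DemoLoader, className, ctorArgs and instantiate are all hypothetical names, and the spec
format "ClassName('arg1','arg2')" with single-quoted arguments and no embedded quotes is an
assumption made for the example.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpecStringDemo {

    /** Hypothetical stand-in for a user loader that has a String constructor. */
    public static class DemoLoader {
        public final String delimiter;
        public DemoLoader() { this("\t"); }
        public DemoLoader(String delimiter) { this.delimiter = delimiter; }
    }

    /** Returns the class-name part of a spec such as "pkg.Loader('a','b')". */
    public static String className(String spec) {
        int paren = spec.indexOf('(');
        return (paren < 0 ? spec : spec.substring(0, paren)).trim();
    }

    /** Returns the single-quoted constructor arguments, or an empty array if none. */
    public static String[] ctorArgs(String spec) {
        int open = spec.indexOf('(');
        int close = spec.lastIndexOf(')');
        if (open < 0 || close < open) {
            return new String[0];
        }
        String inner = spec.substring(open + 1, close);
        List<String> args = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = inner.indexOf('\'', from);
            if (start < 0) {
                break;
            }
            int end = inner.indexOf('\'', start + 1);
            if (end < 0) {
                throw new IllegalArgumentException("Unbalanced quote in spec: " + spec);
            }
            args.add(inner.substring(start + 1, end));
            from = end + 1;
        }
        return args.toArray(new String[0]);
    }

    /** Instantiates the class named in the spec via its all-String constructor. */
    public static Object instantiate(String spec) {
        try {
            Class<?> cls = Class.forName(className(spec));
            String[] args = ctorArgs(spec);
            Class<?>[] types = new Class<?>[args.length];
            Arrays.fill(types, String.class);
            Constructor<?> ctor = cls.getConstructor(types);
            return ctor.newInstance((Object[]) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + spec, e);
        }
    }
}
```

With this sketch, a wrapper that receives only the single string
SpecStringDemo.DemoLoader.class.getName() + "(',')" can still construct the wrapped
loader with its "," argument, which a bare class name could not express.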

Will be uploading a new patch soon.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>
> Currently a sampling job requires that data already be stored in BinaryStorage format,
> since RandomSampleLoader extends BinaryStorage.  For order by this has mostly been
> acceptable, because users tend to use order by at the end of their script where other MR
> jobs have already operated on the data and thus it is already being stored in
> BinaryStorage.  For pig scripts that just did an order by, an entire MR job is required to
> read the data and write it out in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this requirement to read
> the entire input and write it back out will not be acceptable.
> Join is often the first operation of a script, and thus is much more likely to trigger
> this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, using the user
> specified loader to read the tuples while handling the skipping between tuples itself.
> This will require the subsumed loader to implement a Samplable Interface, that will look
> something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>     
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped.  The return semantics are
>      * exactly the same as {@link java.io.InputStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>     
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data implemented the
> SamplableLoader interface.  If so, rather than create an initial MR job to do the
> translation it would create the sampling job, having RandomSampleLoader use the user
> specified loader.
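
The check and the delegation described in the quoted issue can be sketched roughly as
follows. This is a toy model and not Pig code: Loader is a hypothetical, stripped-down
stand-in for LoadFunc that yields Strings rather than tuples, LineLoader is an invented
newline-delimited loader, and the realignment step (discarding the partial record after a
skip) is an assumption about how a wrapper might recover a record boundary. Only the
skip and getPosition signatures come from the proposal.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SamplingSketch {

    /** Hypothetical minimal loader: returns the next record, or null at end of input. */
    public interface Loader {
        String getNext() throws IOException;
    }

    /** The two methods proposed in the issue, on top of the toy Loader. */
    public interface SamplableLoader extends Loader {
        long skip(long n) throws IOException;
        long getPosition() throws IOException;
    }

    /** Toy newline-delimited loader over an in-memory byte array. */
    public static class LineLoader implements SamplableLoader {
        private final InputStream in;
        private long pos = 0;

        public LineLoader(byte[] data) { this.in = new ByteArrayInputStream(data); }

        @Override
        public String getNext() throws IOException {
            StringBuilder sb = new StringBuilder();
            boolean sawByte = false;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                sawByte = true;
                if (b == '\n') break;
                sb.append((char) b);
            }
            return sawByte ? sb.toString() : null;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(n); // same return semantics as InputStream.skip
            pos += skipped;
            return skipped;
        }

        @Override
        public long getPosition() { return pos; }
    }

    /** Planner-side check: a translation job is needed only for non-samplable loaders. */
    public static boolean needsTranslationJob(Loader loader) {
        return !(loader instanceof SamplableLoader);
    }

    /** Wrapper-side logic: read a record, skip ahead, realign, and repeat. */
    public static List<String> sample(SamplableLoader loader, long skipBytes) {
        try {
            List<String> out = new ArrayList<>();
            String rec = loader.getNext(); // the first record starts on a boundary
            while (rec != null) {
                out.add(rec);
                if (loader.skip(skipBytes) < skipBytes) break; // ran out of input
                if (loader.getNext() == null) break; // discard the partial record
                rec = loader.getNext();
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The wrapper never parses the input itself; it relies on the wrapped loader for record
boundaries and only decides how far to skip, which is the split of responsibilities the
issue describes.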

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

