pig-dev mailing list archives

From "Cheolsoo Park" <piaozhe...@gmail.com>
Subject Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)
Date Fri, 03 Jan 2014 01:18:27 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31099
-----------------------------------------------------------


I have one last comment below. Other than that, everything looks good.

Also, can you document this? I think it's worth mentioning in the "Performance and Efficiency"
section of the manual. You can post a doc patch in a separate JIRA if you'd like.


/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59452>

    This won't work if the temporary file storage is not InterStorage. The temporary file storage
    can be any one of Inter, TFile, or SequenceFile.
    
    See here:
    https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347
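
To illustrate the concern, here is a minimal standalone sketch of how the temporary file storage might be chosen from the job properties. The class, the enum, and the two property names are assumptions made for illustration and are not copied from Utils.java; please check them against the linked code.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical stand-in for the storage choice; not Pig's actual code.
class TmpStorageSketch {
    enum TmpStorage { INTER, TFILE, SEQFILE }

    // Assumed property names; verify against PigConfiguration.
    static final String COMPRESSION = "pig.tmpfilecompression";
    static final String COMPRESSION_STORAGE = "pig.tmpfilecompression.storage";

    // Returns INTER when compression is off, otherwise the configured storage.
    static TmpStorage choose(Properties props) {
        boolean compressed =
                Boolean.parseBoolean(props.getProperty(COMPRESSION, "false"));
        if (!compressed) {
            return TmpStorage.INTER;
        }
        String kind = props.getProperty(COMPRESSION_STORAGE, "tfile")
                .toLowerCase(Locale.ROOT);
        switch (kind) {
            case "tfile":
                return TmpStorage.TFILE;
            case "seqfile":
                return TmpStorage.SEQFILE;
            default:
                throw new IllegalArgumentException(
                        "Unsupported temp file storage: " + kind);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(COMPRESSION, "true");
        props.setProperty(COMPRESSION_STORAGE, "seqfile");
        System.out.println(choose(props)); // prints SEQFILE
    }
}
```

Code that reads the intermediate data would then need to handle all three results rather than assume the INTER case.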
    


- Cheolsoo Park


On Jan. 2, 2014, 2:05 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Jan. 2, 2014, 2:05 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to read data directly from HDFS instead
> of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch).
> This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks in
> when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested)
>     FOREACH with expression operators, custom UDFs, etc.
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
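
Read as a single predicate, the conditions above might look like the following minimal Java sketch. The class, the operator enum, and the method signature are hypothetical and are not taken from the patch; LOAD is included in the allowed set only on the assumption that every script starts with a loader.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the listed conditions; not the FetchOptimizer code.
class FetchEligibilitySketch {
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, SPLIT, JOIN, GROUP, ORDER }

    // Operators that can run without an MR job, per the list above.
    static final Set<Op> FETCHABLE = EnumSet.of(
            Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH);

    static boolean isFetchable(List<Op> ops,
                               boolean hasScalarAlias,
                               boolean usesSampleLoader,
                               int leafCount,
                               boolean endsInDump) {
        return FETCHABLE.containsAll(ops)   // only simple operators, no SPLIT
                && !hasScalarAlias          // no scalar aliases
                && !usesSampleLoader        // no SampleLoader
                && leafCount == 1           // single leaf job
                && endsInDump;              // DUMP, not STORE
    }

    public static void main(String[] args) {
        List<Op> simple = Arrays.asList(Op.LOAD, Op.FILTER, Op.LIMIT);
        List<Op> grouped = Arrays.asList(Op.LOAD, Op.GROUP, Op.FOREACH);
        System.out.println(isFetchable(simple, false, false, 1, true));  // prints true
        System.out.println(isFetchable(grouped, false, false, 1, true)); // prints false
    }
}
```
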
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization"
> is for launching small/simple scripts during development, rather than for querying and filtering
> a large number of rows on the client machine. However, a threshold could be given on the input
> size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's
> 'hive.fetch.task.conversion.threshold' does (perhaps through Pig's LoadMetadata#getStatistics?).
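
The threshold idea could be sketched roughly as below. This is only an illustration under stated assumptions: the class, the method, and the UNKNOWN_SIZE sentinel are hypothetical, and falling back to MR when no size estimate is available is a design choice made here, not something the patch or Hive prescribes.

```java
// Hypothetical threshold check; not part of the patch.
class FetchThresholdSketch {
    // Sentinel for "the loader could not estimate the input size" (assumption).
    static final long UNKNOWN_SIZE = -1L;

    // Prefer fetch only when the estimated input is known and small enough.
    static boolean preferFetch(long estimatedInputBytes, long thresholdBytes) {
        if (estimatedInputBytes == UNKNOWN_SIZE) {
            return false; // be conservative: let MR handle inputs of unknown size
        }
        return estimatedInputBytes <= thresholdBytes;
    }

    public static void main(String[] args) {
        long threshold = 64L * 1024 * 1024; // 64 MB, an arbitrary example value
        System.out.println(preferFetch(10L * 1024 * 1024, threshold)); // prints true
        System.out.println(preferFetch(UNKNOWN_SIZE, threshold));      // prints false
    }
}
```
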
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785
>   /trunk/src/org/apache/pig/Main.java 1554785
>   /trunk/src/org/apache/pig/PigConfiguration.java 1554785
>   /trunk/src/org/apache/pig/PigServer.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION
>   /trunk/test/org/apache/pig/test/TestAssert.java 1554785
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1554785
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs
>   wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

