pig-dev mailing list archives

From "Rohini Palaniswamy" <rohini.adi...@gmail.com>
Subject Re: Review Request 16309: PIG-3629 Implement STREAM operator in Tez
Date Sun, 16 Feb 2014 23:44:11 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16309/#review34604
-----------------------------------------------------------

Ship it!

- Rohini Palaniswamy


On Jan. 10, 2014, 10:35 p.m., Alex Bain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16309/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2014, 10:35 p.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3629
>     https://issues.apache.org/jira/browse/PIG-3629
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implement STREAM operator in Tez - https://issues.apache.org/jira/browse/PIG-3629
> 
> In this patch, I do not add resources to pig-misc.jar; I just add them individually.
> See my discussion post: https://groups.google.com/forum/#!topic/pig-on-tez/8S80GMKhMaU
> 
> Basic Changes:
> -Run the PhyPlanSetter and EndOfAllInputSetter to set the parent plan and the
> end-of-all-input flags necessary for STREAM, just like in MR Pig.
> -Add a map to hold plan-specific extra local resources in TezOperPlan.java. These resources
> can come either from the user's local directory (e.g. SHIP('/home/abain/foo')) or from HDFS
> (e.g. CACHE('/user/abain/bar')).
> -Add the new class TezPOStreamVisitor, which assembles all the plan-specific local resources
> that get added in TezOperPlan.java.
> 
> Resource Manager Changes:
> -TezResourceManager resources were previously a map of java.net.URL -> Path in HDFS.
> Previously, the URLs were all local files, e.g. file://home/abain/pig-withouthHadoop.jar.
> However, the CACHE statement requires that resources already present in HDFS can be
> added as local resources. Unfortunately, java.net.URL does not support hdfs:// URLs, so I
> changed this data structure to use a YARN URL instead. I also added methods to the resource
> manager to distinguish between adding a local resource and adding a resource that is already
> present in HDFS.
> -CACHE also supports URLs with a fragment at the end, which becomes a "shortcut" for the
> name, e.g. CACHE(/input/big-data-name.gz#data.gz). I changed the resource manager to look
> for a fragment and use it as the resource name (if the fragment exists). This results in
> the symlink to the resource being created with the fragment name, which is what we want.
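
The two resource manager points above can be illustrated with a minimal sketch using only the JDK. It is not the code from the patch: the class and method names are hypothetical, java.net.URI stands in for the YARN URL type, and the host name "namenode" is made up. It shows java.net.URL rejecting an hdfs:// address in a plain JVM (no custom URL stream handler registered), and a fragment, when present, being preferred over the file name as the resource name.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical illustration only; none of these names come from the Pig patch.
class ResourceNameSketch {

    // Returns the fragment if one is present, otherwise the last path segment.
    static String resourceName(String spec) {
        URI uri = URI.create(spec);
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        if (path == null) {
            return spec; // opaque URI: nothing better to offer in this sketch
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // True if java.net.URL can represent the given absolute address.
    // With no handler registered for the scheme, conversion fails.
    static boolean javaNetUrlAccepts(String absoluteSpec) {
        try {
            URI.create(absoluteSpec).toURL();
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(resourceName("/input/big-data-name.gz#data.gz")); // data.gz
        System.out.println(resourceName("hdfs://namenode/user/abain/bar"));  // bar
        System.out.println(javaNetUrlAccepts("hdfs://namenode/user/abain/bar")); // false
        System.out.println(javaNetUrlAccepts("file:///home/abain/foo"));         // true
    }
}
```

In the actual patch the fragment becomes the name of the symlink created for the localized resource; the sketch only covers how such a name could be derived.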
> 
> Race condition:
> -I found a race condition that resulted from reusing the Result object in POSimpleTezLoad.
> There are several possible solutions. After discussing in the newsgroup, we decided to change
> POSimpleTezLoad for now.
> -I also made a small cleanup to PhysicalOperator.java.
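
The hazard of reusing a single result object can be sketched in isolation. The classes below are hypothetical stand-ins, not Pig's Result or POSimpleTezLoad, and the message does not say which fix was applied; returning a fresh object per call is shown only as one possible remedy. The aliasing is demonstrated on one thread to keep the output deterministic. With a second thread holding the earlier reference, the same overwrite becomes a race.

```java
// Hypothetical illustration only; these are not Pig classes.
class ResultReuseSketch {

    // Simplified stand-in for a mutable result holder.
    static final class Result {
        Object value;
    }

    // Hands out the same Result instance on every call, overwriting its value.
    static final class ReusingLoader {
        private final Result shared = new Result();
        private int next = 0;

        Result getNext() {
            shared.value = next++;
            return shared;
        }
    }

    // Allocates a new Result per call, so earlier results stay intact.
    static final class FreshLoader {
        private int next = 0;

        Result getNext() {
            Result r = new Result();
            r.value = next++;
            return r;
        }
    }

    public static void main(String[] args) {
        ReusingLoader reusing = new ReusingLoader();
        Result a = reusing.getNext();
        Result b = reusing.getNext(); // overwrites the object that a points to
        System.out.println(a.value + " " + b.value); // 1 1

        FreshLoader fresh = new FreshLoader();
        Result c = fresh.getNext();
        Result d = fresh.getNext();
        System.out.println(c.value + " " + d.value); // 0 1
    }
}
```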
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezJobControlCompiler.java 28a110a
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java 3e6ec7b
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java 7342dab
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java e28de47
> 
> Diff: https://reviews.apache.org/r/16309/diff/
> 
> 
> Testing
> -------
> 
> Added a unit test to TestTezCompiler.java
> Added a new e2e test to tez.conf with session reuse enabled
> Ported three other e2e tests from streaming.conf to tez.conf to increase coverage
> 
> ant test-tez passes
> ant test-e2e-tez passes
> Manually tested with a large subset of tests from streaming.conf (the ones using features
> currently supported by Pig-on-Tez); they pass
> 
> 
> Thanks,
> 
> Alex Bain
> 
>

