pig-dev mailing list archives

From "Alex Bain" <ambclo...@gmail.com>
Subject Review Request 16309: PIG-3629 Implement STREAM operator in Tez
Date Wed, 18 Dec 2013 03:53:28 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16309/
-----------------------------------------------------------

Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Bugs: PIG-3629
    https://issues.apache.org/jira/browse/PIG-3629


Repository: pig-git


Description
-------

Implement STREAM operator in Tez - https://issues.apache.org/jira/browse/PIG-3629

In this patch, I do not add resources to pig-misc.jar; I add them individually instead. See my
discussion post: https://groups.google.com/forum/#!topic/pig-on-tez/8S80GMKhMaU

Basic Changes:
-Run the PhyPlanSetter and EndOfAllInputSetter to set the parent plan and the end-of-all-input
flags necessary for STREAM, just like in MR Pig.
-Add a map to hold plan-specific extra local resources in TezOperPlan.java. These resources
can come either from the user's directory (e.g. SHIP('/home/abain/foo')) or from HDFS (e.g.
CACHE('/user/abain/bar')).
-Add the new class TezPOStreamVisitor that assembles all the plan-specific local resources
that get added in TezOperPlan.java.

Resource Manager Changes:
-TezResourceManager resources were previously a map of java.net.URL -> Path in HDFS. Previously,
the URLs were all local files, e.g. file://home/abain/pig-withoutHadoop.jar. However, the
CACHE statement requires that resources already present in HDFS can be added as local
resources. Unfortunately, java.net.URL does not support hdfs:// URLs, so I changed this data
structure to use a YARN URL instead. I also added methods to the ResourceManager to distinguish
whether you are adding a local resource or a resource already present in HDFS.
-CACHE also supports URLs with fragments at the end, which become a "shortcut" to the name,
e.g. CACHE(/input/big-data-name.gz#data.gz). I changed the resource manager to look for a
fragment and use that as the resource name (if the fragment exists). This results in the symlink
to the resource being created with the fragment name, which is what we want.
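
To illustrate the fragment rule above, here is a minimal standalone sketch. It is not the code
from the patch: the class name ResourceNameSketch, the method resourceName, and the host
namenode:8020 are hypothetical, and it uses java.net.URI from the JDK rather than the YARN URL
type mentioned above.

```java
import java.net.URI;

// Hypothetical sketch, not the TezResourceManager code from this patch.
final class ResourceNameSketch {

    // Returns the fragment if one is present, otherwise the last path segment.
    static String resourceName(String location) {
        // java.net.URI only parses the string, so an hdfs:// scheme is accepted.
        URI uri = URI.create(location);
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        if (path == null) {
            // Opaque URIs have no path; fall back to the raw string.
            return location;
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Fragment present: the symlink name would be "data.gz".
        System.out.println(resourceName("/input/big-data-name.gz#data.gz"));
        // No fragment: fall back to the file name "bar" (host and port are made up).
        System.out.println(resourceName("hdfs://namenode:8020/user/abain/bar"));
    }
}
```

With a rule like this, a streaming command can refer to the cached file by the short name
data.gz regardless of the file's name in HDFS.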


Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 37566ab
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 7a1736a
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 2584501
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 96ccdde
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezJobControlCompiler.java 135b933
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperPlan.java 0cc8e17
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPOStreamVisitor.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPlanContainer.java 673fd70
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezResourceManager.java 0fd7575

Diff: https://reviews.apache.org/r/16309/diff/


Testing
-------

Added a unit test to TestTezCompiler.java
Added an e2e test to tez.conf

ant test-tez passes
ant test-e2e-tez has three failures - I am investigating whether they are related or perhaps
just transient

Question: There is already a separate suite of STREAM tests in streaming.conf. Maybe I should
remove my e2e test and we should add streaming.conf as a dependency to the test-e2e-tez target?
I haven't tried to run streaming.conf yet.


Thanks,

Alex Bain

