pig-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigExecutionModel" by ChrisOlston
Date Fri, 04 Jan 2008 19:28:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by ChrisOlston:
http://wiki.apache.org/pig/PigExecutionModel

------------------------------------------------------------------------------
- ---+ Pig Execution Model
+ = Pig Execution Model =
  
- ---++ Introduction
+ == Introduction ==
  
  The goal is to decide how to structure operator evaluation pipelines in Pig. The major issues
include whether data flow follows a push model or a pull model, whether an operator evaluation
tree is multi-threaded or single-threaded, and what the API for user-defined functions (UDFs)
looks like.
  
- ---++ Alternative Execution Models
+ == Alternative Execution Models ==
  
  Some possible execution models are:
  
- ---+++ Model 1
+ === Model 1 ===
  
     * One thread.
     * Top-level scheduler round-robins through "leaf" operators.
@@ -18, +18 @@

     * Special "split" operator buffers data that gets fed to multiple operators; at most
one tuple gets buffered at each split point.
     * UDF API: declare zero or one input bag as "streaming"; init() hands it all data except
the streaming bag; next() hands one tuple from streaming bag
  
- ---+++ Model 2
+ === Model 2 ===
  
     * One thread per "leaf" operator.
     * Scheduling done by OS.
@@ -26, +26 @@

     * Split operator may buffer up to K tuples (or B bytes); if an operator tries to read
too far ahead it gets blocked until other operators reading from the same buffer catch up.
     * Deadlock can arise; need to detect it and release it by relaxing the K/B constraint
on one or more of the split buffers.
  
- ---+++ Discussion
+ === Discussion ===
  
- ---++++ Model 1 Drawbacks
+ ==== Model 1 Drawbacks ====
  
     * underutilize multi-core systems? (depends whether the policy is to assign several map/reduce
tasks to a machine)
     * difficult (or impossible?) to support operations that require streaming access to multiple
inputs (e.g., merge join, merge-based set difference, etc. which operate over pre-sorted input
streams)
     * UDF APIs more complex?
  
- ---++++ Model 2 Drawbacks
+ ==== Model 2 Drawbacks ====
  
     * thread synchronization overhead
     * complexity of multi-threaded implementation
  
- ---++++ Related work
+ ==== Related Reading ====
  
-    * Fjords paper [[http://db.lcs.mit.edu/madden/html/madden_fjords.pdf][paper]][[http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf][slides]]
+    * Fjords paper 
+       * paper: http://db.lcs.mit.edu/madden/html/madden_fjords.pdf
+       * slides: http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf
-    * Stream Programming Model / MIT StreamIt [waiting for pointer]
+    * Stream Programming Model / MIT StreamIt 
+       * <i>waiting for pointer</i>
  

Mime
View raw message