pig-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigMultiQueryPerformanceSpecification" by RichardDing
Date Tue, 05 May 2009 22:47:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by RichardDing:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
     * Creating a split operator in the map or reduce and setting the splittee plans as nested
plans of the split
     * If it needs to merge combiners it will introduce a Demux operator to route the input
from mixed split branches in the mapper to the right combine plan. The separate combiner plans
are the nested plans of the Demux operator   
     * If it needs to merge reduce plans, it will do so using the Demux operator the same
way the combiner is merged.
-    * In the cases where some splittees have combiners and some do not have combiners, the
optimizer chooses either the subset of splittees with combiners or the subset of splittees
without combiners--depending on which subset is larger--and merges these splittees into the
splitter.
+    * In the cases where some splittees have combiners and some do not have combiners, the
optimizer chooses either the subset of splittees with combiners or the subset of splittees
without combiners--depending on which subset is larger--and merges the splittees in the chosen
subset into the splitter. The other subset--if not empty--will not be merged.
  
  Note: As an end result this merging will result in Split or Demux operators with multiple
stores tucked away in their nested plans.
  
@@ -638, +638 @@

  
  The demux operator is used in combiners and reducers where the input is a mix of different
split plans of the mapper. The outputs of split plans are indexed and based on the index,
the demux operator will decide which of it's nested plans a record belongs to and then attach
it to that particular plan. 
  
+ More precisely, these are the steps to merge a map-reduce splittee into the splitter:
  
+    1. Add the map plan of the splittee to the inner plan list of the split operator. 
+    2. Set the index on the leaf operator of the map plan based on the order this map plan
on the inner plan list.
+    3. Add the reduce plan of the splittee to the inner plan list of the demux operator in
the same order as the corresponding map plan.
+    4. The outputs of merged map plan of the splitter are indexed key/value pairs and are
sent to the reduce tasks.
+    5. The demux operator extracts the index from the key/values it receives and attaches
them to the corresponding reduce plan in its inner plan list.
+    6. The chosen reduce plan consumes the key/values data.
+  
+ [[Anchor(PartitionScheme)]]
+ ===== Partition Scheme =====
+ 
+ What is the parallelism (the number of reduce tasks requested) of the merged splitter job?
How do we partition the keys of the merged inner plans?
+ 
+ After considering several partition schemes, we settled on this one:
+ 
+    * The parallelism of the merged splitter job is the maximum of the parallelisms of all
splittee jobs.
+    * The keys from inner plans are partitioned into all the buckets via the default hash
partitioner.
+ 
+ To avoid the key collision of different inner plans with this scheme, the PigNullableWritable
class is modified to take into account of the indexes when two keys are compared (hashed).

+   
  [[Anchor(Local_Execution_engine)]]
  ==== Local Execution Engine ====
  The local engine has not changed as much as the map reduce engine. The local engine executes
the physical plan directly. The main changes were:

Mime
View raw message