hive-dev mailing list archives

From "Sreekanth Ramakrishnan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1695) MapJoin followed by ReduceSink should be done as single MapReduce Job
Date Mon, 29 Nov 2010 10:22:40 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964676#action_12964676 ]

Sreekanth Ramakrishnan commented on HIVE-1695:
----------------------------------------------

Currently, a query with a MapJoin followed by a ReduceSink is processed in two stages.

Stage-1: MapJoin + Select operator form a single stage. In this stage the plan is
split when the Select operator is encountered immediately after the MapJoin. A FileSinkOperator
is added immediately after the MapJoin and the Select operator is removed from the tree.

Stage-2: MapJoin + ReduceSink processor. In this stage the work is initialized from the previous
stage: the output of the FileSinkOperator is used as input for the current stage, and a
Select operator is added for the columns to be used in the reduce stage, along with
ordering and other information.

In order to collapse the two stages into a single stage we would need to do the following:

After Stage-1 processing is done, i.e. after the NodeProcessor from MapJoinFactory.MapJoin
is run and the next stage NodeProcessor is called, we need to:

# In GenMRRedSink4, access the current MapJoin operator. Remove the FileSinkOperator that
was added to mark the end of the stage.
# Add a Reduce operator to the same MapJoin operator to pass the expression and the sort
order to be used by the reducer.
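
To make the two steps above concrete, here is a rough sketch of the tree rewrite on a toy operator chain. It does not use Hive's classes: Op, buildTwoStagePlan, countJobs, collapse and describe are hypothetical names made up for illustration, and the assumption that an intermediate FileSink always marks a job boundary is a simplification of what the real plan generation does.

```java
import java.util.ArrayList;
import java.util.List;

public class PlanSketch {

    /** Toy operator node for illustration; this is not Hive's Operator class. */
    static final class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name) { this.name = name; }

        /** Appends a child and returns it, so calls can be chained. */
        Op then(Op child) {
            children.add(child);
            return child;
        }
    }

    /** Builds the current shape: the FileSink after the MapJoin ends Stage-1. */
    static Op buildTwoStagePlan() {
        Op root = new Op("TableScan");
        root.then(new Op("MapJoin"))
            .then(new Op("FileSink"))
            .then(new Op("Select"))
            .then(new Op("ReduceSink"));
        return root;
    }

    /** Assumption: every FileSink that still has a child marks a job boundary. */
    static int countJobs(Op root) {
        int jobs = 1;
        for (Op op = root; !op.children.isEmpty(); op = op.children.get(0)) {
            if (op.name.equals("FileSink")) jobs++;
        }
        return jobs;
    }

    /** Removes the stage-marker FileSink after a MapJoin and splices in its child. */
    static void collapse(Op root) {
        for (Op op = root; !op.children.isEmpty(); op = op.children.get(0)) {
            Op next = op.children.get(0);
            boolean stageMarker = next.name.equals("FileSink") && !next.children.isEmpty();
            if (op.name.equals("MapJoin") && stageMarker) {
                op.children.set(0, next.children.get(0));
            }
        }
    }

    /** Renders the (linear) chain as "A -> B -> C". */
    static String describe(Op root) {
        StringBuilder sb = new StringBuilder(root.name);
        for (Op op = root; !op.children.isEmpty(); op = op.children.get(0)) {
            sb.append(" -> ").append(op.children.get(0).name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Op plan = buildTwoStagePlan();
        System.out.println(describe(plan) + " : jobs = " + countJobs(plan));
        collapse(plan);
        System.out.println(describe(plan) + " : jobs = " + countJobs(plan));
    }
}
```

In this toy model the job count drops from 2 to 1 once the intermediate FileSink is removed. The real change would also have to carry the key expressions and sort order into the reduce side, which the sketch does not attempt.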

Thoughts on the above approach? 

Also, would adding the reduce operator at the end of the MapJoin cause any regressions?
Is there a cleaner way of doing the same, i.e. by adding a new rule for processing?

> MapJoin followed by ReduceSink should be done as single MapReduce Job
> ---------------------------------------------------------------------
>
>                 Key: HIVE-1695
>                 URL: https://issues.apache.org/jira/browse/HIVE-1695
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Amareshwari Sriramadasu
>
> Currently MapJoin followed by ReduceSink runs as two MapReduce jobs: one map-only job
> followed by a MapReduce job. They can be combined into a single MapReduce job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

