hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1307) More generic and efficient merge method
Date Wed, 12 May 2010 17:48:46 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866655#action_12866655 ]

Ning Zhang commented on HIVE-1307:
----------------------------------

Some design notes:

This task should benefit not only dynamic partition inserts, but any insert that requires
merging (hive.merge.mapfiles=true or hive.merge.mapredfiles=true). The idea is as follows:

The current merge job is a MapReduce job for each partition. The mappers just read
the files and pass the rows along to a single reducer. The reducer is responsible for
consolidating all inputs into a single stream. The extra work at the mapper/reducer boundary
(e.g., copying, shuffling and sorting) is not necessary.

With the CombineHiveInputFormat, the merge job is map-only and it should take care of multiple
partitions. The idea is that one mapper should be generated for each partition. The input
format for that mapper should be CombineHiveInputFormat so that it will read multiple files
and output to one file.  
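
As a rough illustration of the plan above (one map task per partition, each reading many
small files and writing one output), here is a minimal Python sketch. It is not Hive code:
the function names, the example paths, and the in-memory read_file callback are all
hypothetical, and the real implementation would go through CombineHiveInputFormat in Java.

```python
from collections import defaultdict
import posixpath


def plan_merge_tasks(file_paths):
    """Group input files by their parent (partition) directory.

    Each resulting group stands for one map task of the map-only merge job.
    """
    tasks = defaultdict(list)
    for path in file_paths:
        tasks[posixpath.dirname(path)].append(path)
    # Sort files within each partition so the merged output is deterministic.
    return {partition: sorted(files) for partition, files in tasks.items()}


def run_merge_task(files, read_file):
    """Simulate one mapper: read every input file and emit a single output."""
    return b"".join(read_file(f) for f in files)


# Hypothetical example: two partitions, three small files.
contents = {
    "/warehouse/t/ds=1/part-00000": b"a\n",
    "/warehouse/t/ds=1/part-00001": b"b\n",
    "/warehouse/t/ds=2/part-00000": b"c\n",
}
plan = plan_merge_tasks(contents.keys())
merged = {p: run_merge_task(fs, contents.__getitem__) for p, fs in plan.items()}
```

No reducer appears anywhere in the sketch, which is the point of the design: the copy,
shuffle and sort steps of the current MR merge job have no counterpart here.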

Since CombineHiveInputFormat depends on a Hadoop 0.20 feature, the new merge job relies on
the shim layer to decide whether to use the new merge job (M) or the old one (MR). With this
restriction, merging after a dynamic partition insert only works on Hadoop 0.20.
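
The version check could look roughly like the following Python sketch. This is only an
illustration of the decision, not the Hive shim API: choose_merge_job and parse_version are
hypothetical names, and treating every version from 0.20 onward as supporting the map-only
job is an assumption made for the example.

```python
def parse_version(version):
    """Return (major, minor) from a Hadoop version string such as '0.20.2'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def choose_merge_job(hadoop_version):
    """Pick the merge job type: 'M' (map-only) or 'MR' (old MapReduce merge).

    Assumption: the combine-style input format is available from 0.20 onward.
    """
    return "M" if parse_version(hadoop_version) >= (0, 20) else "MR"
```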

> More generic and efficient merge method
> ---------------------------------------
>
>                 Key: HIVE-1307
>                 URL: https://issues.apache.org/jira/browse/HIVE-1307
>             Project: Hadoop Hive
>          Issue Type: New Feature
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0
>
>
> Currently, if hive.merge.mapfiles/mapredfiles=true, a new MapReduce job is created to read
> the input files and output to one reducer for merging. This MR job is created at compile
> time, with one MR job per partition. In the dynamic partition case, multiple partitions
> could be created at execution time, so generating the merge MR job at compile time is
> impossible.
>
> We should generalize the merge framework to allow multiple partitions. Most of the time
> a map-only job should be sufficient if we use CombineHiveInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

