hadoop-hive-dev mailing list archives

From "Liyin Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1641) add map joined table to distributed cache
Date Wed, 06 Oct 2010 01:30:35 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918374#action_12918374 ]

Liyin Tang commented on HIVE-1641:
----------------------------------

The previous assumption is not always true. There may be multiple map join operators
in one local work.

No matter how many map join operators there are in one MapRedTask, each map join operator
has one parent operator from the big table branch and the other parent operators from the
small table branches.
The big table branch is left unchanged.

For each small table branch, create a new JDBMSinkOperator to replace the current
MapJoinOperator in that branch.
After this change, the local work shares no operators with the MapredWork.
Also create a JDBMDummyOperator to replace the original parent operator of the MapJoinOperator.

This JDBMDummyOperator helps the MapJoinOperator generate the correct input object inspector
at run time.
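
As a rough illustration of the plan rewrite described above, the following Java sketch models
operators as plain graph nodes. It is not Hive code: the Op class and the connect and
splitSmallBranch helpers are hypothetical names, and placing the dummy operator in the same
parent slot that the small table branch used to occupy is an assumption made here, not
something the patch is known to do.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the plan rewrite. Op, connect and splitSmallBranch are
// hypothetical names for illustration only; they are not Hive classes.
final class PlanSplitSketch {

    static final class Op {
        final String name;
        final List<Op> parents = new ArrayList<>();
        final List<Op> children = new ArrayList<>();

        Op(String name) {
            this.name = name;
        }
    }

    // Adds a parent -> child edge in both directions.
    static void connect(Op parent, Op child) {
        parent.children.add(child);
        child.parents.add(parent);
    }

    // Rewrites one small table branch and returns the new sink operator.
    static Op splitSmallBranch(Op smallParent, Op mapJoin) {
        int slot = mapJoin.parents.indexOf(smallParent);
        if (slot < 0) {
            throw new IllegalArgumentException(smallParent.name + " is not a parent of " + mapJoin.name);
        }
        Op sink = new Op("JDBMSink");
        Op dummy = new Op("JDBMDummy");

        // Local side: the small table branch now ends in the sink, not the map join.
        smallParent.children.set(smallParent.children.indexOf(mapJoin), sink);
        sink.parents.add(smallParent);

        // Map side: the dummy takes over the parent slot (assumption: same position).
        mapJoin.parents.set(slot, dummy);
        dummy.children.add(mapJoin);
        return sink;
    }

    public static void main(String[] args) {
        Op big = new Op("BigTableParent");
        Op small = new Op("SmallTableParent");
        Op mapJoin = new Op("MapJoin");
        connect(big, mapJoin);
        connect(small, mapJoin);

        Op sink = splitSmallBranch(small, mapJoin);
        System.out.println(small.name + " -> " + sink.name);
        for (Op parent : mapJoin.parents) {
            System.out.println(parent.name + " -> " + mapJoin.name);
        }
    }
}
```

After the call, the small table branch and the map join no longer reference each other, which
corresponds to the local work and the MapredWork sharing no operators.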

At execution time, the LocalTask processes all of the local work and generates one JDBM
file for each small table.
When the MapRedTask processes the first row for a MapJoinOperator, it loads the
JDBM file to build the in-memory hash table.
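
The two phases above can be sketched in plain Java. This is only an illustration under stated
assumptions: standard Java serialization stands in for JDBM, the small table is reduced to a
String-to-String map, the join is treated as an inner join, and SmallTableDumpSketch,
dumpSmallTable and LazyJoiner are hypothetical names that do not appear in the patch.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustration only: Java serialization replaces JDBM, and every name here
// is hypothetical rather than taken from the Hive patch.
final class SmallTableDumpSketch {

    // Phase 1 (the LocalTask role): write the small table to a temporary file.
    static Path dumpSmallTable(Map<String, String> smallTable) {
        try {
            Path file = Files.createTempFile("small-table", ".bin");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(new HashMap<>(smallTable));
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Phase 2 (the map join role): load the file when the first row arrives.
    static final class LazyJoiner {
        private final Path file;
        private Map<String, String> table; // stays null until the first row
        int loadCount = 0;

        LazyJoiner(Path file) {
            this.file = file;
        }

        // Returns "key,bigValue,smallValue" on a match, or null when the key is absent.
        String process(String key, String bigValue) {
            if (table == null) {
                table = load();
                loadCount++;
            }
            String smallValue = table.get(key);
            return smallValue == null ? null : key + "," + bigValue + "," + smallValue;
        }

        @SuppressWarnings("unchecked")
        private Map<String, String> load() {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                return (Map<String, String>) in.readObject();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        Path file = dumpSmallTable(Map.of("1", "small-a", "2", "small-b"));
        LazyJoiner joiner = new LazyJoiner(file);
        System.out.println(joiner.process("1", "big-x"));
        System.out.println(joiner.process("3", "big-y"));
        System.out.println("loads: " + joiner.loadCount);
    }
}
```

The file is read once per joiner, on the first call to process, so a task that never receives
a row never pays the loading cost.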

In local mode, the JDBM files are simply stored in a local directory. Otherwise, the JDBM
files are added to the Distributed Cache.

This patch has only been tested in local mode. I will submit another patch after testing on
a cluster.


> add map joined table to distributed cache
> -----------------------------------------
>
>                 Key: HIVE-1641
>                 URL: https://issues.apache.org/jira/browse/HIVE-1641
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Liyin Tang
>             Fix For: 0.7.0
>
>
> Currently, the mappers directly read the map-joined table from HDFS, which makes it difficult to scale.
> We end up getting lots of timeouts once the number of mappers are beyond a few thousand, due to
> concurrent mappers.
> It would be good idea to put the mapped file into distributed cache and read from there instead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

