hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
Date Thu, 29 Oct 2009 18:39:59 GMT

[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771544#action_12771544 ]

Ning Zhang commented on HIVE-900:
---------------------------------

The essential problem is that too many mappers try to access the same block at the same
time, exceeding the threshold on concurrent requests for a single block. Thus the
BlockMissingException is thrown.

Discussed with Namit and Dhruba offline. Here are the proposed solutions:

1) Make HDFS fault-tolerant to this issue. Dhruba mentioned there is already retry logic
implemented in the DFS client code: if a BlockMissingException is thrown, the client waits
about 400 ms and retries; if exceptions persist, it waits 800 ms, and so on, up to 5
unsuccessful retries. This mechanism works for uncorrelated simultaneous requests for the
same block. In this case, however, almost all the mappers request the same block at the same
time, so their retries also happen at about the same time. It would therefore be better to
introduce a random factor into the wait time. Dhruba will look into the DFS code and work on
that. This will solve a broader class of issues beyond map-side join.
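A minimal sketch of what the randomized backoff could look like (BlockFetcher, fetchBlock,
and the constants here are illustrative stand-ins, not the actual DFSClient code):

    import java.util.Random;

    /** Sketch of retry with randomized exponential backoff. */
    public class BackoffRetry {

        /** Stand-in for the real DFS client exception. */
        static class BlockMissingException extends Exception {
            BlockMissingException(String msg) { super(msg); }
        }

        /** Stand-in for the call that actually reads a block. */
        interface BlockFetcher {
            byte[] fetchBlock() throws BlockMissingException;
        }

        private static final int MAX_RETRIES = 5;     // illustrative limit
        private static final long BASE_WAIT_MS = 400; // 400 ms, then 800 ms, ...
        private static final Random RAND = new Random();

        static byte[] readWithBackoff(BlockFetcher fetcher)
                throws BlockMissingException, InterruptedException {
            long wait = BASE_WAIT_MS;
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                try {
                    return fetcher.fetchBlock();
                } catch (BlockMissingException e) {
                    // Random jitter spreads out retries from mappers that all
                    // failed at the same moment, so they do not collide again.
                    long jitter = (long) (RAND.nextDouble() * wait);
                    Thread.sleep(wait + jitter);
                    wait *= 2; // exponential backoff between attempts
                }
            }
            throw new BlockMissingException("block still missing after retries");
        }
    }

Without the jitter, all mappers that failed together would retry in lockstep and hit the
same request limit again; the random term breaks that correlation.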

2) An orthogonal issue brought up by Namit for map-side join: if there are too many mappers
and each of them reads the same small table, there is a cost of transferring the small file
to all of these mappers. Even when the BlockMissingException is resolved, that cost remains,
and it is proportional to the number of mappers. In this respect it would be better to reduce
the number of mappers, but then each mapper has to process a larger portion of the large
table. So we have to trade off the network cost of the small table against the processing
cost of the large table. Will come up with a heuristic to tune the parameters that decide the
number of mappers for map join; a sketch of the tradeoff follows.
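A back-of-the-envelope sketch of such a heuristic (the cost weights and names here are
hypothetical illustrations, not actual Hive parameters):

    /**
     * Illustrative heuristic: pick the number of mappers m that minimizes
     *   total cost = networkWeight * m * smallTableBytes        (grows with m)
     *              + cpuWeight * largeTableBytes / m            (shrinks with m)
     */
    public class MapJoinMapperHeuristic {
        static int chooseNumMappers(long smallTableBytes, long largeTableBytes,
                                    int maxMappers,
                                    double networkWeight, double cpuWeight) {
            int best = 1;
            double bestCost = Double.MAX_VALUE;
            for (int m = 1; m <= maxMappers; m++) {
                double networkCost = networkWeight * m * smallTableBytes;
                double processingCost = cpuWeight * largeTableBytes / (double) m;
                double total = networkCost + processingCost;
                if (total < bestCost) {
                    bestCost = total;
                    best = m;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // e.g. a 10 MB small table, a 100 GB large table, up to 1000 mappers
            int m = chooseNumMappers(10L << 20, 100L << 30, 1000, 1.0, 1.0);
            System.out.println("suggested number of mappers: " + m);
        }
    }

Since this cost is convex in m, the optimum is roughly
sqrt(cpuWeight * largeTableBytes / (networkWeight * smallTableBytes)); the loop above just
makes the tradeoff explicit.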

> Map-side join failed if there are large number of mappers
> ---------------------------------------------------------
>
>                 Key: HIVE-900
>                 URL: https://issues.apache.org/jira/browse/HIVE-900
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>
> Map-side join is efficient when joining a huge table with a small table: each mapper
> reads the small table into main memory and performs the join locally. However, if too
> many mappers are generated for the map join, a large number of mappers will simultaneously
> send requests to read the same block of the small table. Currently Hadoop has an upper
> limit on the number of requests for the same block (250?). If that limit is reached, a
> BlockMissingException is thrown, causing a lot of mappers to be killed. Retrying won't
> solve but rather worsens the problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

