hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1698) InputSplit.getLocations() semantics is not clear.
Date Wed, 14 Apr 2010 01:03:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856713#action_12856713

Hong Tang commented on MAPREDUCE-1698:

I think we should enforce that InputSplit.getLocations() only includes hosts that contain
at least a certain percentage of local data (and/or local+rack). 

Ideally, I'd like to see an interface as follows:

public static class LocationInfo {
  String host;
  float hostLocalDataPercent;
  float rackLocalDataPercent;

LocationInfo[] getLocations();


And we can introduce two config parameters "mapreduce.split.min-host-local-data-percent" and
"mapreduce.split.min-host-rack-data-percent". And a host will be included iff hostLocalDataPercent
>= %mapreduce.split.min-host-local-data-percent AND (hostLocalDataPercent + rackLocalDataPercent)
>= %mapreduce.split.min-host-rack-data-percent.

There are two benefits of doing so:
- The number of hosts for each split should be much manageable (translating to lower memory
consumption of JT, and simpler scheduling). In the case when most of the data are rack local,
we do not need to list all hosts on the rack.
- JobTracker can quantitatively distinguish cases like (40% host local + 40% rack local) vs
(80% host local).

> InputSplit.getLocations() semantics is not clear.
> -------------------------------------------------
>                 Key: MAPREDUCE-1698
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1698
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Hong Tang
> The semantics of InputSplit.getLocations() is not clear and this has led to sub optimal
implementations and even bugs (PIG-878) in the past.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message