hadoop-mapreduce-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6197) Cache MapOutputLocations in ShuffleHandler
Date Thu, 16 Jun 2016 03:57:05 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333020#comment-15333020 ]

Jian He commented on MAPREDUCE-6197:
------------------------------------

LGTM.
One question: how and why did you choose this policy for determining the weight?
{code}
maximumWeight(MAX_WEIGHT).weigher(
    new Weigher<AttemptPathIdentifier, AttemptPathInfo>() {
      @Override
      public int weigh(AttemptPathIdentifier key,
          AttemptPathInfo value) {
        return key.jobId.length() + key.user.length() +
            key.attemptId.length() +
            value.indexPath.toString().length() +
            value.dataPath.toString().length();
      }
    }
)
{code}
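
For context on the question, the weigher quoted above simply sums the character counts of the key and value strings, so the configured maximum weight acts as a budget on total characters cached rather than on entry count. A minimal standalone sketch of that arithmetic follows; it uses plain Java rather than the cache library, and the class name, method names, and sample values are hypothetical, chosen only for illustration.

```java
// Standalone sketch; names and sample values are made up for illustration.
class WeightSketch {
  // Mirrors the quoted weigher: total character count of the key fields
  // and of the two path strings held by the value.
  static int weigh(String jobId, String user, String attemptId,
      String indexPath, String dataPath) {
    return jobId.length() + user.length() + attemptId.length()
        + indexPath.length() + dataPath.length();
  }

  // Rough number of entries that fit under a weight budget, assuming
  // every entry has the same weight.
  static long approxCapacity(long maxWeight, int perEntryWeight) {
    return maxWeight / perEntryWeight;
  }
}
```

Under this policy the weight is a proxy for memory use, not a byte count: it ignores per-object overhead, and longer local directory paths reduce how many entries fit under the same budget.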

> Cache MapOutputLocations in ShuffleHandler
> ------------------------------------------
>
>                 Key: MAPREDUCE-6197
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6197
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Siddharth Seth
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-6197.patch
>
>
> ShuffleHandler currently seems to create a map of mapId - mapInfo (file.out / index information) when it receives a message.
> This should cache map info across requests, so that a scan of all directories is not required for each reducer fetching from the same map.
> Also, the scan for each map output / index file is performed twice per mapId within a request: in populateHeaders, once in the call to getMapOutputInfo and then again directly in the method.
> If a single call contains more than 1000 (the default) mapIds and some of them are not cached in the map, the path constructed for those entries will be invalid. This is highly unlikely to be the case though, until there's proper caching.
> {code}
> MapOutputInfo info = mapOutputInfoMap.get(mapId);
> if (info == null) {
>   info = getMapOutputInfo(outputBasePathStr, mapId, reduceId, user);
> }
> {code}
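
The cross-request caching asked for in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in the attached patch: the class, the field and method names, and the directory layout are all hypothetical, and the map here is unbounded, whereas the patch under review bounds its cache by weight.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names come from ShuffleHandler.
class MapOutputPathCache {
  // Cached {indexPath, dataPath} pairs, shared across requests.
  private final Map<String, String[]> cache = new ConcurrentHashMap<>();
  // Counts how often the expensive lookup actually runs.
  final AtomicInteger scans = new AtomicInteger();

  // Stand-in for the scan of all local directories; the path layout
  // below is invented for the example.
  private String[] scanDirs(String user, String jobId, String mapId) {
    scans.incrementAndGet();
    String base = "/local/usercache/" + user + "/appcache/" + jobId
        + "/output/" + mapId;
    return new String[] { base + "/file.out.index", base + "/file.out" };
  }

  // Returns the cached paths, scanning only on the first request for a key.
  String[] lookup(String user, String jobId, String mapId) {
    String key = user + "/" + jobId + "/" + mapId;
    return cache.computeIfAbsent(key, k -> scanDirs(user, jobId, mapId));
  }
}
```

With this shape, every reducer fetching from the same map after the first reuses the cached entry, and the scan runs once per map output instead of once (or twice) per request.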



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

