hadoop-mapreduce-issues mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-801) MAPREDUCE framework should issue warning with too many locations for a split
Date Tue, 28 Jul 2009 11:03:15 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736031#action_12736031 ]

Hong Tang commented on MAPREDUCE-801:

@vinod, PIG has its own input handling system (Slicer ~= InputFormat, Slice ~= InputSplit).
When PIG uses MapReduce as the backend, the default Slicer (PigSlicer) creates one slice for
each DFS block. However, a bug in the code makes it return the aggregate of all hosts for
all blocks of the file (ignoring the offset and length of the slice) instead of the hosts for
that particular block. Looking at the patch attached to PIG-878 is probably the quickest way
to understand the problem.
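
To illustrate the difference, here is a minimal sketch in plain Java. It is not the PigSlicer
code and not the PIG-878 patch; the Block class, both method names, and the host names are
hypothetical stand-ins. It only contrasts "hosts of every block of the file" with "hosts of
the blocks that overlap the slice".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SliceLocations {
    // Hypothetical stand-in for a DFS block descriptor (not a Hadoop or Pig class).
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    // The kind of bug described above: the slice range is ignored and the
    // union of hosts over every block of the file is returned.
    static List<String> hostsForWholeFile(List<Block> blocks) {
        Set<String> hosts = new LinkedHashSet<>();
        for (Block b : blocks) {
            hosts.addAll(b.hosts);
        }
        return new ArrayList<>(hosts);
    }

    // Intended behaviour: keep only hosts of blocks that overlap the
    // half-open range [sliceOffset, sliceOffset + sliceLength).
    static List<String> hostsForSlice(List<Block> blocks, long sliceOffset, long sliceLength) {
        long sliceEnd = sliceOffset + sliceLength;
        Set<String> hosts = new LinkedHashSet<>();
        for (Block b : blocks) {
            long blockEnd = b.offset + b.length;
            if (b.offset < sliceEnd && sliceOffset < blockEnd) {
                hosts.addAll(b.hosts);
            }
        }
        return new ArrayList<>(hosts);
    }

    public static void main(String[] args) {
        // Three blocks with three replicas each (made-up host names).
        List<Block> blocks = Arrays.asList(
            new Block(0, 64, "h1", "h2", "h3"),
            new Block(64, 64, "h4", "h5", "h6"),
            new Block(128, 64, "h7", "h8", "h9"));
        System.out.println("whole file: " + hostsForWholeFile(blocks));
        System.out.println("slice 2:    " + hostsForSlice(blocks, 64, 64));
    }
}
```

With one slice per block, the first method makes the number of reported locations grow with
the file size, while the second stays at roughly the replication factor.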

I can imagine similar problems happening to non-expert users who write their own input
formats. You may argue that (1) only affects the user directly. However, we share
the same cluster with many users, and poor locality could thrash the whole cluster and thus
affect all users' jobs indirectly. The proposal does not really solve the problem; it
merely makes sure the problem does not go unnoticed.
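
As a rough sketch of what such a warning could look like (plain Java, not the actual
framework code): the class, the method names, and the threshold of 10 are all assumptions
made for illustration. This thread does not fix a limit or say where the check should live.

```java
final class SplitLocationCheck {
    // Hypothetical default limit; the issue does not specify a value.
    static final int DEFAULT_MAX_LOCATIONS = 10;

    // True when a split reports more locations than the given limit.
    static boolean hasTooManyLocations(String[] locations, int maxLocations) {
        return locations != null && locations.length > maxLocations;
    }

    // Returns a warning message for a suspicious split, or null when the
    // number of locations is within the limit.
    static String warningFor(String splitName, String[] locations, int maxLocations) {
        if (!hasTooManyLocations(locations, maxLocations)) {
            return null;
        }
        return "Split " + splitName + " reports " + locations.length
            + " locations (limit " + maxLocations + "); its input format may be"
            + " returning hosts for the whole file instead of for the split";
    }

    public static void main(String[] args) {
        // 27 made-up hosts, as a buggy input format might report for one split.
        String[] suspicious = new String[27];
        for (int i = 0; i < suspicious.length; i++) {
            suspicious[i] = "host" + i;
        }
        String warning = warningFor("split-0", suspicious, DEFAULT_MAX_LOCATIONS);
        if (warning != null) {
            System.err.println("WARN: " + warning);
        }
    }
}
```

The check only reports; it leaves the locations untouched, which matches the intent of making
the problem visible rather than fixing it.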

For (2), yes, we may choose to use a fraction of the locations, but do we need to worry that
the scheduler may try to schedule tasks on that subset of hosts and thus make the
job run much slower (than not specifying locations at all)?

> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>                 Key: MAPREDUCE-801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
> A customized input format may be buggy and report misleading locations through its input splits;
an example is PIG-878. When an input split returns too many locations, it not
only artificially inflates the percentage of data-local or rack-local maps, but also forces
the scheduler to use more memory and work harder to assign tasks.

