hadoop-mapreduce-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-801) MAPREDUCE framework should issue warning with too many locations for a split
Date Thu, 30 Jul 2009 21:31:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737301#action_12737301 ]

Doug Cutting commented on MAPREDUCE-801:

> The #locations per split to keep should probably be a cluster-wide config limit?

Sounds reasonable.

> Should we pick first n locations or pick randomly?

That depends on whether locations are ordered.  For example, one might list locations which
have 90% of the data in a split ahead of locations that only have 20%.  (Think map-side-join,
where a split might contain segments of multiple files.)  If that scenario sounds plausible,
then we should pick the N first, no?
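
A minimal sketch of the "pick the N first" option, assuming an InputFormat lists
its locations in descending order of how much of the split's data each one holds.
The class name, method name, and host names are hypothetical and are not part of
the Hadoop API; this only illustrates order-preserving truncation.

```java
import java.util.Arrays;

public class SplitLocationTruncator {

    // Hypothetical helper: keep at most maxLocations entries, preserving order,
    // so the locations listed first (assumed to hold the most data) survive.
    static String[] firstN(String[] locations, int maxLocations) {
        if (maxLocations < 0) {
            throw new IllegalArgumentException("maxLocations must be >= 0");
        }
        if (locations.length <= maxLocations) {
            return locations.clone();
        }
        return Arrays.copyOf(locations, maxLocations);
    }

    public static void main(String[] args) {
        // Made-up host names, ordered from most to least local data.
        String[] locations = {"host-a", "host-b", "host-c", "host-d"};
        System.out.println(Arrays.toString(firstN(locations, 2)));
    }
}
```

Picking randomly instead would discard the ordering information, which only
matters if InputFormats actually order their locations this way.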

> We should do truncation on both the JobClient and JobTracker to be wary of DOS if a malicious
> client submits too many locations per split...

This still feels like overkill to me.  It reminds me of TSA policies about shoes and
liquids.  There are not that many InputFormat implementations.  We should seek to make it
easy to debug them generally rather than guard against a particular bug seen once.  To prevent
DOS, we could put an overall limit on the number of locations per job, or even the size of
the splits file, so that the JT doesn't run out of memory trying to process a job.  We should
make it easier to notice when job locality is poor.  But there are so many ways folks can
write poorly-performing applications and frameworks that spending a lot of time guarding against
this particular one seems a poor investment.
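
A rough sketch of the simpler guard suggested above: one overall limit on the
number of locations per job rather than per-split truncation.  Everything here
(the class, the method names, the limit value) is hypothetical and is not an
existing Hadoop API or configuration key; each split is modeled as just its
array of location strings.

```java
import java.util.Arrays;
import java.util.List;

public class JobLocationBudget {

    // Sum the number of locations reported across all splits of a job.
    static long totalLocations(List<String[]> splitLocations) {
        long total = 0;
        for (String[] locations : splitLocations) {
            total += locations.length;
        }
        return total;
    }

    // Hypothetical check: true if the job stays within the overall limit.
    static boolean withinLimit(List<String[]> splitLocations, long maxLocationsPerJob) {
        return totalLocations(splitLocations) <= maxLocationsPerJob;
    }

    public static void main(String[] args) {
        // Two made-up splits with three locations in total.
        List<String[]> splits = Arrays.asList(
                new String[] {"host-a", "host-b"},
                new String[] {"host-c"});
        long limit = 2; // assumed limit, for illustration only
        if (!withinLimit(splits, limit)) {
            System.err.println("Job lists " + totalLocations(splits)
                    + " split locations, over the limit of " + limit);
        }
    }
}
```

A check like this bounds the memory a job can demand without needing to decide
which locations of a given split to drop.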

Also, truncation does nothing to stop, e.g., an application that simply lists the wrong
locations.  Truncation would not help locality in the case of the PIG bug, since those locations
were mostly wrong.  The only thing truncation does is protect against a job using too many
resources in the JT, and there are simpler ways to protect against that.

So, sure, if we're checking it once, why not twice!

> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>                 Key: MAPREDUCE-801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
> Customized input-format may be buggy and report misleading locations through input-split,
> an example of which is PIG-878. When an input split returns too many locations, it would not
> only artificially inflate the percentage of data local or rack local maps, but also force
> scheduler to use more memory and work harder to conduct task assignment.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
