pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Levenson (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (PIG-2647) Split Combining drops splits with empty getLocations()
Date Wed, 11 Apr 2012 21:01:19 GMT
Split Combining drops splits with empty getLocations()

                 Key: PIG-2647
                 URL: https://issues.apache.org/jira/browse/PIG-2647
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Alex Levenson

which is used by PigInputFormat

There is an assumption that every split's getLocations() will return a non-empty array.
If the following criteria are met:
1) Split combining is turned on
2) There is more than one split
3) There is at least one split that is smaller than the maxCombineSplitSize

splits with empty getLocations() will simply be dropped (ignored) without warning.

The hadoop API does not specify that all splits must return a location and there are cases
where a split may want to return no locations (if the data is not in HDFS for example, or
if the data is a directory full of HDFS files in which case there's not much gained by having

This is due to the implementation of org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil#getCombinePigSplits
scans all splits eligible for combining and creates a map of Nodes -> splits, then laster
iterates through the MAP (not the splits) to do the combining.

One solution would be to inject a dummy "empty node" into the map.

Overall the logic in getCombinePigSplits is very complicated and has a lot of edge cases,
it might be worth cleaning up.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message