Date: Wed, 11 Apr 2012 21:01:19 +0000 (UTC)
From: "Alex Levenson (Created) (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Message-ID: <1000595003.14411.1334178079406.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Created] (PIG-2647) Split Combining drops splits with empty getLocations()

Split Combining drops splits with empty getLocations()
------------------------------------------------------

                 Key: PIG-2647
                 URL: https://issues.apache.org/jira/browse/PIG-2647
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Alex Levenson

In org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil#getCombinePigSplits, which is used by PigInputFormat, there is an assumption that every split's getLocations() will return a non-empty array.

If the following criteria are met:
1) Split combining is turned on
2) There is more than one split
3) There is at least one split that is smaller than the maxCombineSplitSize

then splits with empty getLocations() will simply be dropped (ignored) without warning.

The Hadoop API does not specify that every split must return a location, and there are cases where a split may want to return no locations (for example, if the data is not in HDFS, or if the data is a directory full of HDFS files, in which case there is not much to gain from locality).

This happens because getCombinePigSplits scans all splits eligible for combining and creates a map of nodes -> splits, then later iterates through the map (not the splits) to do the combining. A split with no locations is never added under any node, so it never appears in the map and is never combined or emitted.

One solution would be to inject a dummy "empty node" into the map.

Overall, the logic in getCombinePigSplits is very complicated and has a lot of edge cases; it might be worth cleaning up.
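
To illustrate the failure mode and the proposed "empty node" workaround, here is a minimal, self-contained sketch. It is not the Pig code: the class SplitGroupingSketch, the Split stand-in, the NO_LOCATION key, and both helper methods are hypothetical names made up for this example, and the real getCombinePigSplits is considerably more involved. The sketch only models the node -> splits map and what can be reached by iterating it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SplitGroupingSketch {
    // Hypothetical placeholder key for splits that report no locations.
    static final String NO_LOCATION = "<no-location>";

    // Simplified stand-in for an input split: a name plus its host list.
    static final class Split {
        final String name;
        final String[] locations;

        Split(String name, String... locations) {
            this.name = name;
            this.locations = locations;
        }
    }

    // Builds the node -> split-names map. When useDummyNode is false, a split
    // with an empty location array is never inserted under any key.
    static Map<String, List<String>> groupByNode(List<Split> splits, boolean useDummyNode) {
        Map<String, List<String>> byNode = new LinkedHashMap<>();
        for (Split s : splits) {
            String[] locs = s.locations;
            if (locs.length == 0 && useDummyNode) {
                // Assumed fix: file the split under a dummy "empty node".
                locs = new String[] { NO_LOCATION };
            }
            for (String node : locs) {
                byNode.computeIfAbsent(node, k -> new ArrayList<>()).add(s.name);
            }
        }
        return byNode;
    }

    // Counts the distinct splits reachable by iterating the map only,
    // which mirrors combining from the map rather than from the split list.
    static int reachableSplits(Map<String, List<String>> byNode) {
        Set<String> seen = new HashSet<>();
        for (List<String> names : byNode.values()) {
            seen.addAll(names);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("a", "host1", "host2"),
            new Split("b"),                    // empty getLocations()
            new Split("c", "host2"));

        // Without the dummy node, split "b" is unreachable from the map.
        System.out.println(reachableSplits(groupByNode(splits, false))); // 2
        // With the dummy node, all three splits are reachable.
        System.out.println(reachableSplits(groupByNode(splits, true)));  // 3
    }
}
```

Whether a dummy node is the right fix for Pig depends on how the combining loop treats node keys afterwards (for example, when it picks locations for the combined split), so this should be read as a model of the bug rather than a patch.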