From: "Jothi Padmanabhan (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Thu, 30 Oct 2008 03:01:44 -0700 (PDT)
Message-ID: <647964165.1225360904413.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-3293) When an input split spans cross block boundary, the split location should be the host having most of bytes on it.
In-Reply-To: <712354443.1208819722876.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643792#action_12643792 ]

Jothi Padmanabhan commented on HADOOP-3293:
-------------------------------------------

The reason for this is that FileInputFormat.getBlockIndex() returns the block index of the starting block for the given offset. Instead, it should identify all the blocks that the split spans and then choose the block that contributes the most data to the split. We could use the following approach:

{code}
// Calculate the number of blocks the split spans
if (numBlocks == 1)
  return startIndex;
else if (numBlocks == 2)
  return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
else
  return startIndex + 1;
{code}

The rationale is that if the split spans more than two blocks, block 2 (startIndex + 1) is guaranteed to contribute its entire block length to the split, so, given a uniform block size, no other block can contribute more.

Note that we cannot pick the block index based on the amount of data contributed by each individual host, because of the replication factor. For example, consider the following case (assume dfs block size = 100):

Block 1 contributes 20 bytes and its hosts are A, B, C
Block 2 contributes 100 bytes and its hosts are A, D, E
Block 3 contributes 10 bytes and its hosts are D, E, F

If we aggregate on a per-host basis, host A, having contributed 120 bytes, would be the ideal choice. However, if we choose Block 1 as the index to be returned, even hosts B and C would be treated as data local, although each holds only 20 bytes of the split, which is suboptimal.

Thoughts?

> When an input split spans cross block boundary, the split location should be the host having most of bytes on it.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3293
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3293
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Jothi Padmanabhan
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
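
For reference, the block-selection rule proposed in the comment could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the actual patch: the class `SplitBlockChooser`, the method `chooseBlockIndex`, and its parameters are hypothetical names, the block size is assumed to be uniform, and block indices are zero-based. It does not use Hadoop's `BlockLocation` or the real `getBlockIndex()` signature.

```java
// Hypothetical helper for illustration only; not part of Hadoop's FileInputFormat.
final class SplitBlockChooser {

    private SplitBlockChooser() {}

    /**
     * Returns the zero-based index of the block that contributes the most
     * bytes to the split [splitStart, splitStart + splitLength).
     * Assumes every block spanned by the split has the same blockSize.
     */
    static int chooseBlockIndex(long blockSize, long splitStart, long splitLength) {
        if (blockSize <= 0 || splitStart < 0 || splitLength <= 0) {
            throw new IllegalArgumentException("blockSize and splitLength must be > 0, splitStart >= 0");
        }

        long splitEnd = splitStart + splitLength;            // exclusive end offset
        int startIndex = (int) (splitStart / blockSize);     // block holding the first byte
        int endIndex = (int) ((splitEnd - 1) / blockSize);   // block holding the last byte
        int numBlocks = endIndex - startIndex + 1;           // number of blocks the split spans

        if (numBlocks == 1) {
            return startIndex;
        }
        if (numBlocks == 2) {
            long bytesInFirstBlock = (startIndex + 1) * blockSize - splitStart;
            long bytesInSecondBlock = splitLength - bytesInFirstBlock;
            // On a tie the second block is chosen, as in the proposed pseudocode.
            return bytesInFirstBlock > bytesInSecondBlock ? startIndex : startIndex + 1;
        }
        // Three or more blocks: the second block is covered entirely by the split.
        return startIndex + 1;
    }
}
```

With the example from the comment (block size 100, a split starting at offset 80 with length 130), `SplitBlockChooser.chooseBlockIndex(100, 80, 130)` returns 1, the zero-based index of "Block 2", so hosts A, D and E would be reported as the split locations.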