Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <647964165.1225360904413.JavaMail.jira@brutus>
Date: Thu, 30 Oct 2008 03:01:44 -0700 (PDT)
From: "Jothi Padmanabhan (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3293) When an input split spans cross
 block boundary, the split location should be the host having most of bytes
 on it.
In-Reply-To: <712354443.1208819722876.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643792#action_12643792 ] 

Jothi Padmanabhan commented on HADOOP-3293:
-------------------------------------------

The reason for this is that FileInputFormat.getBlockIndex() returns the index of the block that contains the given starting offset. Instead, it should identify all the blocks that the split spans and then choose the block that contributes the most data to the split.

We could use the following approach:

{code}
// Calculate the number of blocks the split spans (numBlocks)
if (numBlocks == 1) {
  return startIndex;
} else if (numBlocks == 2) {
  return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
} else {
  return startIndex + 1;
}
{code}

The rationale is that if the split spans more than two blocks, the second block (startIndex + 1) is guaranteed to contribute its entire block length to the split, so no other block can contribute more than it does.
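
To make the selection rule concrete, here is a minimal, self-contained sketch. It is not the actual FileInputFormat code: the class name SplitBlockChooser, the method chooseBlockIndex, and the plain offset/length arrays are hypothetical stand-ins for the real block metadata, and the blocks are assumed to be contiguous and sorted by offset.

```java
public class SplitBlockChooser {

    /**
     * Returns the index of the block to report for a split, following the
     * one-block / two-block / more-than-two-blocks rule described above.
     * blockOffsets[i] and blockLengths[i] describe block i (hypothetical
     * representation; blocks are assumed contiguous and sorted by offset).
     */
    static int chooseBlockIndex(long[] blockOffsets, long[] blockLengths,
                                long splitStart, long splitLength) {
        long splitEnd = splitStart + splitLength;

        // Find the block that contains the first byte of the split.
        int startIndex = -1;
        for (int i = 0; i < blockOffsets.length; i++) {
            long blockEnd = blockOffsets[i] + blockLengths[i];
            if (blockOffsets[i] <= splitStart && splitStart < blockEnd) {
                startIndex = i;
                break;
            }
        }
        if (startIndex < 0) {
            throw new IllegalArgumentException("Offset " + splitStart + " is outside the file");
        }

        // Case 1: the split ends inside the starting block.
        long firstBlockEnd = blockOffsets[startIndex] + blockLengths[startIndex];
        if (splitEnd <= firstBlockEnd) {
            return startIndex;
        }

        int next = startIndex + 1;
        if (next >= blockOffsets.length) {
            // The split claims bytes past the last block; fall back to the start block.
            return startIndex;
        }

        // Case 2: the split ends inside the second block, so compare contributions.
        long bytesInFirstBlock = firstBlockEnd - splitStart;
        long secondBlockEnd = blockOffsets[next] + blockLengths[next];
        if (splitEnd <= secondBlockEnd) {
            long bytesInSecondBlock = splitEnd - blockOffsets[next];
            return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : next;
        }

        // Case 3: more than two blocks, so the second block is fully covered.
        return next;
    }
}
```

With a block size of 100, a split starting at offset 80 with length 130 covers 20 bytes of block 0, all of block 1, and 10 bytes of block 2, so the sketch returns index 1.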

Note that we cannot pick the block index based on the amount of data contributed by each individual host, because of replication.
For example, consider the following case (assume dfs block size = 100):
Block 1 contributes 20 bytes and its hosts are A, B, C
Block 2 contributes 100 bytes and its hosts are A, D, E
Block 3 contributes 10 bytes and its hosts are D, E, F

If we aggregate on a per-host basis, host A, having contributed 120 bytes, would be the ideal choice. However, if we then return Block 1 as the index because it is on host A, hosts B and C would also be treated as data-local even though they hold only 20 bytes of the split, which is suboptimal.
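
The per-host totals for this example can be checked with a small sketch. The class HostByteTally and the method bytesPerHost are hypothetical names used only for illustration; the inputs are the three blocks listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HostByteTally {

    /**
     * Sums, for every host, the bytes of the split stored on that host.
     * bytesFromBlock[i] is the number of split bytes in block i and
     * blockHosts[i] lists the hosts holding a replica of block i.
     */
    static Map<String, Long> bytesPerHost(long[] bytesFromBlock, String[][] blockHosts) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (int i = 0; i < bytesFromBlock.length; i++) {
            for (String host : blockHosts[i]) {
                totals.merge(host, bytesFromBlock[i], Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        long[] bytesFromBlock = {20, 100, 10};
        String[][] blockHosts = {
            {"A", "B", "C"},   // Block 1
            {"A", "D", "E"},   // Block 2
            {"D", "E", "F"}    // Block 3
        };
        // Prints {A=120, B=20, C=20, D=110, E=110, F=10}
        System.out.println(bytesPerHost(bytesFromBlock, blockHosts));
    }
}
```

Host A has the largest total, but the only blocks on A are Block 1 and Block 2. Returning Block 2, as the block-based rule does, reports hosts A, D and E, each of which holds at least 100 bytes of the split.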
Thoughts?

> When an input split spans cross block boundary, the split location should be the host having most of bytes on it. 
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3293
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3293
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Jothi Padmanabhan
>

