hadoop-common-dev mailing list archives

From "Jothi Padmanabhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3293) When an input split spans cross block boundary, the split location should be the host having most of bytes on it.
Date Thu, 30 Oct 2008 10:01:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643792#action_12643792 ]

Jothi Padmanabhan commented on HADOOP-3293:
-------------------------------------------

The reason for this is that FileInputFormat.getBlockIndex() returns the index of the block
that contains the given starting offset. Instead, it should identify all the blocks that the
split spans and then choose the block that contributes the most data to the split.

We could use the following approach:

{code}
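// numBlocks: the number of blocks the split spans.
// startIndex: the index of the block containing the split's starting offset.
// bytesInFirstBlock / bytesInSecondBlock: the split's bytes that fall in each block.
if (numBlocks == 1) {
  return startIndex;
} else if (numBlocks == 2) {
  return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
} else {
  // Three or more blocks: the second block lies entirely within the split.
  return startIndex + 1;
}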
{code}

The rationale here is that if the split spans more than two blocks, the second block
(startIndex + 1) is guaranteed to contribute its entire block length to the split, so no
other block can contribute more.
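
As a rough illustration of the rule above, here is a minimal standalone sketch in Java. It
is not the Hadoop implementation: the class name SplitBlockChooser, the method
chooseBlockIndex, and its parameters are hypothetical, and it assumes every block has the
same size, so block boundaries fall at multiples of blockSize.

```java
/** Hypothetical sketch of the three-case rule; assumes a uniform block size. */
final class SplitBlockChooser {

  /**
   * Returns the index of the block chosen for the split [offset, offset + length).
   * All names here are illustrative and not part of the Hadoop API.
   */
  static int chooseBlockIndex(long offset, long length, long blockSize) {
    if (offset < 0 || length <= 0 || blockSize <= 0) {
      throw new IllegalArgumentException(
          "offset must be >= 0; length and blockSize must be > 0");
    }
    // Index of the block that contains the first byte of the split.
    int startIndex = (int) (offset / blockSize);
    // Index of the block that contains the last byte of the split.
    int endIndex = (int) ((offset + length - 1) / blockSize);
    int numBlocks = endIndex - startIndex + 1;

    if (numBlocks == 1) {
      return startIndex;
    } else if (numBlocks == 2) {
      // Bytes of the split before the first block boundary, and the remainder.
      long bytesInFirstBlock = (startIndex + 1) * blockSize - offset;
      long bytesInSecondBlock = length - bytesInFirstBlock;
      return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
    } else {
      // Three or more blocks: the second block lies entirely within the split.
      return startIndex + 1;
    }
  }

  public static void main(String[] args) {
    // 130-byte split starting at offset 80 with 100-byte blocks:
    // 20 bytes in block 0, 100 bytes in block 1, 10 bytes in block 2.
    System.out.println(chooseBlockIndex(80, 130, 100)); // prints 1
  }
}
```

On a tie in the two-block case this sketch returns the second block, matching the strict
comparison in the pseudocode.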

Note that we cannot choose the block index based on the amount of data contributed by each
individual host, because each block is replicated on several hosts.
For example (assume dfs block size = 100):
Block 1 contributes 20 bytes and its hosts are A, B, C
Block 2 contributes 100 bytes and its hosts are A, D, E
Block 3 contributes 10 bytes and its hosts are D, E, F

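If we aggregate on a per-host basis, host A, having contributed 120 bytes, would be the ideal
choice. However, if we then return Block 1 as the index, hosts B and C would also be treated
as data-local even though they hold only 20 bytes of the split, which is suboptimal.
Returning Block 2 instead gives hosts A, D, and E, each of which holds at least 100 bytes of
the split.
Thoughts?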
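
To make the arithmetic in the example concrete, here is a small Java sketch that sums the
split's bytes per host. The class HostContributionExample and the method bytesPerHost are
hypothetical names used only for illustration; the byte counts and host lists are those of
the three-block example above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of per-host aggregation; not part of Hadoop. */
final class HostContributionExample {

  /** Sums, for each host, the split's bytes held in the blocks stored on that host. */
  static Map<String, Long> bytesPerHost(long[] bytesPerBlock, List<List<String>> hostsPerBlock) {
    Map<String, Long> totals = new LinkedHashMap<>();
    for (int i = 0; i < bytesPerBlock.length; i++) {
      for (String host : hostsPerBlock.get(i)) {
        totals.merge(host, bytesPerBlock[i], Long::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    // Bytes each block contributes to the split, and the hosts holding each block.
    long[] bytesPerBlock = {20, 100, 10};
    List<List<String>> hostsPerBlock = Arrays.asList(
        Arrays.asList("A", "B", "C"),
        Arrays.asList("A", "D", "E"),
        Arrays.asList("D", "E", "F"));

    // Host A totals 120 bytes, D and E 110 each, B and C only 20 each, F 10.
    System.out.println(bytesPerHost(bytesPerBlock, hostsPerBlock));

    // The hosts of Block 2 (index 1) are A, D, and E.
    System.out.println(hostsPerBlock.get(1));
  }
}
```

Host A has the highest total, but only the host list of Block 2 excludes the low-contribution
hosts B, C, and F.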

> When an input split spans cross block boundary, the split location should be the host
having most of bytes on it. 
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3293
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3293
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Jothi Padmanabhan
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

