hadoop-mapreduce-issues mailing list archives

From "Patrick Kling (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem
Date Mon, 01 Nov 2010 22:12:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927178#action_12927178

Patrick Kling commented on MAPREDUCE-1752:

There is something really strange about the semantics of the offsets and lengths returned
by this. Consider the following part file consisting of 3 blocks containing a file f starting
at offset 896 with length 512:

| ...           |
0

| ...       | f |
512         896

| f         |...|
1024        1408

Calling getFileBlockLocations on this file will return 2 LocatedBlocks: b1=<offset=0, length=512>,
b2=<offset=512, length=512>. This indicates that b1 contains the first 512 bytes of
f, even though in fact it only contains the first 128 bytes. This is a problem when
the client uses these LocatedBlocks to detect whether a portion of f has been corrupted.
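
To make the arithmetic concrete, here is a minimal sketch that computes how many bytes of f each block of the part file actually holds. The class and method names are hypothetical and not part of the patch or of Hadoop; the 512-byte block size and 1536-byte part file length are assumptions taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop or the patch. */
final class HarBlockOverlap {
    /**
     * For each part-file block that overlaps the archived file, returns the
     * number of bytes of that file stored in the block.
     */
    static List<Long> bytesOfFilePerBlock(long blockSize, long partLength,
                                          long fileStart, long fileLength) {
        List<Long> result = new ArrayList<>();
        long fileEnd = fileStart + fileLength;
        for (long blockStart = 0; blockStart < partLength; blockStart += blockSize) {
            long blockEnd = Math.min(blockStart + blockSize, partLength);
            // Length of the intersection of [blockStart, blockEnd) and [fileStart, fileEnd).
            long overlap = Math.min(blockEnd, fileEnd) - Math.max(blockStart, fileStart);
            if (overlap > 0) {
                result.add(overlap);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Part file of 3 blocks of 512 bytes; f starts at 896 and is 512 bytes long.
        System.out.println(bytesOfFilePerBlock(512, 1536, 896, 512)); // prints [128, 384]
    }
}
```

So the first overlapping block holds only 128 bytes of f and the second holds the remaining 384, which is what the returned offsets and lengths fail to convey.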

I can think of 2 possible ways of fixing this:

1) Fix the offset of the returned blocks by subtracting hstatus.getStartIndex() (i.e., the
offset of f in the part file) from the block offset. This would return b1=<offset=-384,
length=512> and b2=<offset=128, length=512>, indicating to the client that the first
384 bytes of b1 are not part of f and correctly indicating the length of each block. In a
way, this is similar to how FSNamesystem.getBlockLocations returns entire blocks even if the
caller asks for a range that covers only part of these blocks.

2) Fix the length on the first block returned to reflect the portion of f that is contained
in this block, i.e., return b1=<offset=0, length=128>, b2=<offset=128, length=512>.
This seems somewhat less clean to me but avoids negative offsets. Also, it would break the
convention that all blocks of a file with the exception of the last block are the same length.
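
The two proposals can be sketched roughly as follows. This is only an illustration of the offset arithmetic under the assumptions of the example (512-byte blocks, f starting at 896); the class and method names are hypothetical, and blocks are represented as plain {offset, length} pairs rather than Hadoop's own types.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two proposals; names are illustrative, not Hadoop APIs. */
final class HarOffsetFixes {
    /**
     * Option 1: shift each part-file block offset by the file's start index.
     * Returns {offset, length} pairs relative to the start of the archived file.
     */
    static long[][] shiftOffsets(long[] blockStarts, long blockSize, long fileStart) {
        long[][] out = new long[blockStarts.length][];
        for (int i = 0; i < blockStarts.length; i++) {
            out[i] = new long[] {blockStarts[i] - fileStart, blockSize};
        }
        return out;
    }

    /**
     * Option 2: same shift, but trim the first block so that it starts at
     * offset 0 of the file and covers only the file's bytes in that block.
     */
    static long[][] trimFirstBlock(long[] blockStarts, long blockSize, long fileStart) {
        long[][] out = shiftOffsets(blockStarts, blockSize, fileStart);
        if (out.length > 0 && out[0][0] < 0) {
            out[0][1] += out[0][0]; // drop the bytes that precede the file
            out[0][0] = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] blockStarts = {512, 1024}; // part-file blocks that overlap f
        System.out.println(Arrays.deepToString(shiftOffsets(blockStarts, 512, 896)));
        System.out.println(Arrays.deepToString(trimFirstBlock(blockStarts, 512, 896)));
    }
}
```

With option 1 the first pair has a negative offset, so a client has to skip the leading bytes itself; with option 2 the offsets stay non-negative but the first block is shorter than the others.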

> Implement getFileBlockLocations in HarFilesystem
> ------------------------------------------------
>                 Key: MAPREDUCE-1752
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: harchive
>            Reporter: Dmytro Molkov
>            Assignee: Dmytro Molkov
>             Fix For: 0.22.0
>         Attachments: MAPREDUCE-1752.2.patch, MR-1752.patch
> To efficiently run map reduce on the data that has been HAR'ed it will be great to actually
> implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will schedule tasks
> I believe the overhead introduced by doing lookups in the index files can be smaller
> than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. And any ideas
> on how to test it are very welcome.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
