hadoop-mapreduce-issues mailing list archives

From "Dmytro Molkov (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem
Date Mon, 01 Nov 2010 22:28:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927186#action_12927186 ]

Dmytro Molkov commented on MAPREDUCE-1752:

In the second case it would of course be b1=<offset=0, length=128>.

I personally prefer the second way of fixing it, since it gives predictable offsets: for
the file f the block locations would start at offset 0 and their lengths would sum up
to the total length of the file. The possible problem is that the block location of the
first block will have a length different from the actual block length in this file.
The way block locations are returned currently, each of them except for the last one has
the length of a block and starts at an offset that is a multiple of the block length. So
even when I call getBlockLocations with an offset and length other than 0 and status.getLength(),
I am not guaranteed to get a result where the lengths sum to the requested length and
the smallest block location offset equals the requested offset.
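
The current behaviour described above can be modelled with a small sketch. It is not
Hadoop code: whole_blocks is a hypothetical helper, and the 1000-byte file, 256-byte block
size and requested range are made-up numbers. It assumes a file is split into fixed-size
blocks and that a range query returns every block overlapping the range, uncut.

```python
def whole_blocks(file_length, block_size, offset, length):
    """Return (offset, length) pairs for every block overlapping
    [offset, offset + length), reported whole rather than cut to the range."""
    if length <= 0 or offset >= file_length:
        return []
    end = min(offset + length, file_length)
    first = offset // block_size        # index of the first overlapping block
    last = (end - 1) // block_size      # index of the last overlapping block
    result = []
    for i in range(first, last + 1):
        start = i * block_size
        # Only the final block of the file can be shorter than block_size.
        result.append((start, min(block_size, file_length - start)))
    return result


# Hypothetical request: 300 bytes starting at offset 300.
locations = whole_blocks(file_length=1000, block_size=256, offset=300, length=300)
print(locations)                      # [(256, 256), (512, 256)]
print(sum(n for _, n in locations))   # 512
```

The smallest returned offset is 256 rather than the requested 300, and the lengths sum to
512 rather than 300, which is the lack of guarantee mentioned above.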

That said, I think the second approach fits better into this system, unless having blocks
of different lengths turns out to be a problem.
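
A minimal sketch of the second approach as described in this comment: take the blocks of
the underlying part file, cut each one to the range the archived file occupies, and rebase
the offsets so they start at 0. It is not the attached patch. clamp_to_file is a
hypothetical name, and the numbers (256-byte blocks, a 384-byte file starting 128 bytes
into the part file) are assumptions chosen so that the first block comes out as
<offset=0, length=128>; they are not taken from the issue.

```python
def clamp_to_file(part_blocks, file_start, file_length):
    """part_blocks: (offset, length) pairs for the blocks of the part file.
    Returns (offset, length) pairs relative to the start of the archived file."""
    file_end = file_start + file_length
    result = []
    for block_offset, block_length in part_blocks:
        # Intersect the block with the byte range occupied by the file.
        start = max(block_offset, file_start)
        end = min(block_offset + block_length, file_end)
        if start < end:  # the block overlaps the file
            result.append((start - file_start, end - start))
    return result


# Assumed layout: three 256-byte blocks; the file occupies bytes [128, 512).
part_blocks = [(0, 256), (256, 256), (512, 256)]
locations = clamp_to_file(part_blocks, file_start=128, file_length=384)
print(locations)  # [(0, 128), (128, 256)]
```

The offsets start at 0 and the lengths sum to the file length (384), but the first
location is shorter than a full block, which is the trade-off noted above.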

> Implement getFileBlockLocations in HarFilesystem
> ------------------------------------------------
>                 Key: MAPREDUCE-1752
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: harchive
>            Reporter: Dmytro Molkov
>            Assignee: Dmytro Molkov
>             Fix For: 0.22.0
>         Attachments: MAPREDUCE-1752.2.patch, MR-1752.patch
> To efficiently run map reduce on the data that has been HAR'ed it will be great to actually
> implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will schedule tasks
> I believe the overhead introduced by doing lookups in the index files can be smaller
> than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. And any ideas
> on how to test it are very welcome.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
