hadoop-mapreduce-issues mailing list archives

From "Dmytro Molkov (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem
Date Fri, 29 Oct 2010 00:27:24 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmytro Molkov updated MAPREDUCE-1752:
-------------------------------------

    Attachment: MAPREDUCE-1752.2.patch

Finally got back to this JIRA.
Attached is the patch that we tested internally and are currently using. It adds some overhead
at initial job submission, but it gives you data locality when the job runs, which is a
reasonable tradeoff.

We were thinking of eventually taking this one step further: the splits created by the job
client at job submission could point directly at the part files of the har. Then the only
piece of infrastructure accessing the har index file would be the client, and the MR tasks
would go directly to the specific offsets inside the har part files. But this seems like
another JIRA.
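
The idea of resolving splits to part-file offsets on the client could be sketched roughly as
follows. This is only an illustration, not code from the attached patch: HarEntry, PartRange
and toPartRange are hypothetical names, and the index is assumed to record a part file name,
a start offset and a length for each archived file.

```java
// Sketch only: these are hypothetical types, not Hadoop classes.
final class HarSplitSketch {

    /** Where one archived file lives inside a har part file (assumed index contents). */
    static final class HarEntry {
        final String partName;   // name of the part file holding the data
        final long startInPart;  // byte offset of the archived file inside the part file
        final long length;       // length of the archived file in bytes

        HarEntry(String partName, long startInPart, long length) {
            this.partName = partName;
            this.startInPart = startInPart;
            this.length = length;
        }
    }

    /** A byte range inside a part file that a task could read without the index. */
    static final class PartRange {
        final String partName;
        final long offset;
        final long length;

        PartRange(String partName, long offset, long length) {
            this.partName = partName;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Translates a range of the archived file into a range of its part file.
     * The requested length is clamped so the range never runs past the end
     * of the archived file.
     */
    static PartRange toPartRange(HarEntry entry, long fileOffset, long requestedLength) {
        if (fileOffset < 0 || requestedLength < 0) {
            throw new IllegalArgumentException("offset and length must be non-negative");
        }
        if (fileOffset >= entry.length) {
            // Nothing to read: return an empty range at the end of the file.
            return new PartRange(entry.partName, entry.startInPart + entry.length, 0);
        }
        long length = Math.min(requestedLength, entry.length - fileOffset);
        return new PartRange(entry.partName, entry.startInPart + fileOffset, length);
    }
}
```

With splits expressed this way, only the job client would need to consult the index; each task
would receive a part file name plus an offset and a length.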

> Implement getFileBlockLocations in HarFilesystem
> ------------------------------------------------
>
>                 Key: MAPREDUCE-1752
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: harchive
>            Reporter: Dmytro Molkov
>            Assignee: Dmytro Molkov
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1752.2.patch, MR-1752.patch
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be great to actually
> implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will schedule tasks
> appropriately.
> I believe the overhead introduced by doing lookups in the index files can be smaller
> than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. And any ideas
> on how to test it are very welcome.
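
One way to picture the block-location lookup described above is as a translation from the
blocks of a part file to the byte range of the archived file stored inside it. The sketch
below is an assumption-laden illustration, not the attached patch: Block and translate are
hypothetical names, and it assumes the index gives the archived file's start offset within
the part file and that the caller has already clamped the range to the file length.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: Block is a hypothetical stand-in for a block-location record.
final class HarBlockLocationSketch {

    static final class Block {
        final long offset;     // start offset of the block
        final long length;     // number of bytes in the block
        final String[] hosts;  // hosts holding a replica of the block

        Block(long offset, long length, String[] hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /**
     * partBlocks carry offsets relative to the part file. The result covers
     * the range [start, start + len) of the archived file, with offsets
     * relative to the archived file and block lengths trimmed to that range.
     */
    static List<Block> translate(List<Block> partBlocks, long fileStartInPart,
                                 long start, long len) {
        long rangeStart = fileStartInPart + start;
        long rangeEnd = rangeStart + len;
        List<Block> result = new ArrayList<>();
        for (Block b : partBlocks) {
            // Intersect the block with the requested range inside the part file.
            long lo = Math.max(b.offset, rangeStart);
            long hi = Math.min(b.offset + b.length, rangeEnd);
            if (lo < hi) {
                result.add(new Block(lo - fileStartInPart, hi - lo, b.hosts));
            }
        }
        return result;
    }
}
```

The hosts are passed through unchanged, which is the information a scheduler would need to
place a task near the data.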

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

