hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-902) Map output merge still uses unnecessary seeks
Date Sat, 17 Oct 2009 04:40:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766833#action_12766833 ]

Todd Lipcon commented on MAPREDUCE-902:
---------------------------------------

I've seen similar performance issues when there is a large number of reducers.

Christian: which Linux I/O scheduler are you using? Can you try switching to anticipatory
and see if the problem improves?
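
For reference, a minimal sketch (hypothetical code, not part of Hadoop; the device name
"sda" is only an assumption) of checking the active scheduler from Java by reading sysfs:

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper, not part of Hadoop: prints the active Linux I/O
// scheduler for a block device. The active scheduler appears in brackets,
// e.g. "noop [anticipatory] deadline cfq".
public class ShowIoScheduler {
    public static void main(String[] args) throws IOException {
        String dev = args.length > 0 ? args[0] : "sda";
        try (BufferedReader r = new BufferedReader(
                new FileReader("/sys/block/" + dev + "/queue/scheduler"))) {
            System.out.println(r.readLine());
        }
        // Switching (as root): echo anticipatory > /sys/block/sda/queue/scheduler
    }
}
{code}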

> Map output merge still uses unnecessary seeks
> ---------------------------------------------
>
>                 Key: MAPREDUCE-902
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-902
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.20.1
>            Reporter: Christian Kunz
>
> HADOOP-3638 improved the merge of the map output by caching the index files.
> But why not also cache the data files?
> In our use-case scenario (still on hadoop-0.18.3, where HADOOP-3638 would only help
> partially), an individual map task finishes in less than 30 minutes, but needs 4 hours to
> merge 70 spills for 20,000 partitions (with lzo compression), reading about 10 kB from each
> spill file, which is re-opened for every partition. As this is just a merge sort, there is
> no reason not to keep the input files open and eliminate seeks altogether in favor of
> sequential access.
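
Those numbers imply roughly 70 x 20,000 = 1.4 million file opens and seeks for a single map
task's final merge, to read only about 14 GB of data in ~10 kB pieces. A minimal sketch
(hypothetical code, not Hadoop's actual merge path; the offset/length arrays stand in for the
cached index) of the two access patterns being contrasted:

{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical illustration, not Hadoop's merge implementation: contrasts
// re-opening every spill file per partition with keeping one open stream
// per spill and reading each partition's segment sequentially.
public class SpillMergeSketch {

    // Pattern described in the issue: for every partition, every spill file
    // is re-opened and seeked, i.e. numPartitions * numSpills opens in total.
    static void mergeWithReopen(File[] spills, long[][] offsets, long[][] lengths,
                                int numPartitions) throws IOException {
        for (int p = 0; p < numPartitions; p++) {
            for (int s = 0; s < spills.length; s++) {
                try (RandomAccessFile raf = new RandomAccessFile(spills[s], "r")) {
                    raf.seek(offsets[s][p]);
                    byte[] segment = new byte[(int) lengths[s][p]];
                    raf.readFully(segment);
                    // ... feed segment into the merge for partition p ...
                }
            }
        }
    }

    // Suggested pattern: open each spill once; because partitions are laid out
    // in order inside each spill, reading them in partition order is purely
    // sequential and needs no further seeks.
    static void mergeWithOpenFiles(File[] spills, long[][] lengths,
                                   int numPartitions) throws IOException {
        RandomAccessFile[] open = new RandomAccessFile[spills.length];
        for (int s = 0; s < spills.length; s++) {
            open[s] = new RandomAccessFile(spills[s], "r");
        }
        try {
            for (int p = 0; p < numPartitions; p++) {
                for (int s = 0; s < spills.length; s++) {
                    byte[] segment = new byte[(int) lengths[s][p]];
                    open[s].readFully(segment); // file position simply advances
                    // ... feed segment into the merge for partition p ...
                }
            }
        } finally {
            for (RandomAccessFile raf : open) {
                raf.close();
            }
        }
    }
}
{code}

Because the second pattern never re-opens a spill, the per-partition cost drops from
numSpills opens plus seeks to plain sequential reads, which is the "keep the input files
open" behaviour the description asks for.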

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

