hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-902) Map output merge still uses unnecessary seeks
Date Sat, 17 Oct 2009 04:40:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766833#action_12766833 ]

Todd Lipcon commented on MAPREDUCE-902:

I've seen similar performance issues when the number of reducers is high.

Christian: which Linux I/O scheduler are you using? Can you try switching to anticipatory and
see if the problem improves?

> Map output merge still uses unnecessary seeks
> ---------------------------------------------
>                 Key: MAPREDUCE-902
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-902
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.20.1
>            Reporter: Christian Kunz
> HADOOP-3638 improved the merge of the map output by caching the index files.
> But why not also cache the data files?
> In our use-case scenario (still on hadoop-0.18.3, where HADOOP-3638 would only help
> partially), an individual map task finishes in less than 30 minutes but needs 4 hours to
> merge 70 spills for 20,000 partitions (with lzo compression), reading about 10kB from each
> spill file (which is re-opened for every partition). As this is just a merge sort, there is
> no reason not to keep the input files open and eliminate seeks altogether with sequential access.
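
The pattern the reporter describes can be sketched outside Hadoop: since each spill file is already sorted, a k-way merge can hold one open handle per spill and read every file sequentially exactly once, instead of re-opening and seeking into each spill for every partition. This is a minimal illustrative sketch, not Hadoop code; all file names and helpers below are hypothetical.

```python
# Sketch (not Hadoop code): merge N sorted spill files with one open
# handle per spill. heapq.merge repeatedly pulls the smallest record
# across all inputs, so each file is read front-to-back with no seeks.
import heapq
import os
import tempfile

def write_spill(path, records):
    # Hypothetical helper: write one sorted spill, one record per line.
    with open(path, "w") as f:
        for r in records:
            f.write(r + "\n")

def streaming_merge(spill_paths):
    # Keep every spill open for the whole merge; no per-partition reopen.
    files = [open(p) for p in spill_paths]
    try:
        for line in heapq.merge(*files):
            yield line.rstrip("\n")
    finally:
        for f in files:
            f.close()

# Demo with three tiny, already-sorted spills (as map spills are).
tmpdir = tempfile.mkdtemp()
spills = []
for i, recs in enumerate([["a", "d"], ["b", "e"], ["c", "f"]]):
    path = os.path.join(tmpdir, "spill%d" % i)
    write_spill(path, recs)
    spills.append(path)

merged = list(streaming_merge(spills))
print(merged)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

With 70 spills and 20,000 partitions, the reopen-per-partition pattern pays 70 × 20,000 open-and-seek operations; the streaming merge above pays 70 opens total, which is the improvement the issue asks for.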

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
