hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1965) Handle map output buffers better
Date Mon, 05 Nov 2007 10:15:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540173 ]

Devaraj Das commented on HADOOP-1965:

Doug, your comment about pushing the sorting to the maps is not clear to me. Each map,
in the current framework, and in this issue, will sort the outputs that it produces for the
reduces. Even if there are multiple spills, the final output per reduce will be sorted on
the map side (the maps do a final merge of the spills). The amount of data that is sorted
on the map side is dependent on the value of split.getLength().
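
To make the final merge concrete, here is a minimal sketch of a k-way merge over already-sorted spills for a single reduce partition. It is only an illustration, not the actual MapTask code: the class SpillMergeSketch, the method mergeSpills, and the use of in-memory String lists in place of on-disk spill segments are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; names and types are not taken from Hadoop.
public class SpillMergeSketch {

    // One cursor per sorted spill: the current head record plus the rest of the run.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Merge the already-sorted spills of one reduce partition into one sorted run.
    public static List<String> mergeSpills(List<List<String>> sortedSpills) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> spill : sortedSpills) {
            if (!spill.isEmpty()) {
                heap.add(new Cursor(spill.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Take the smallest head among all spills, then advance that spill.
            Cursor c = heap.poll();
            merged.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = Arrays.asList(
            Arrays.asList("a", "d", "g"),
            Arrays.asList("b", "e"),
            Arrays.asList("c", "f"));
        System.out.println(mergeSpills(spills)); // prints [a, b, c, d, e, f, g]
    }
}
```

Each spill contributes one sorted run, so the merged output per reduce is sorted no matter how many spills the map produced.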

The one concern I have on this issue is that, for a constant io.sort.mb, we double the number
of seeks for the final merge of the spill files when compared to the #seeks in the current
framework. This is because we work on 50% of the io.sort.mb space for sort/spill and use the
other 50% for collecting. The #seeks issue can be avoided by keeping the spill file handles
open during the merge, but we might then run into the issues discussed in HADOOP-874.
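
A rough back-of-the-envelope sketch of the doubling, under stated assumptions: the number of spills is taken as the map output size divided by the buffer space usable for one sort/spill (rounded up), and the final merge is assumed to need about one seek per spill file per reduce partition. The class SpillSeekEstimate, its methods, and the sample numbers (100 MB io.sort.mb, 800 MB of map output, 10 reduces) are hypothetical and ignore any bookkeeping overhead in the real buffer.

```java
// Hypothetical estimate only; not derived from the Hadoop source.
public class SpillSeekEstimate {

    // Spill files produced when each sort/spill works on usableBufferBytes (ceiling division).
    public static long spillCount(long mapOutputBytes, long usableBufferBytes) {
        return (mapOutputBytes + usableBufferBytes - 1) / usableBufferBytes;
    }

    // Assumed cost model: one seek per spill file for each reduce partition.
    public static long mergeSeeks(long spills, int reduces) {
        return spills * reduces;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long ioSortBytes = 100 * mb;   // assumed io.sort.mb = 100
        long mapOutput = 800 * mb;     // assumed map output size
        int reduces = 10;              // assumed number of reduces

        long currentSpills = spillCount(mapOutput, ioSortBytes);       // whole buffer per spill
        long proposedSpills = spillCount(mapOutput, ioSortBytes / 2);  // half for sort/spill

        System.out.println("current:  " + currentSpills + " spills, "
            + mergeSeeks(currentSpills, reduces) + " seeks");   // current:  8 spills, 80 seeks
        System.out.println("proposed: " + proposedSpills + " spills, "
            + mergeSeeks(proposedSpills, reduces) + " seeks");  // proposed: 16 spills, 160 seeks
    }
}
```

Under this model, halving the space available to each sort/spill doubles the spill count, and the seek count for the final merge scales with it.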

What do others think?

> Handle map output buffers better
> --------------------------------
>                 Key: HADOOP-1965
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1965
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>             Fix For: 0.16.0
>         Attachments: 1965_single_proc_150mb_gziped.jpeg, 1965_single_proc_150mb_gziped.pdf,
>                      1965_single_proc_150mb_gziped_breakup.png, HADOOP-1965-1.patch
>
> Today, the map task stops calling the map method while sort/spill is using the (single
> instance of) map output buffer. One improvement that can be done to improve performance of
> the map task is to have another buffer for writing the map outputs to, while sort/spill is
> using the first buffer.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
