hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-287) Speed up SequenceFile sort with memory reduction
Date Tue, 19 Feb 2008 13:16:43 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570206#action_12570206

Pi Song commented on HADOOP-287:

This should be measurable only if the output key set of a map task is big enough. This may
happen when the initial Map generates many  {K2,V2} for one input tuple {K1,V1}.

Still I think merge sort in some cases is worth considering (but not for the current implementation)

1. For small data sets,  performance characteristic of merge sort and quick sort should be
almost the same in many ways 
- Neglect the fact that merge sort requires more memory which is small in this case
- Merge sort on index array doesn't incur more objects for GC to be collected

2. For medium-sized data sets (I don't indicate how big is medium), quick sort should be better
(assume good pivoting so worst cases are rare).

3. For very large data sets, a good implementation of merge sort which allows  max memory
setting and really does multi-pass pipelined merging on the capped memory buffer should yield
better performance plus better control over memory usage.

> Speed up SequenceFile sort with memory reduction
> ------------------------------------------------
>                 Key: HADOOP-287
>                 URL: https://issues.apache.org/jira/browse/HADOOP-287
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Benjamin Reed
>            Assignee: Chris Douglas
>         Attachments: 287-1.patch, 287-2.patch, 287-2.patch, s.patch, zoom-sort.patch,
> I replaced the merge sort with a quick sort and it yielded approx 30% improvement in
sort time. It also reduced the memory requirement for sorting because the sort is done in

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message