hadoop-mapreduce-issues mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-64) Map-side sort is hampered by io.sort.record.percent
Date Tue, 22 Dec 2009 10:32:29 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793578#action_12793578 ]

Hong Tang commented on MAPREDUCE-64:

bq. This is an interesting idea. Clever implementations could also avoid skewing the average
record size disproportionately (possibly an independent issue). Please file a JIRA.

Will do. 

bq. The distinction between flush and close is not clear for a Collector.

The reason I find it odd is that, conventionally, one can flush a stream an arbitrary number
of times without destroying it. That is clearly not the case here. Yes, I agree that
MAPREDUCE-1211 would be relevant here. I am also fine with deferring the work of making the
distinction between close and flush consistent with the Java I/O stream convention to MAPREDUCE-1324
(assuming that is your intention).
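
To illustrate the convention I have in mind, here is a trivial java.io sketch (the file name
and record contents are made up, and this is plain java.io, not the Collector interface):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class FlushVsClose {
      public static void main(String[] args) throws IOException {
        BufferedOutputStream out =
            new BufferedOutputStream(new FileOutputStream("example.out"));
        byte[] record = "key\tvalue\n".getBytes();

        out.write(record);
        out.flush();   // push buffered bytes downstream; the stream remains usable
        out.write(record);
        out.flush();   // flush may be called any number of times

        out.close();   // terminal: performs a final flush and releases the stream;
                       // the stream must not be written to afterwards
      }
    }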

bq. Since the testing/validation of this patch is difficult, and you've already done the work,
I'd like to postpone this to a separate issue if that's OK.

That is fine. Would you please file a JIRA for this?

> Map-side sort is hampered by io.sort.record.percent
> ---------------------------------------------------
>                 Key: MAPREDUCE-64
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-64
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Arun C Murthy
>            Assignee: Chris Douglas
>         Attachments: M64-0.patch, M64-0i.png, M64-1.patch, M64-1i.png, M64-2.patch, M64-2i.png,
> M64-3.patch, M64-4.patch, M64-5.patch
> Currently io.sort.record.percent is a fairly obscure, per-job configurable, expert-level
> parameter which controls how much accounting space is available for records in the map-side
> sort buffer (io.sort.mb). Typical values for io.sort.mb (100) and io.sort.record.percent
> (0.05) imply that we can store ~350,000 records in the buffer before necessitating a sort/combine/spill.
> However, for many applications that deal with small records, e.g. the world-famous wordcount
> and its family, this implies we can only use 5-10% of io.sort.mb (i.e. 5-10M) before we spill,
> in spite of having _much_ more memory available in the sort buffer. Wordcount, for example,
> results in ~12 spills (given an HDFS block size of 64M). The presence of a combiner exacerbates
> the problem by adding serialization/deserialization of records on top of that...
> Sure, jobs can configure io.sort.record.percent, but it's tedious and obscure; we really
> can do better by having the framework automagically pick it by using all available memory
> (up to io.sort.mb) for either the data or the accounting.
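
A quick back-of-the-envelope for the ~350,000 figure in the description above, assuming the
map-side buffer charges roughly 16 bytes of accounting metadata per record -- an assumption for
illustration, not something stated in this thread:

    public class SortBufferAccounting {
      public static void main(String[] args) {
        long ioSortMb = 100;                  // io.sort.mb
        double recordPercent = 0.05;          // io.sort.record.percent
        long bufferBytes = ioSortMb * 1024 * 1024;
        long accountingBytes = (long) (bufferBytes * recordPercent);  // ~5 MB for metadata
        long metaBytesPerRecord = 16;         // assumed accounting cost per record
        long maxRecords = accountingBytes / metaBytesPerRecord;       // ~327,680 records
        long dataBytes = bufferBytes - accountingBytes;               // ~95 MB for key/value data
        System.out.println("record limit before spill: ~" + maxRecords);
        System.out.println("data space before spill:   ~" + dataBytes / (1024 * 1024) + " MB");
        // With tiny records (wordcount-style), the ~327K record limit is hit after only a few
        // MB of key/value data, so most of the ~95 MB data space sits idle and the map spills
        // far more often than the buffer size alone would suggest.
      }
    }

With, say, ~20-byte records, 327K records is only ~6-7 MB of data, which lines up with the
5-10% figure quoted in the description.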

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
