Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Wed, 21 Mar 2012 06:53:44 +0000 (UTC)
From: "anty.rao (Commented) (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: 
 <1894668771.40694.1332312824371.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <509801353.35624.1332230385964.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-4039) Sort Avoidance
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/MAPREDUCE-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234174#comment-13234174 ]

anty.rao commented on MAPREDUCE-4039:
-------------------------------------

I am a little confused about the implementation of the IFile Reader.
In previous Hadoop versions, the IFile reader read a batch of key/value pairs from disk at once and then served them directly from memory. I think this strategy is common and sound. In YARN, however, the reader appears to hit the disk for each requested key/value pair (though read-ahead helps somewhat). Am I missing something? Can someone shed some light on this?

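To make the batched-read strategy above concrete, here is a minimal sketch. It is not Hadoop's IFile.Reader: the class `BatchedReadSketch`, the counting wrapper, and the length-prefixed record layout are all assumptions made for illustration. A `ByteArrayInputStream` stands in for the disk, and the sketch only counts how often that source is touched for a given buffer size.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; this is NOT Hadoop's IFile.Reader. */
final class BatchedReadSketch {

    /** Counts how often the underlying source (standing in for the disk) is asked for bytes. */
    static final class CountingInputStream extends FilterInputStream {
        int readCalls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            readCalls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            readCalls++;
            return super.read(b, off, len);
        }
    }

    /** Encodes pairs as [keyLength][key][valueLength][value], a made-up record layout. */
    static byte[] encode(String[][] pairs) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String[] pair : pairs) {
                for (String field : pair) {
                    byte[] raw = field.getBytes(StandardCharsets.UTF_8);
                    out.writeInt(raw.length);
                    out.write(raw);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads pairCount pairs through a buffer of bufferSize bytes; returns the number of source reads. */
    static int countSourceReads(byte[] data, int pairCount, int bufferSize) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source, bufferSize))) {
            for (int i = 0; i < pairCount; i++) {
                byte[] key = new byte[in.readInt()];
                in.readFully(key);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
            }
            return source.readCalls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

When the buffer is larger than the encoded data, one source read serves every pair; shrinking the buffer to a few bytes forces many more trips to the source, which is the behaviour I am asking about.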
> Sort Avoidance
> --------------
>
>                 Key: MAPREDUCE-4039
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4039
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: anty.rao
>            Priority: Minor
>             Fix For: 0.23.2
>
>
> Inspired by [Tenzing|http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf], in 5.1 MapReduce Enhancements:
> {quote}*Sort Avoidance*. Certain operators such as hash join
> and hash aggregation require shuffling, but not sorting. The
> MapReduce API was enhanced to automatically turn off
> sorting for these operations. When sorting is turned off, the
> mapper feeds data to the reducer which directly passes the
> data to the Reduce() function bypassing the intermediate
> sorting step. This makes many SQL operators significantly
> more efficient.{quote}
> There are many applications that need only aggregation, not sorting. Using sorting to achieve aggregation is costly and inefficient. Without sorting, the application can use a hash table or hash map to aggregate efficiently. The application should bear in mind, however, that reduce-side memory is limited: it is responsible for managing the reducer's memory itself and for guarding against running out of memory. A map-side combiner is not supported; as a workaround, you can do hash aggregation on the map side.
> The following are the main points of the sort avoidance implementation:
> # Add a boolean configuration parameter, ??mapreduce.sort.avoidance??, to turn the sort avoidance workflow on or off. The two workflows coexist.
> # Key/value pairs emitted by the map function are sorted by partition only, using a more efficient sorting algorithm: counting sort.
> # The map-side merge is a byte-level merge that simply concatenates the bytes of the generated spills: it reads bytes in and writes bytes out, without the key/value serialization/deserialization and comparison overhead that the current version incurs.
> # Reduce can start as soon as any map output is available, in contrast to the sort workflow, which must wait until all map outputs are fetched and merged.
> # Map output in memory can be consumed directly by reduce. When reduce can't keep up with the incoming map outputs, the in-memory merge thread kicks in and merges in-memory map outputs onto disk.
> # On-disk files are read sequentially to feed reduce, in contrast to the current implementation, which reads multiple files concurrently and causes many disk seeks. In-memory map output takes precedence over on-disk files in feeding the reduce function.
> I have already implemented this feature on top of Hadoop CDH3U3 and done some performance evaluation; see [https://github.com/hanborq/hadoop] for details. Now I would like to port it to YARN. Comments are welcome.
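
The "sorted by partition only" step in the quoted description can be illustrated with a stable counting sort over partition numbers. This is a minimal sketch, not the code from the linked repository; the class `PartitionCountingSort`, the method `orderByPartition`, and the choice to return record indices are assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical illustration of ordering map output by partition with counting sort. */
final class PartitionCountingSort {

    /**
     * Returns record indices grouped by partition, keeping arrival order within
     * each partition. partitions[i] is the partition of record i and must lie
     * in [0, numPartitions).
     */
    static int[] orderByPartition(int[] partitions, int numPartitions) {
        // Pass 1: count the records in each partition (shifted by one slot).
        int[] start = new int[numPartitions + 1];
        for (int p : partitions) {
            start[p + 1]++;
        }
        // Prefix sum: start[p] becomes the first output slot of partition p.
        for (int p = 0; p < numPartitions; p++) {
            start[p + 1] += start[p];
        }
        // Pass 2: drop each record index into the next free slot of its partition.
        int[] next = Arrays.copyOf(start, numPartitions);
        int[] order = new int[partitions.length];
        for (int i = 0; i < partitions.length; i++) {
            order[next[partitions[i]]++] = i;
        }
        return order;
    }
}
```

The two passes run in time linear in the number of records plus the number of partitions and never compare keys, which is why ordering by partition alone is cheaper than a full comparison sort on (partition, key).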

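The quoted description's point that aggregation needs grouping but not ordering can be sketched with a hash map on the reduce side. This is only an illustration under stated assumptions: `HashAggregationSketch`, `sumByKey`, and the `maxDistinctKeys` guard are hypothetical names, and the guard merely stands in for the memory management the description says the application must do itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of reduce-side hash aggregation over unsorted input. */
final class HashAggregationSketch {

    /**
     * Sums values per key in arrival order; the keys never need to be sorted.
     * maxDistinctKeys is a crude stand-in for a memory limit.
     */
    static Map<String, Long> sumByKey(String[] keys, long[] values, int maxDistinctKeys) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        Map<String, Long> totals = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            if (!totals.containsKey(keys[i]) && totals.size() >= maxDistinctKeys) {
                // A real reducer would flush or spill partial results here instead.
                throw new IllegalStateException("too many distinct keys for the memory budget");
            }
            totals.merge(keys[i], values[i], Long::sum);
        }
        return totals;
    }
}
```

Each record costs one expected constant-time map update, so the work is linear in the input, but the map holds one entry per distinct key, which is where the memory concern comes from.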
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira