hadoop-mapreduce-issues mailing list archives

From "Tsz Wo Nicholas Sze (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (MAPREDUCE-5010) use multithreading to speed up mergeParts and try MapPartitionsCompleteEvent to schedule fetch in reduce
Date Tue, 21 Oct 2014 01:17:34 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tsz Wo Nicholas Sze reassigned MAPREDUCE-5010:
----------------------------------------------

    Assignee:     (was: Tsz Wo Nicholas Sze)

> use multithreading to speed up mergeParts and try MapPartitionsCompleteEvent to schedule fetch in reduce
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5010
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5010
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1
>    Affects Versions: 1.0.1
>            Reporter: Li Junjun
>         Attachments: MAPREDUCE-5010.jpg
>
>
> Use multithreading to speed up the Merger, and try a MapPartitionsCompleteEvent to schedule fetches in reduce.
> This targets multicore CPUs; the gain will depend on your hardware and configuration.
> In the map task:
> <code>
> for (int parts = 0; parts < partitions; parts++) {
> 	// merge this partition and append it to the final output file (file.out)
> }
> </code>
> This loop runs on a single thread.
> So I think we could use more threads (conf: mapred.map.mergerthreads) to do the merge if you have many cores or CPUs.
> Currently, the reduce tasks fetch a map's output only after that map task completes, which means that
> when map x completes, all the reduces fetch its output at the same time, even though we use
> <code>   
>    // Randomize the map output locations to prevent 
>    // all reduce-tasks swamping the same tasktracker
>    List<String> hostList = new ArrayList<String>();
>    hostList.addAll(mapLocations.keySet());       
>    Collections.shuffle(hostList, this.random);
> </code>
> in the reduce task.
> For example, 100 reduces wait for 2 maps to complete because the cluster's map task capacity is 98 but the job has 100 map tasks.
> So I think: while the threads are merging, for example if a map has 8 partitions and uses 3 threads for the merge,
> as soon as one thread completes one partition we can inform the reduces to fetch that partition file immediately,
> or we can wait until 3 partitions are complete and then send the event (conf: mapred.map.parts.inform) to reduce the load on the JobTracker,
> rather than waiting for the whole map task to complete. Doing this would prevent all reduce tasks from swamping the same tasktracker
> more effectively and would speed up the reduce phase.
> Is this acceptable?
> Any other good ideas?
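
A minimal sketch of the idea described above, in plain Java with no Hadoop dependencies. It is an illustration under assumptions, not the MapTask implementation: the class and method names (ParallelMergeSketch, PartsCompleteNotifier, mergePartition) are hypothetical, the two proposed settings (mapred.map.mergerthreads and mapred.map.parts.inform) are passed as plain int parameters, and it assumes each partition can be merged independently of the others, which the existing loop that appends every partition to file.out in order does not do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ParallelMergeSketch {

    /** Hypothetical helper: batches finished partitions into one event per batch. */
    static final class PartsCompleteNotifier {
        private final int batchSize;
        private final Consumer<List<Integer>> sink;
        private final List<Integer> pending = new ArrayList<>();

        PartsCompleteNotifier(int batchSize, Consumer<List<Integer>> sink) {
            this.batchSize = batchSize;
            this.sink = sink;
        }

        /** Called by a merger thread when one partition is fully merged. */
        synchronized void partitionDone(int partition) {
            pending.add(partition);
            if (pending.size() >= batchSize) {
                flush();
            }
        }

        /** Sends whatever is pending as a single event. */
        synchronized void flush() {
            if (pending.isEmpty()) {
                return;
            }
            sink.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }

    /**
     * Merges all partitions on a fixed-size pool and reports completed
     * partitions in batches instead of waiting for the whole map task.
     * mergerThreads stands in for mapred.map.mergerthreads and
     * partsPerEvent for mapred.map.parts.inform (both proposed, not real keys).
     */
    static void mergeAll(int partitions, int mergerThreads, int partsPerEvent,
                         Consumer<List<Integer>> sink) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(mergerThreads);
        PartsCompleteNotifier notifier = new PartsCompleteNotifier(partsPerEvent, sink);
        for (int p = 0; p < partitions; p++) {
            final int part = p;
            pool.submit(() -> {
                mergePartition(part);
                notifier.partitionDone(part);
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("merge did not finish in time");
        }
        notifier.flush(); // report any leftover partitions
    }

    /** Placeholder for the real per-partition merge work. */
    static void mergePartition(int partition) {
        // In a real task this would merge the spill segments for one partition.
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<Integer>> events = Collections.synchronizedList(new ArrayList<>());
        mergeAll(8, 3, 3, events::add);
        System.out.println(events.size() + " events: " + events);
    }
}
```

With 8 partitions, 3 merger threads, and 3 partitions per event, the sink receives three events (two full batches plus a final batch of 2); which partition lands in which batch depends on thread scheduling. A batch size of 1 corresponds to informing the reduces as soon as each partition completes.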



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
