hadoop-common-dev mailing list archives

From "Milind Bhandarkar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-229) hadoop cp should generate a better number of map tasks
Date Thu, 18 May 2006 20:58:07 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-229?page=comments#action_12412420 ] 

Milind Bhandarkar commented on HADOOP-229:

Number of maps is now computed as follows:

numMaps = max(1, min(numFiles, numNodes * 10, totalBytes / 256MB, 10000))

Also added status reporting for every file (or every 32MB, roughly every 10 seconds) so that tasks
don't time out while copying huge files.
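The heuristic above can be sketched as follows. This is an illustrative sketch, not the actual patch code: the class, method, and constant names are invented for clarity; only the formula itself comes from the comment above.

```java
// Sketch of the map-count heuristic: one map per file, bounded by
// 10 maps per node, one map per 256MB of input, and a hard cap of
// 10000 maps, with a floor of 1. Names here are illustrative only.
public class CopyMapCount {
    static final long BYTES_PER_MAP = 256L * 1024 * 1024; // 256MB
    static final int MAX_MAPS = 10000;                     // job tracker cap
    static final int MAPS_PER_NODE = 10;

    static int computeNumMaps(int numFiles, int numNodes, long totalBytes) {
        long limit = Math.min(
            Math.min((long) numFiles, (long) numNodes * MAPS_PER_NODE),
            Math.min(totalBytes / BYTES_PER_MAP, (long) MAX_MAPS));
        return (int) Math.max(1L, limit);
    }
}
```

For the scenario in the issue description (300 files of 30GB each on a 300-node cluster), this yields min(300, 3000, 36000, 10000) = 300 maps, i.e. one map per file, so the whole cluster is used.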

This fix is part of the patch attached to HADOOP-220.

> hadoop cp should generate a better number of map tasks
> ------------------------------------------------------
>          Key: HADOOP-229
>          URL: http://issues.apache.org/jira/browse/HADOOP-229
>      Project: Hadoop
>         Type: Bug

>   Components: fs
>     Reporter: Yoram Arnon
>     Assignee: Milind Bhandarkar
>     Priority: Minor

> hadoop cp currently assigns 10 files to copy per map task.
> In the case of a small number of large files on a large cluster (say 300 files of 30GB each
> on a 300-node cluster), this results in long execution times.
> Better would be to assign files per task such that the entire cluster is utilized: one
> file per map, with a cap of 10000 maps total, so as not to overburden the job tracker.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:
