hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-93) allow minimum split size configurable
Date Fri, 17 Mar 2006 22:04:58 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=comments#action_12370894 ] 

Owen O'Malley commented on HADOOP-93:

From what I've seen, it is always 32M fragments, but that is still 300k input splits/maps,
which is a lot. We'd like to be able to drop that by an order of magnitude. (I think in this
case that the input splitter never finished, so we don't know.)
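
As a rough check on these numbers, the sketch below counts splits for the input described in the issue (356 files of about 30G each). It assumes binary units and that each file is split independently; the function name `count_splits` and the 320M split size are illustrative only and do not come from Hadoop.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def count_splits(num_files, file_size, split_size):
    """Total splits when every file is cut independently into split_size pieces."""
    per_file = -(-file_size // split_size)  # ceiling division
    return num_files * per_file

# 32M fragments: roughly the "300k" splits mentioned in the comment.
splits_32m = count_splits(356, 30 * GIB, 32 * MIB)
# A split size ten times larger gives an order of magnitude fewer splits.
splits_320m = count_splits(356, 30 * GIB, 320 * MIB)

print(splits_32m)   # 341760
print(splits_320m)  # 34176
```

Under these assumptions, 32M splits yield about 340k maps, and reaching the desired order-of-magnitude drop needs splits of roughly 320M.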

> allow minimum split size configurable
> -------------------------------------
>          Key: HADOOP-93
>          URL: http://issues.apache.org/jira/browse/HADOOP-93
>      Project: Hadoop
>         Type: Bug
>     Reporter: Hairong Kuang
>  Attachments: hadoop-93.fix
> The current default split size is the size of a block (32M), and a SequenceFile sets it
to SequenceFile.SYNC_INTERVAL (2K). We currently have a Map/Reduce application working on
crawled documents. Its input data consists of 356 sequence files, each around 30G in size.
The jobtracker takes forever to launch the job because it needs to generate 356*30G/2K
map tasks!
> The proposed solution is to make the minimum split size configurable so that the programmer
can control the number of tasks generated.
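
One way to read the proposal is as a lower bound applied to whatever split size the input format would otherwise choose. The following is a minimal sketch of that idea, not the logic in the attached hadoop-93.fix; the names `effective_split_size`, `min_split_size`, and `map_tasks` are hypothetical, the 64M minimum is an arbitrary example, and binary units are assumed.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3

def effective_split_size(default_split_size, min_split_size=0):
    """Clamp the format's default split size from below by a configured minimum."""
    return max(default_split_size, min_split_size)

def map_tasks(num_files, file_size, split_size):
    # One map task per split; each file is split on its own (ceiling division).
    return num_files * -(-file_size // split_size)

sync_interval = 2 * KIB  # SequenceFile split size quoted in the issue description

without_min = map_tasks(356, 30 * GIB, effective_split_size(sync_interval))
with_min = map_tasks(356, 30 * GIB, effective_split_size(sync_interval, 64 * MIB))

print(without_min)  # 5599395840
print(with_min)     # 170880
```

With the 2K default the job would need about 5.6 billion map tasks; a 64M minimum in this sketch brings that down to about 171k.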
