[ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]
Owen O'Malley updated HADOOP-93:
--------------------------------
Attachment: (was: hadoop_87.fix)
> allow minimum split size configurable
> -------------------------------------
>
> Key: HADOOP-93
> URL: http://issues.apache.org/jira/browse/HADOOP-93
> Project: Hadoop
> Type: Bug
> Components: mapred
> Versions: 0.1
> Reporter: Hairong Kuang
> Fix For: 0.1
> Attachments: hadoop-93.fix
>
> The current default split size is the size of a block (32M), and a SequenceFile sets it to
SequenceFile.SYNC_INTERVAL (2K). We currently have a Map/Reduce application working on
crawled documents. Its input data consists of 356 sequence files, each around 30G in size.
The jobtracker takes forever to launch the job because it needs to generate 356*30G/2K
map tasks!
> The proposed solution is to make the minimum split size configurable so that the programmer
can control the number of tasks generated.
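
To put the 356*30G/2K figure in concrete terms, here is a rough sketch of the arithmetic. It assumes binary units (30G = 30 * 2^30 bytes, 2K = 2048 bytes, 32M = 32 * 2^20 bytes) and exactly 30G per file, whereas the report only says "around 30G". SplitEstimate and estimateSplits are hypothetical names used for illustration; they are not part of the Hadoop API or of the attached patch.

```java
// Hypothetical helper, not Hadoop code: estimates how many map tasks
// result when every input file is cut into fixed-size splits.
class SplitEstimate {

    // One split per splitSize bytes, rounding up for a trailing partial split.
    static long estimateSplits(long fileCount, long fileSize, long splitSize) {
        long splitsPerFile = (fileSize + splitSize - 1) / splitSize;
        return fileCount * splitsPerFile;
    }

    public static void main(String[] args) {
        long files = 356;
        long fileSize = 30L << 30;   // assumed: exactly 30G per file
        long syncInterval = 2L << 10; // assumed: 2K taken as 2048 bytes
        long blockSize = 32L << 20;   // 32M block size from the report

        // Splitting at the 2K sync interval
        System.out.println(estimateSplits(files, fileSize, syncInterval));
        // Splitting at the 32M block size
        System.out.println(estimateSplits(files, fileSize, blockSize));
    }
}
```

Under these assumptions, 2K splits give 5,599,395,840 map tasks, while 32M splits give 341,760, which is why a configurable minimum matters for this job.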
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira