hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-93) allow minimum split size configurable
Date Fri, 17 Mar 2006 21:17:00 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=comments#action_12370887 ] 

Doug Cutting commented on HADOOP-93:

With such big input files, the default logic should split things into dfs block-sized splits.
Smaller splits should only be used if this would result in fewer than mapred.map.tasks splits.
What value do you have for mapred.map.tasks in your mapred-default.xml?  Let's make sure
that is working before we add a new min.split.size feature.  I don't oppose the feature, but
the job should be generating 356*30G/32M splits, not 356*30G/2K splits as you claim.  That's still
a lot of splits.  If it is too many, then we should add the feature you're proposing.
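
To make the two figures concrete, here is a rough sketch of the arithmetic. It is a simplified model of the logic described above, not Hadoop's actual code: the function names are hypothetical, sizes are taken as binary units (32M = 32 * 1024^2 bytes), each file is assumed to be exactly 30G, and the mapred.map.tasks value of 2 is only a placeholder since the real value is not known here.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def split_size(total_size, requested_splits, block_size, min_split_size):
    """Hypothetical model: aim for total_size / requested_splits,
    cap it at the block size, and never go below the minimum split size."""
    goal = total_size // max(requested_splits, 1)
    return max(min_split_size, min(goal, block_size))

def count_splits(num_files, file_size, size):
    """Splits never span files, so round up per file and multiply."""
    per_file = (file_size + size - 1) // size  # integer ceiling
    return num_files * per_file

NUM_FILES = 356
FILE_SIZE = 30 * GB          # assumption: every file is exactly 30G

# Placeholder mapred.map.tasks = 2; any small value gives the same result
# because the goal size is then far larger than one block.
size = split_size(NUM_FILES * FILE_SIZE, requested_splits=2,
                  block_size=32 * MB, min_split_size=2 * KB)

block_sized = count_splits(NUM_FILES, FILE_SIZE, size)    # 356*30G/32M
sync_sized = count_splits(NUM_FILES, FILE_SIZE, 2 * KB)   # 356*30G/2K

print(size // MB, block_sized, sync_sized)
```

Under these assumptions the block-sized case gives 960 splits per file, or 341,760 in total, while 2K splits would give 5,599,395,840, a factor of 16,384 more.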

Note that, as a workaround, it is also easy to implement this without patching by defining an
InputFormat that subclasses InputFormatBase and specifies a different minSplitSize.  That said,
making minSplitSize a long is a good idea.
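
The effect of that workaround can be illustrated with a small stand-alone model. The classes below are hypothetical stand-ins and do not use the Hadoop API; the 1G minimum is an arbitrary example value, the 2K default mirrors the SYNC_INTERVAL figure quoted in the issue, and the split-size rule is the simplified one described in this comment.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

class InputFormatModel:
    """Hypothetical stand-in for a base input format (not Hadoop code)."""
    min_split_size = 1

    def split_size(self, total_size, requested_splits, block_size):
        # Goal size capped at the block size, floored at the minimum.
        goal = total_size // max(requested_splits, 1)
        return max(self.min_split_size, min(goal, block_size))

    def count_splits(self, file_sizes, requested_splits, block_size):
        size = self.split_size(sum(file_sizes), requested_splits, block_size)
        # Round up per file, since a split does not span files.
        return sum((f + size - 1) // size for f in file_sizes)

class SequenceFileModel(InputFormatModel):
    min_split_size = 2 * KB      # the 2K sync interval from the issue

class CoarseSequenceFileModel(SequenceFileModel):
    min_split_size = 1 * GB      # arbitrary larger minimum for illustration

files = [30 * GB] * 356          # assumption: 356 files of exactly 30G
default_count = SequenceFileModel().count_splits(files, 2, 32 * MB)
coarse_count = CoarseSequenceFileModel().count_splits(files, 2, 32 * MB)

# A 32-bit signed int tops out just under 2G, so a larger minimum
# (4G, say) needs a 64-bit type.
JAVA_INT_MAX = 2**31 - 1
needs_long = 4 * GB > JAVA_INT_MAX

print(default_count, coarse_count, needs_long)
```

In this model, raising the minimum to 1G cuts the count from 341,760 to 10,680 splits, and the last check shows why an int-typed minimum would stop being enough for minimums of 2G or more.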

So, in summary, can you please confirm that the actual number of splits that you object to
is 356*30G/32M splits, not 356*30G/2K?  Thanks.

> allow minimum split size configurable
> -------------------------------------
>          Key: HADOOP-93
>          URL: http://issues.apache.org/jira/browse/HADOOP-93
>      Project: Hadoop
>         Type: Bug
>     Reporter: Hairong Kuang
>  Attachments: hadoop-93.fix
> The current default split size is the size of a block (32M), and a SequenceFile sets it
> to SequenceFile.SYNC_INTERVAL (2K). We currently have a Map/Reduce application working on
> crawled documents. Its input data consists of 356 sequence files, each around 30G in size.
> The jobtracker takes forever to launch the job because it needs to generate 356*30G/2K
> map tasks!
> The proposed solution is to make the minimum split size configurable so that the programmer
> can control the number of tasks to generate.

This message is automatically generated by JIRA.