hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-93) allow minimum split size configurable
Date Fri, 17 Mar 2006 21:17:00 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=comments#action_12370887 ] 

Doug Cutting commented on HADOOP-93:
------------------------------------

With such big input files the default logic should split things into dfs block-sized splits.
Smaller splits should only be used if this would result in fewer than mapred.map.tasks splits.
What value do you have for mapred.map.tasks in your mapred-default.xml?  Let's make sure
that is working before we add a new min.split.size feature.  I don't oppose the feature, but
it should be generating 356*30G/32M splits, not 356*30G/2K splits as you claim.  That's still
a lot of splits.  If it is too many then we should add the feature you're adding.
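
To put numbers on the two estimates, here is a rough sketch of the split-size rule as described above: block-sized splits, smaller only when needed to reach mapred.map.tasks, and never below the minimum split size. The class and method names are hypothetical, the value 100 for mapred.map.tasks is an assumption, and this is not the actual InputFormatBase code.

```java
// Hypothetical sketch of the split-size rule described in this comment.
// Not Hadoop source; names and the mapred.map.tasks value are assumptions.
final class SplitMathSketch {
    static final long KB = 1024L, MB = 1024L * KB, GB = 1024L * MB;

    // Use block-sized splits unless that would yield fewer than mapTasks
    // splits, and never go below the configured minimum split size.
    static long splitSize(long totalBytes, int mapTasks, long blockSize, long minSplitSize) {
        long goalSize = totalBytes / Math.max(mapTasks, 1);
        return Math.max(minSplitSize, Math.min(goalSize, blockSize));
    }

    // Number of splits for `files` equally sized files, rounding up per file.
    static long splitCount(int files, long fileBytes, long splitBytes) {
        long perFile = (fileBytes + splitBytes - 1) / splitBytes;
        return files * perFile;
    }

    public static void main(String[] args) {
        int files = 356;
        long fileBytes = 30 * GB;   // "around 30G" per file, per the report
        int mapTasks = 100;         // assumed value of mapred.map.tasks

        long size = splitSize(files * fileBytes, mapTasks, 32 * MB, 2 * KB);
        System.out.println("split size (bytes):      " + size);
        System.out.println("splits at that size:     " + splitCount(files, fileBytes, size));
        System.out.println("splits at 2K (as claimed): " + splitCount(files, fileBytes, 2 * KB));
    }
}
```

Under these assumptions the rule picks the 32M block size, giving 356 * 960 = 341,760 splits, versus roughly 5.6 billion if every split were 2K.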

Note that, as a workaround, it is also easy to implement this w/o patching by defining an
InputFormat that subclasses InputFormatBase and specifies a different minSplitSize.  But making
that a long is a good idea.
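
The shape of that workaround might look roughly like the sketch below. BaseFormatSketch is a hypothetical stand-in written only for this example, not Hadoop's InputFormatBase, and the setter name and the 1 GB minimum are assumptions. The point is just that a subclass can raise the minimum in its constructor.

```java
// Hypothetical stand-in for a base input format. NOT Hadoop's InputFormatBase;
// the field, setter name, and sizing rule are assumptions for illustration.
class BaseFormatSketch {
    private long minSplitSize = 1;   // a long, so minimums above 2 GB are expressible

    protected void setMinSplitSize(long bytes) {
        this.minSplitSize = bytes;
    }

    // Block-sized splits, but never smaller than the configured minimum.
    long splitSizeFor(long blockSize) {
        return Math.max(minSplitSize, blockSize);
    }
}

// Subclass whose only job is to specify a larger minimum split size.
class CoarseSplitFormatSketch extends BaseFormatSketch {
    CoarseSplitFormatSketch() {
        setMinSplitSize(1024L * 1024 * 1024);   // 1 GB, an illustrative choice
    }
}

final class CoarseSplitDemo {
    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024;          // 32M dfs block
        long fileBytes = 30L * 1024 * 1024 * 1024;   // "around 30G" per file
        long split = new CoarseSplitFormatSketch().splitSizeFor(blockSize);
        // Splits across 356 such files with the raised minimum.
        System.out.println(356 * (fileBytes / split));
    }
}
```

With a 1 GB minimum, each 30G file yields 30 splits, so the job in question would get 10,680 splits in this sketch.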

So, in summary, can you please confirm that the actual number of splits that you object to
is 356*30G/32M splits, not 356*30G/2K?  Thanks.

> allow minimum split size configurable
> -------------------------------------
>
>          Key: HADOOP-93
>          URL: http://issues.apache.org/jira/browse/HADOOP-93
>      Project: Hadoop
>         Type: Bug
>     Reporter: Hairong Kuang
>  Attachments: hadoop-93.fix
>
> The current default split size is the size of a block (32M), and a SequenceFile sets it
> to SequenceFile.SYNC_INTERVAL (2K). We currently have a Map/Reduce application working on
> crawled documents. Its input data consists of 356 sequence files, each around 30G in
> size. The jobtracker takes forever to launch the job because it needs to generate 356*30G/2K
> map tasks!
> The proposed solution is to make the minimum split size configurable so that the programmer
> can control the number of tasks generated.


