From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Tue, 21 Mar 2006 17:39:00 +0000 (GMT)
Subject: [jira] Resolved: (HADOOP-93) allow minimum split size configurable
Message-ID: <1405669744.1142962740235.JavaMail.jira@ajax>
In-Reply-To: <1640550924.1142625680354.JavaMail.jira@ajax>

     [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]

Doug Cutting resolved HADOOP-93:
--------------------------------

    Resolution: Fixed
     Assign To: Doug Cutting

Okay, I have applied this.
For the record, patches are easier to apply if they are made from the root of the project. Also, new config properties should generally be added to hadoop-default.xml. Finally, the cast added in SequenceFileInputFormat was not required.

> allow minimum split size configurable
> -------------------------------------
>
>          Key: HADOOP-93
>          URL: http://issues.apache.org/jira/browse/HADOOP-93
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>     Versions: 0.1
>     Reporter: Hairong Kuang
>     Assignee: Doug Cutting
>      Fix For: 0.1
>  Attachments: hadoop-93.fix
>
> The current default split size is the size of a block (32M), and a SequenceFile sets it to SequenceFile.SYNC_INTERVAL (2K). We currently have a Map/Reduce application working on crawled documents. Its input data consists of 356 sequence files, each around 30G in size. The jobtracker takes forever to launch the job because it needs to generate 356*30G/2K map tasks!
> The proposed solution is to make the minimum split size configurable so that the programmer can control the number of tasks to generate.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
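
To put the 356*30G/2K figure from the issue description in perspective, here is a rough back-of-the-envelope sketch in Python. It is not Hadoop code: the function name estimate_map_tasks is hypothetical, it assumes one map task per split and binary units (1K = 1024 bytes), and the 32M alternative is only an illustrative choice of a larger minimum split size, not a value taken from the patch.

```python
# Rough estimate only; assumes one map task per split and binary units.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def estimate_map_tasks(num_files, file_size, split_size):
    """Estimate map tasks when each file is cut into splits of split_size bytes.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    # Integer ceiling division: a partial trailing split still needs a task.
    splits_per_file = -(-file_size // split_size)
    return num_files * splits_per_file


# Numbers from the issue: 356 sequence files of roughly 30G each.
tasks_at_2k = estimate_map_tasks(356, 30 * GB, 2 * KB)
# Illustrative alternative: a minimum split size equal to the 32M block size.
tasks_at_32m = estimate_map_tasks(356, 30 * GB, 32 * MB)

print(f"2K splits:  {tasks_at_2k:,} map tasks")   # 5,599,395,840
print(f"32M splits: {tasks_at_32m:,} map tasks")  # 341,760
```

Under these assumptions the 2K split size yields about 5.6 billion map tasks, while a 32M minimum brings the count down to a few hundred thousand, which is why a configurable minimum lets the programmer control the number of tasks.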