From: Harsh J
Date: Tue, 30 Nov 2010 13:51:31 +0530
Subject: Re: small files and number of mappers
To: common-user@hadoop.apache.org

Hey,

On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese wrote:
>
> Hey there,
> I am doing some tests and wondering what the best practices are for
> dealing with very small files which are continuously being generated
> (1 MB or even less).

Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

> I see that if I have hundreds of small files in HDFS, Hadoop will
> automatically create A LOT of map tasks to consume them. Each map task
> will take 10 seconds or less... I don't know if it's possible to change
> the number of map tasks from Java code using the new API (I know it can
> be done with the old one). I would like to do something like
> NumMapTasksCalculatedByHadoop * 0.3. This way, fewer map tasks would be
> instantiated and each would be working for more time.

Perhaps you need to use MultiFileInputFormat:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/

-- 
Harsh J
www.harshj.com
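[Editor's sketch] The combine approach above can be expressed with the new (org.apache.hadoop.mapreduce) API roughly as follows. This is an untested sketch: CombineTextInputFormat only appeared in later Hadoop releases (on the 0.20-era versions discussed in this thread, the old-API org.apache.hadoop.mapred.MultiFileInputFormat subclassed with your own record reader plays the same role), and the class and method names should be checked against your version.

```java
// Sketch: pack many small files into fewer map tasks via a combining
// input format, so each mapper runs long enough to amortize startup cost.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "small-files");
    job.setJarByClass(SmallFilesDriver.class);

    // Many small files per split, capped at 128 MB, instead of
    // the default one split (and one map task) per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

No mapper or reducer class is set here, so the identity mapper runs; a real job would add its own. This requires a Hadoop installation and cluster (or local mode) to run, so it is a configuration sketch rather than a standalone program.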
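[Editor's note] The payoff of combining is easy to estimate. With the default one-split-per-small-file behavior, N files mean N map tasks; packing them into splits of at most maxSplitBytes gives roughly ceil(totalBytes / maxSplitBytes) tasks. A toy calculation in plain Java, with hypothetical numbers (1000 files of ~1 MB each, 128 MB combined splits); real split counts can be slightly higher because combining also groups by node and rack locality:

```java
public class SplitMath {
  // Rough number of map tasks when small files are packed into combined
  // splits of at most maxSplitBytes each (ignores locality grouping,
  // which can add a few extra splits).
  static long combinedTasks(long numFiles, long avgFileBytes, long maxSplitBytes) {
    long totalBytes = numFiles * avgFileBytes;
    return (totalBytes + maxSplitBytes - 1) / maxSplitBytes; // ceiling division
  }

  public static void main(String[] args) {
    long numFiles = 1000;            // hypothetical: 1000 small files
    long avgFileBytes = 1L << 20;    // ~1 MB each
    long maxSplitBytes = 128L << 20; // 128 MB combined splits

    // Default FileInputFormat: one map task per small file.
    System.out.println("per-file tasks: " + numFiles);           // 1000
    System.out.println("combined tasks: " +
        combinedTasks(numFiles, avgFileBytes, maxSplitBytes));   // 8
  }
}
```

So the same 1000 files drop from 1000 ten-second map tasks to 8 longer-running ones.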