From: CubicDesign
Date: Thu, 26 Nov 2009 21:23:58 +0100
To: common-user@hadoop.apache.org
Subject: Re: Processing 10MB files in Hadoop

> Are the record processing steps bound by a local machine resource - cpu,
> disk io or other?

Some disk I/O, but not much compared with the CPU. Basically it is CPU bound, which is why each machine has 16 cores.

> What I often do when I have lots of small files to handle is use the
> NlineInputFormat,

Each file contains a complete, independent set of records. I cannot mix the data resulting from processing two different files.

---------

Ok.
I think I need to re-explain my problem :)

While running jobs on these small files, the computation time was almost 5 times longer than expected. It looks like the job was affected by the number of map tasks that I have (100). I don't know which parameters are best in my case (10MB files). I have zero reduce tasks.
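
To make the concern about the number of map tasks concrete, here is a rough back-of-the-envelope sketch. It assumes a map-only job whose tasks run in "waves" limited by the number of concurrently available map slots, with a fixed startup overhead per task. The function name, the slot counts, and all timings are hypothetical and chosen only for illustration; they are not measurements from this cluster.

```python
import math


def estimated_job_seconds(num_tasks, slots, work_seconds_per_task,
                          overhead_seconds_per_task):
    """Rough wall-clock estimate for a map-only job (hypothetical model).

    Tasks run in waves of at most `slots` concurrent tasks, and every task
    pays a fixed startup overhead on top of its real work.
    """
    if num_tasks <= 0 or slots <= 0:
        raise ValueError("num_tasks and slots must be positive")
    waves = math.ceil(num_tasks / slots)
    return waves * (work_seconds_per_task + overhead_seconds_per_task)


# Hypothetical numbers: 100 map tasks, 60 s of work and 5 s overhead each.
few_slots = estimated_job_seconds(100, slots=4,
                                  work_seconds_per_task=60,
                                  overhead_seconds_per_task=5)
many_slots = estimated_job_seconds(100, slots=16,
                                   work_seconds_per_task=60,
                                   overhead_seconds_per_task=5)
print(few_slots, many_slots)
```

Under this simplified model, the same 100 tasks take several times longer when only a few of them can run at once, so one thing worth checking is whether the number of concurrent map slots matches the 16 cores per machine.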