Subject: Re: how can i increase the number of mappers?
From: Wei Shung Chung
Date: Wed, 21 Mar 2012 10:12:49 -0700
To: "common-user@hadoop.apache.org"

Great info :)

Sent from my iPhone

On Mar 21, 2012, at 9:10 AM, Jane Wayne wrote:

> if anyone is facing the same problem, here's what i did. i took anil's
> advice to use NLineInputFormat (because that approach would scale out my
> mappers).
>
> however, i am using the new mapreduce package/API in hadoop v0.20.2. i
> notice that you cannot use NLineInputFormat from the old package/API
> (mapred).
>
> when i took a look at hadoop v1.0.1, there is an NLineInputFormat class
> for the new API. i simply copied and pasted this file into my project. i
> got 4 errors associated with import statements and annotations. when i
> removed the 2 import statements and the corresponding 2 annotations, the
> class compiled successfully.
> after this modification, the v1.0.1 NLineInputFormat runs fine on a
> cluster based on v0.20.2.
>
> one mini-problem solved, many more to go.
>
> thanks for the help.
>
> On Wed, Mar 21, 2012 at 3:33 AM, Jane Wayne wrote:
>
>> as i understand, that class does not exist for the new API in hadoop
>> v0.20.2 (which is what i am using). if i am mistaken, where is it?
>>
>> i am looking at hadoop v1.0.1, and there is an NLineInputFormat class. i
>> wonder if i can simply copy/paste it into my project.
>>
>> On Wed, Mar 21, 2012 at 2:37 AM, Anil Gupta wrote:
>>
>>> Have a look at the NLineInputFormat class in Hadoop. That class will
>>> solve your purpose.
>>>
>>> Best Regards,
>>> Anil
>>>
>>> On Mar 20, 2012, at 11:07 PM, Jane Wayne wrote:
>>>
>>>> i have a matrix that i am performing operations on. it is 10,000 rows
>>>> by 5,000 columns. the total size of the file is just under 30 MB. my
>>>> HDFS block size is set to 64 MB. from what i understand, the number of
>>>> mappers is roughly equal to the number of HDFS blocks used by the
>>>> input, i.e. if my input data spans 1 block, then only 1 mapper is
>>>> created; if it spans 2 blocks, then 2 mappers are created, and so on.
>>>>
>>>> so, with my single matrix file of just under 30 MB, the data won't
>>>> fill up a block, and as such, only 1 mapper will be assigned to it. is
>>>> this understanding correct?
>>>>
>>>> if so, what i want is for more than one mapper (let's say 10) to work
>>>> on the data, even though it sits in 1 block. my analysis (or
>>>> map/reduce job) is such that multiple mappers can work on different
>>>> parts of the matrix: for example, mapper 1 can work on the first 500
>>>> rows, mapper 2 on the next 500 rows, and so on. how can i set up
>>>> multiple mappers to work on a file that resides on only one block (or
>>>> a file whose size is smaller than the HDFS block size)?
>>>>
>>>> can i split the matrix into (let's say) 10 files? that would mean
>>>> 30 MB / 10 = 3 MB per file. then put each 3 MB file onto HDFS? will
>>>> this increase the chance of having multiple mappers work
>>>> simultaneously on the data/matrix? if i can increase the number of
>>>> mappers, i think (pretty sure) my implementation will improve in
>>>> speed linearly.
>>>>
>>>> any help is appreciated.
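
For reference, below is a minimal sketch of how the copied new-API
NLineInputFormat might be wired into a 0.20.2 job so that the 10,000-row
matrix from the original question is handled by roughly 10 map tasks.
MatrixRowsJob, MatrixRowMapper, the placeholder package "mypackage" for the
copied class, and the figure of 1,000 lines per split are illustrative
assumptions, not taken from the thread; the setNumLinesPerSplit call assumes
the class was copied unchanged from Hadoop 1.0.1.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import mypackage.NLineInputFormat; // placeholder: the class copied from hadoop 1.0.1

public class MatrixRowsJob {

  // placeholder mapper: each map() call receives one matrix row (one line of text)
  public static class MatrixRowMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // parse the row and do the real matrix work here; this just echoes it
      context.write(new Text(key.toString()), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "matrix-rows");
    job.setJarByClass(MatrixRowsJob.class);

    // use the copied new-API NLineInputFormat instead of the default input format
    job.setInputFormatClass(NLineInputFormat.class);

    // 10,000 rows / 1,000 lines per split ~= 10 map tasks, even though the
    // whole file fits inside a single 64 MB HDFS block. (assumes the copied
    // class kept the 1.0.1 setNumLinesPerSplit helper; otherwise set the
    // lines-per-map property that the copied class reads directly on the
    // configuration.)
    NLineInputFormat.setNumLinesPerSplit(job, 1000);

    job.setMapperClass(MatrixRowMapper.class);
    job.setNumReduceTasks(0); // map-only for this sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With NLineInputFormat the split boundaries follow line counts rather than
HDFS blocks, so the number of map tasks can be tuned independently of block
size; splitting the matrix into 10 physical files, as asked in the original
message, would also yield 10 mappers but is not necessary.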