From: maha
To: common-user@hadoop.apache.org
Subject: Re: Quick Question: LineSplit or BlockSplit
Date: Mon, 7 Feb 2011 21:20:17 -0800

Thanks, Ted. Then I have to write my own InputFormat to read a
block of lines per mapper.

NLineInputFormat didn't work for me; a working example of it would be
appreciated.

Thanks again,
Maha

On Feb 7, 2011, at 6:32 PM, Mark Kerzner wrote:

> Thanks!
> Mark
>
> On Mon, Feb 7, 2011 at 8:28 PM, Ted Dunning wrote:
>
>> That is quite doable. One way to do it is to make the max split size
>> quite small.
>>
>> On Mon, Feb 7, 2011 at 6:14 PM, Mark Kerzner wrote:
>>
>>> Ted,
>>>
>>> I am also interested in this answer.
>>>
>>> I put the name of a zip file on a line in an input file, and I want one
>>> mapper to read this line and start working on it (since it now knows the
>>> path in HDFS). Are you saying it's not doable?
>>>
>>> Thank you,
>>> Mark
>>>
>>> On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning wrote:
>>>
>>>> Option (1) isn't the way that things normally work. Besides, mappers
>>>> are called many times for each construction of a mapper.
>>>>
>>>> On Mon, Feb 7, 2011 at 3:38 PM, maha wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I would appreciate your thoughts on whether there is an effect on
>>>>> efficiency if:
>>>>>
>>>>> 1) Mappers were per line in a document
>>>>>
>>>>> or
>>>>>
>>>>> 2) Mappers were per block of lines in a document.
>>>>>
>>>>> The obvious difference I can see is that (1) has more mappers. Does
>>>>> that mean (1) will be slower because of scheduling time?
>>>>>
>>>>> Thank you,
>>>>> Maha
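
The trade-off raised in the thread (one mapper per line versus one mapper per block of lines) comes down to how many input splits are produced, since each split normally becomes one map task. The following is only a conceptual sketch in plain Java, not Hadoop's own API: the class name LineSplitSketch, the method splitByLines, and the zip file paths are hypothetical and exist purely for illustration. It groups lines into blocks of N and counts the resulting tasks.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only: plain Java, no Hadoop classes involved.
// All names here are hypothetical and chosen for illustration.
class LineSplitSketch {

    // Groups the input lines into consecutive blocks of at most
    // linesPerSplit lines. Each block stands in for the input that a
    // single map task would receive.
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            // Copy the sub-list so each split is independent of the input list.
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Ten made-up input lines, each naming a (hypothetical) zip file.
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            lines.add("/hypothetical/path/archive-" + i + ".zip");
        }
        // Option (1): one line per task.
        System.out.println("1 line per task:  " + splitByLines(lines, 1).size() + " tasks");
        // Option (2): a block of four lines per task.
        System.out.println("4 lines per task: " + splitByLines(lines, 4).size() + " tasks");
    }
}
```

With ten input lines, one line per split yields ten tasks and four lines per split yields three (4, 4, and 2 lines). Fewer, larger splits mean less per-task scheduling overhead, at the cost of less parallelism when each line represents a lot of work, as in the zip-file case above.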