From: Ted Dunning
Date: Fri, 18 Feb 2011 11:25:27 -0800
Subject: Re: Quick question
To: common-user@hadoop.apache.org
Cc: maha

The input is effectively split by lines, but under the covers the actual
splits are by byte. Each mapper scans forward from its specified start
offset to the beginning of the next line after that point. At the end, it
over-reads to the end of the line that is at or after the end of its
specified region. This can make the last split a bit smaller than the
others and the first a bit larger.

Practically speaking, however, your 2000-line file is extremely unlikely to
be split at all because it is so small.

On Fri, Feb 18, 2011 at 11:14 AM, maha wrote:

> Hi all,
>
> I want to check if the following statement is right:
>
> If I use TextInputFormat to process a text file with 2000 lines (each
> ending with \n) with 20 mappers. Then each map will have a sequence of
> COMPLETE LINES .
>
> In other words, the input is not split byte-wise but by lines.
>
> Is that right?
>
>
> Thank you,
> Maha
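To make the byte-split-then-align-to-lines behaviour described in the reply concrete, here is a rough Python sketch. It is not Hadoop's actual Java record reader, only a simulation under stated assumptions: the function names `read_split` and `make_splits` are hypothetical, the file is held in memory as bytes, and 20 splits are forced purely for illustration (as noted above, a file this small would normally not be split at all).

```python
import io


def make_splits(total: int, n: int) -> list[tuple[int, int]]:
    """Cut `total` bytes into at most `n` (start, length) ranges, ignoring line boundaries."""
    size = -(-total // n)  # ceiling division
    return [(s, min(size, total - s)) for s in range(0, total, size)]


def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the complete lines whose first byte lies in [start, start + length)."""
    end = start + length
    stream = io.BytesIO(data)
    pos = start
    if start != 0:
        # Back up one byte and discard everything through the next newline.
        # If a line begins exactly at `start`, only the preceding "\n" is
        # discarded, so that line is kept by this split.
        stream.seek(start - 1)
        pos = start - 1 + len(stream.readline())
    lines = []
    # Keep reading while the next line starts inside this split; the final
    # line may run past `end` (the over-read mentioned in the reply).
    while pos < end:
        line = stream.readline()
        if not line:
            break
        lines.append(line)
        pos += len(line)
    return lines


if __name__ == "__main__":
    data = b"".join(b"line %d\n" % i for i in range(2000))
    for start, length in make_splits(len(data), 20)[:3]:
        got = read_split(data, start, length)
        print(start, length, got[0], got[-1])
```

Each simulated mapper sees only whole lines, and taken together the splits cover every line exactly once, even though the split boundaries themselves fall at arbitrary byte offsets.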