Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <BEFFF5BE-A0DB-4AF8-BC39-FE844AA6CEC2@umail.ucsb.edu>
References: <BEFFF5BE-A0DB-4AF8-BC39-FE844AA6CEC2@umail.ucsb.edu>
From: Ted Dunning <tdunning@maprtech.com>
Date: Fri, 18 Feb 2011 11:25:27 -0800
Message-ID: <AANLkTi=K+K3+bsGvQw-w-HwvEqts_Es3A2kY-ThZZvGy@mail.gmail.com>
Subject: Re: Quick question
To: common-user@hadoop.apache.org
Cc: maha <maha@umail.ucsb.edu>
Content-Type: multipart/alternative; boundary=90e6ba6e8f4a266d59049c937b66

--90e6ba6e8f4a266d59049c937b66
Content-Type: text/plain; charset=ISO-8859-1

The input is effectively split by lines, but under the covers, the actual
splits are by byte range.  Each mapper except the first skips forward from
its specified start to the beginning of the next line.  At the end, it
over-reads to the end of the line that is at or after the end of its
specified region, so every line is read whole by exactly one mapper.  This
can make the last split a bit smaller than the others and the first a bit
larger.
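
A rough sketch of that boundary handling, in Python rather than Hadoop's
Java, may make it concrete.  The names read_split and make_splits are made
up for illustration and are not Hadoop APIs; the sketch assumes "\n" line
endings and a file held entirely in memory.

```python
from typing import List, Tuple


def make_splits(total: int, size: int) -> List[Tuple[int, int]]:
    """Cut [0, total) into byte ranges of at most `size` bytes (hypothetical helper)."""
    return [(s, min(s + size, total)) for s in range(0, total, size)]


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the complete lines owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Not the first split: skip to just past the next newline.
        # The skipped bytes are covered by the previous split's over-read.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # Read every line that begins at or before `end`, running past `end`
    # if needed to finish the last line.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl + 1
        lines.append(data[pos:stop])
        pos = stop
    return lines


# Four lines, 17 bytes, cut into 5-byte ranges that ignore line boundaries.
data = b"aa\nbbbb\nc\ndddddd\n"
splits = make_splits(len(data), 5)
per_split = [read_split(data, s, e) for s, e in splits]
```

Here the ranges fall in the middle of lines, yet each entry of per_split
holds only whole lines, and concatenating the entries gives back every line
of the file exactly once.  The first range ends up with more bytes than it
was assigned and the trailing ranges with fewer.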

Practically speaking, however, your 2000-line file is extremely unlikely to
be split at all because it is so small.
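
As a back-of-the-envelope check, here is the arithmetic in Python.  The
numbers are assumptions, not measurements: roughly 100 bytes per line and a
64 MB block size.  The split-size rule shown, max(min, min(max, block)), is
how I recall the newer mapreduce FileInputFormat computing it; the exact
rule depends on the Hadoop version and job settings, so treat this as a
sketch only.

```python
def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Split size as max(min_size, min(max_size, block_size)) (assumed rule)."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size: int, split_size: int) -> int:
    """Approximate number of splits: ceiling division, at least one."""
    return max(1, -(-file_size // split_size))


BYTES_PER_LINE = 100            # assumption for illustration
BLOCK_SIZE = 64 * 1024 * 1024   # assumed 64 MB block size

file_size = 2000 * BYTES_PER_LINE
split_size = compute_split_size(BLOCK_SIZE, min_size=1, max_size=2**63 - 1)
n_splits = count_splits(file_size, split_size)
```

Under these assumptions the file is about 200 KB against a 64 MB split
size, so n_splits comes out as a single split.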

On Fri, Feb 18, 2011 at 11:14 AM, maha <maha@umail.ucsb.edu> wrote:

> Hi all,
>
>  I want to check if the following statement is right:
>
>  If I use TextInputFormat to process a text file with 2000 lines (each
> ending with \n) with 20 mappers, then each map will have a sequence of
> COMPLETE LINES.
>
> In other words,  the input is not split byte-wise but by lines.
>
> Is that right?
>
>
> Thank you,
> Maha

--90e6ba6e8f4a266d59049c937b66--