From: Ted Dunning
Date: Fri, 17 Dec 2010 09:30:27 -0800
Subject: Re: InputFormat for a big file
To: common-user@hadoop.apache.org

a) This is a small file by Hadoop standards. You should be able to process it by conventional methods on a single machine in about the same time it takes to start a Hadoop job that does nothing at all.

b) Reading a single line at a time is not as inefficient as you might think. If you write a mapper that reads each line, converts it to an integer, and outputs a key consisting of a constant integer together with the number you read, the mapper will process the data reasonably quickly. If you then add a combiner and a reducer that sum up the numbers in each list, the amount of data spilled will be nearly zero. A sketch of this approach follows below the quoted message.

On Fri, Dec 17, 2010 at 7:58 AM, madhu phatak wrote:

> Hi,
> I have a very large file of size 1.4 GB. Each line of the file is a number.
> I want to find the sum of all those numbers.
> I wanted to use NLineInputFormat as the InputFormat, but it sends only one line
> to the mapper, which is very inefficient.
> So can you guide me on writing an InputFormat that splits the file
> into multiple splits, so that each mapper can read multiple
> lines from its split?
>
> Regards
> Madhukar
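Since no code was posted in the thread, here is a minimal sketch of the mapper/combiner/reducer approach described in (b), written against the org.apache.hadoop.mapreduce API of roughly this era. The class names (SumNumbers, SumMapper, SumReducer) and the driver wiring are illustrative assumptions, not anything from the original messages; the default TextInputFormat already divides the file into block-sized splits, each feeding many lines to one mapper, so no custom InputFormat is needed.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SumNumbers {

      // Every line is parsed as a long and emitted under one constant key,
      // so all partial sums funnel into a single reduce group.
      public static class SumMapper
          extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private static final IntWritable CONSTANT_KEY = new IntWritable(1);
        private final LongWritable number = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String s = line.toString().trim();
          if (!s.isEmpty()) {
            // Assumes every non-empty line is a valid integer, as in the
            // original question; a malformed line would throw here.
            number.set(Long.parseLong(s));
            context.write(CONSTANT_KEY, number);
          }
        }
      }

      // Used as both combiner and reducer: sums are folded map-side first,
      // which is why almost nothing gets spilled or shuffled.
      public static class SumReducer
          extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) {
            sum += v.get();
          }
          context.write(key, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sum numbers");
        job.setJarByClass(SumNumbers.class);
        job.setInputFormatClass(TextInputFormat.class); // default splits, many lines per mapper
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Assuming it is packaged into a jar, this would run as something like "hadoop jar sum.jar SumNumbers /input/numbers.txt /output/sum" (paths hypothetical). Registering SumReducer as the combiner is the key point of (b): each map task emits one partial sum per split rather than one record per line, so the single reducer only adds up a handful of values.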