hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: InputFormat for a big file
Date Fri, 17 Dec 2010 17:30:27 GMT
a) This is a small file by Hadoop standards. You should be able to process
it by conventional methods on a single machine in about the same time it
takes to start a Hadoop job that does nothing at all.
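Point (a) can be sketched in a few lines in any language; here is a minimal single-machine version. This assumes one integer per line in a plain-text file (the helper name and file path are illustrative, not from the thread):

```python
def sum_file(path):
    """Sum a file of one-integer-per-line records on a single machine.

    Streams the file line by line, so memory use stays constant
    even for a multi-gigabyte input.
    """
    total = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                total += int(line)
    return total
```

For a 1.4 GB file this is dominated by sequential disk read, which is exactly why it tends to beat the fixed startup cost of a small Hadoop job.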

b) Reading a single line at a time is not as inefficient as you might think.
If you write a mapper that reads each line, converts it to an integer, and
outputs a key consisting of a constant integer along with the value you read,
the mapper will process the data reasonably quickly. If you add a combiner
and a reducer that sum the numbers in each list, the amount of data spilled
will be nearly zero.
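The dataflow in (b) can be simulated in memory to see why the combiner keeps spill near zero. This is a sketch of the logic only, not a runnable Hadoop job; all function names here are illustrative:

```python
CONSTANT_KEY = 0  # every mapper emits the same key, so one reduce group sees all partial sums

def mapper(line):
    # parse each input line and emit (constant key, integer value)
    yield CONSTANT_KEY, int(line)

def combiner(key, values):
    # runs on each mapper's output before the shuffle; collapsing many
    # records into one partial sum is what keeps spilled data near zero
    yield key, sum(values)

def reducer(key, values):
    # adds the per-mapper partial sums into the final total
    yield key, sum(values)

def run(lines, num_splits=3):
    # simulate splitting the input across several mappers
    splits = [lines[i::num_splits] for i in range(num_splits)]
    partials = []
    for split in splits:
        mapped = [kv for line in split for kv in mapper(line)]
        partials.extend(combiner(CONSTANT_KEY, [v for _, v in mapped]))
    (_, total), = reducer(CONSTANT_KEY, [v for _, v in partials])
    return total
```

Note that each simulated mapper forwards only a single (key, partial-sum) pair to the reducer, regardless of how many lines it read.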

On Fri, Dec 17, 2010 at 7:58 AM, madhu phatak <phatak.dev@gmail.com> wrote:

> Hi,
> I have a very large file of size 1.4 GB. Each line of the file is a number.
> I want to find the sum of all those numbers.
> I wanted to use NLineInputFormat as the InputFormat, but it sends only one
> line to the mapper, which is very inefficient.
> Can you guide me in writing an InputFormat which splits the file
> into multiple splits, so that each mapper can read multiple
> lines from each split?
> Regards,
> Madhukar
