hadoop-common-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: current line number as key?
Date Wed, 18 May 2011 19:18:47 GMT
You are correct that there is no easy and efficient way to do this.

You could create a new InputFormat, derived from FileInputFormat, that prevents files from
being split, and then have a RecordReader that keeps track of line numbers.  But then each
file is read by only one mapper, so you lose parallelism within a file.
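
A minimal sketch of the bookkeeping such a RecordReader would do, written in plain Java so
it runs without Hadoop on the classpath. The class name LineNumberedReader and the method
pick are made up for illustration and are not Hadoop APIs. In a real job, the counter below
would live inside a RecordReader, and the InputFormat would return false from isSplitable so
that one reader sees the whole file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: emulates a reader over one unsplit file.
public class LineNumberedReader {

    // Reads the whole input and returns the requested 1-based lines,
    // keyed by line number (the key a mapper would receive).
    static Map<Long, String> pick(BufferedReader in, Set<Long> wanted) {
        Map<Long, String> out = new LinkedHashMap<>();
        long lineNo = 0; // valid only because this reader sees the file from byte 0
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (wanted.contains(lineNo)) {
                    out.put(lineNo, line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new StringReader("a\nb\nc\n"));
        Set<Long> wanted = new HashSet<>(Arrays.asList(1L, 3L));
        System.out.println(pick(in, wanted)); // prints {1=a, 3=c}
    }
}
```

The counter is only correct because nothing else reads the same file, which is exactly why
this approach costs you the parallelism.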

Alternatively, you could assume that the split is going to be done deterministically and do
two passes: one where you count the number of lines in each partition, and a second that then
assigns the line numbers based on the output of the first.  But that requires two map passes.
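
The two-pass idea can be sketched in plain Java, simulating splits as byte ranges over an
in-memory array rather than using Hadoop itself. Everything here (the class
TwoPassLineNumbers, its method names, the 8-byte split size) is an assumption made for
illustration. It also assumes that a line belongs to the split containing its first byte,
and that both passes see identical split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Hadoop code: splits are byte ranges [start, end).
public class TwoPassLineNumbers {

    // A line starts at offset 0 or right after a newline byte.
    static boolean isLineStart(byte[] data, int pos) {
        return pos == 0 || data[pos - 1] == '\n';
    }

    // Pass 1: count the lines whose first byte falls inside this split.
    static long countLines(byte[] data, int start, int end) {
        long n = 0;
        for (int p = start; p < end; p++) {
            if (isLineStart(data, p)) n++;
        }
        return n;
    }

    // Between passes: a running sum turns per-split counts into the
    // global number of the first line in each split (1-based).
    static long[] firstLineNumbers(long[] counts) {
        long[] first = new long[counts.length];
        long running = 1;
        for (int i = 0; i < counts.length; i++) {
            first[i] = running;
            running += counts[i];
        }
        return first;
    }

    // Pass 2: emit "lineNo<TAB>text" for each line starting in this split,
    // reading past the split end if the line crosses the boundary.
    static List<String> numberLines(byte[] data, int start, int end, long firstLine) {
        List<String> out = new ArrayList<>();
        long lineNo = firstLine;
        for (int p = start; p < end; p++) {
            if (!isLineStart(data, p)) continue;
            int q = p;
            while (q < data.length && data[q] != '\n') q++;
            out.add(lineNo + "\t" + new String(data, p, q - p, StandardCharsets.UTF_8));
            lineNo++;
        }
        return out;
    }

    // Runs both passes over fixed-size splits and concatenates the output.
    static List<String> numberAll(byte[] data, int splitSize) {
        int splits = (data.length + splitSize - 1) / splitSize;
        long[] counts = new long[splits];
        for (int i = 0; i < splits; i++) {
            counts[i] = countLines(data, i * splitSize, Math.min(data.length, (i + 1) * splitSize));
        }
        long[] first = firstLineNumbers(counts);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < splits; i++) {
            out.addAll(numberLines(data, i * splitSize,
                    Math.min(data.length, (i + 1) * splitSize), first[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (String s : numberAll(data, 8)) {
            System.out.println(s);
        }
    }
}
```

With 8-byte splits the per-split counts are 2, 1, 1, so the splits start at global lines 1,
3 and 4. In a real job the counts from the first pass would have to be handed to the second
pass somehow, for example through a small side file.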

--Bobby Evans

On 5/18/11 1:53 PM, "Alexandra Anghelescu" <axanghelescu@gmail.com> wrote:


It is hard to pick up certain lines of a text file - globally, I mean.
Remember that the file is split according to its size (byte boundaries), not
lines. So it is possible to keep track of the lines inside a split, but
globally for the whole file, assuming it is split among map tasks, I don't
think it is possible. I am new to Hadoop, but that is my take on it.


On Wed, May 18, 2011 at 2:41 PM, bnonymous <libei.twer@gmail.com> wrote:

> Hello,
> I'm trying to pick up certain lines of a text file (say, the 1st and 110th
> lines of a file with 10^10 lines). I need an InputFormat which gives the
> Mapper the line number as the key.
> I tried to implement a RecordReader, but I can't get line information from
> the InputSplit.
> Any solution to this?
> Thanks in advance!
> --
> View this message in context:
> http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
