hadoop-common-user mailing list archives

From Jonathan Coveney <jcove...@gmail.com>
Subject Re: current line number as key?
Date Wed, 18 May 2011 19:16:19 GMT
To the best of my knowledge, the only way to do this is if you have
fixed-width columns.

Think about it this way: as Alexandra mentioned, you only get the byte
offset. If you split 1 file among 50 mappers, they each have their offset,
but they have no idea what that offset means with respect to the other
splits, as they do not know how many lines came before. Finding lines
inherently involves a full scan, unless a) the width is fixed or b) you run a
job beforehand to explicitly put the line number in the document.
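
One way the preliminary job in b) could be organised is sketched below: a
first pass counts the lines in each split, and a running sum then gives each
split the global number of its first line. This is only a sketch in plain
Java. SplitLineOffsets and its methods are hypothetical names, not part of
Hadoop, and it assumes the per-split counts are already collected in file
order.

```java
// Hypothetical helper, not a Hadoop class. It assumes a first pass has
// already counted the lines in every split, in file order.
final class SplitLineOffsets {

    // Returns, for each split, the 1-based global number of its first line.
    static long[] firstLineNumbers(long[] lineCounts) {
        long[] first = new long[lineCounts.length];
        long next = 1; // global line numbers start at 1
        for (int i = 0; i < lineCounts.length; i++) {
            first[i] = next;
            next += lineCounts[i]; // lines before the following split
        }
        return first;
    }

    // Maps a 0-based line index inside a split to a global line number.
    static long globalLineNumber(long[] first, int splitIndex, long localLine) {
        return first[splitIndex] + localLine;
    }
}
```

The counting pass is still a full scan of the file, which is the point made
above: without fixed widths, the scan cannot be avoided, only done once up
front.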

I would think about what you want to do, and whether or not it is possible
to avoid making it line dependent, or if you can make each row a fixed
number of bytes...
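
To illustrate why fixed-size rows help: if every row occupies the same
number of bytes, the line number follows from the byte offset by simple
arithmetic, with no scan. The sketch below is plain Java, not Hadoop code.
FixedWidthLines and the recordBytes parameter are hypothetical, and
recordBytes is assumed to include the trailing newline.

```java
// Hypothetical helper, not a Hadoop class. Assumes every row is exactly
// recordBytes long, including its line terminator.
final class FixedWidthLines {

    // 1-based line number of the row that starts at byteOffset.
    static long lineNumberAt(long byteOffset, int recordBytes) {
        if (recordBytes <= 0) {
            throw new IllegalArgumentException("recordBytes must be positive");
        }
        if (byteOffset < 0 || byteOffset % recordBytes != 0) {
            throw new IllegalArgumentException("offset is not a row boundary");
        }
        return byteOffset / recordBytes + 1;
    }

    // Byte offset at which the given 1-based line starts.
    static long offsetOfLine(long lineNumber, int recordBytes) {
        if (lineNumber < 1 || recordBytes <= 0) {
            throw new IllegalArgumentException("invalid line number or width");
        }
        return (lineNumber - 1) * recordBytes;
    }
}
```

With 20-byte rows, for example, the 110th line starts at byte 2180, so a
reader could seek straight to it, and a mapper handed offset 2180 knows it is
on line 110.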

2011/5/18 Alexandra Anghelescu <axanghelescu@gmail.com>

> Hi,
>
> It is hard to pick out certain lines of a text file, globally I mean.
> Remember that the file is split according to its size (byte boundaries), not
> lines. So it is possible to keep track of the lines inside a split, but
> globally for the whole file, assuming it is split among map tasks, I don't
> think it is possible. I am new to Hadoop, but that is my take on it.
>
> Alexandra
>
> On Wed, May 18, 2011 at 2:41 PM, bnonymous <libei.twer@gmail.com> wrote:
>
> >
> > Hello,
> >
> > I'm trying to pick up certain lines of a text file (say the 1st and 110th
> > lines of a file with 10^10 lines). I need an InputFormat which gives the
> > Mapper the line number as the key.
> >
> > I tried to implement RecordReader, but I can't get line information from
> > InputSplit.
> >
> > Any solution to this???
> >
> > Thanks in advance!!!!!!!
> > --
> > View this message in context:
> > http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
