hadoop-common-user mailing list archives

From Mark question <markq2...@gmail.com>
Subject Re: current line number as key?
Date Sun, 22 May 2011 01:58:14 GMT
What if you run a MapReduce job to generate a SequenceFile from your
text file, where the key is the line number and the value is the whole line?
Then, for the second job, the splits are done record-wise, so each mapper
gets a split/block of records [<lineNumber><line>].

~Cheers,
Mark
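
Editor's sketch: the first job would still need global line numbers, and the
two-pass approach quoted below is one way to get them. The plain-Java
simulation below is only an illustration under stated assumptions: it uses no
Hadoop API, the class and method names (GlobalLineNumbers, countLines,
startingLineNumbers, numberLines) are hypothetical, and each "split" is
assumed to hold whole lines and to be visited in a deterministic file order.
In a real job the splits fall on byte boundaries, so the record reader decides
which split owns a line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: simulates the two-pass idea in memory.
final class GlobalLineNumbers {

    // Pass 1: count the lines in each split, in file order.
    static long[] countLines(List<List<String>> splits) {
        long[] counts = new long[splits.size()];
        for (int i = 0; i < splits.size(); i++) {
            counts[i] = splits.get(i).size();
        }
        return counts;
    }

    // Between passes: a prefix sum turns per-split counts into the
    // 1-based global line number at which each split starts.
    static long[] startingLineNumbers(long[] counts) {
        long[] starts = new long[counts.length];
        long next = 1;
        for (int i = 0; i < counts.length; i++) {
            starts[i] = next;
            next += counts[i];
        }
        return starts;
    }

    // Pass 2: emit "<lineNumber>\t<line>" for one split, given its start.
    static List<String> numberLines(List<String> split, long start) {
        List<String> out = new ArrayList<>();
        long n = start;
        for (String line : split) {
            out.add(n + "\t" + line);
            n++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Three assumed splits of one file, already cut on line boundaries.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d"),
                Arrays.asList("e", "f"));
        long[] starts = startingLineNumbers(countLines(splits));
        for (int i = 0; i < splits.size(); i++) {
            for (String record : numberLines(splits.get(i), starts[i])) {
                System.out.println(record);
            }
        }
    }
}
```

Only the small table of per-split counts has to be shared between the two
passes, so each split can still be numbered independently in the second pass.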

On Wed, May 18, 2011 at 12:18 PM, Robert Evans <evans@yahoo-inc.com> wrote:

> You are correct that there is no easy and efficient way to do this.
>
> You could create a new InputFormat that derives from FileInputFormat that
> makes it so the files do not split, and then have a RecordReader that keeps
> track of line numbers.  But then each file is read by only one mapper.
>
> Alternatively you could assume that the split is going to be done
> deterministically and do two passes: one where you count the number of lines
> in each partition, and a second that then assigns the line numbers based on
> the output from the first.  But that requires two map passes.
>
> --Bobby Evans
>
>
> On 5/18/11 1:53 PM, "Alexandra Anghelescu" <axanghelescu@gmail.com> wrote:
>
> Hi,
>
> It is hard to pick up certain lines of a text file - globally, I mean.
> Remember that the file is split according to its size (byte boundaries), not
> lines. So it is possible to keep track of the lines inside a split, but
> globally for the whole file, assuming it is split among map tasks, I don't
> think it is possible. I am new to Hadoop, but that is my take on it.
>
> Alexandra
>
> On Wed, May 18, 2011 at 2:41 PM, bnonymous <libei.twer@gmail.com> wrote:
>
> >
> > Hello,
> >
> > I'm trying to pick up certain lines of a text file (say the 1st and 110th
> > lines of a file with 10^10 lines). I need an InputFormat which gives the
> > Mapper the line number as the key.
> >
> > I tried to implement RecordReader, but I can't get line information from
> > InputSplit.
> >
> > Any solution to this???
> >
> > Thanks in advance!!!!!!!
> > --
> > View this message in context:
> >
> http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
>
