hadoop-common-user mailing list archives

From Soren Flexner <sflex...@gmail.com>
Subject Re: Get the actual line number from inputformat in the mapper
Date Thu, 28 Apr 2011 05:59:33 GMT
  This is a bit out of left field, but you could add a 'key' field at the
beginning of each record (set to the record number), and then use
KeyValueTextInputFormat. Your keys are then the record numbers.

  This might be prohibitive if your data is already on HDFS, and you have a
lot of it, since adding the counter key and copying the new dataset to HDFS
might be a significant time investment in itself.
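
  A minimal sketch of that preprocessing step, assuming plain-text records,
one per line, and a key/value input format such as Hadoop's
KeyValueTextInputFormat, which by default splits each line at the first tab.
The class and method names (RecordNumberer, numberLine, numberLines) are
hypothetical, and the sketch works in memory; a real dataset would be
streamed instead:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: prefixes each record with its number.
class RecordNumberer {
    // Builds "<recordNumber>\t<line>", so the part before the first tab can
    // later be read back as the key.
    static String numberLine(long recordNumber, String line) {
        return recordNumber + "\t" + line;
    }

    // Numbers the records from 0 in input order.
    static List<String> numberLines(List<String> lines) {
        List<String> numbered = new ArrayList<>(lines.size());
        long recordNumber = 0;
        for (String line : lines) {
            numbered.add(numberLine(recordNumber, line));
            recordNumber++;
        }
        return numbered;
    }

    public static void main(String[] args) {
        for (String s : numberLines(Arrays.asList("alpha", "beta"))) {
            System.out.println(s);
        }
    }
}
```

  Keys read back this way arrive as text, so the mapper would still need to
parse them into a number.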


On Wed, Apr 27, 2011 at 9:38 PM, Harsh J <harsh@cloudera.com> wrote:

> Hello Pei,
> On Thu, Apr 28, 2011 at 6:58 AM, Pei HE <peihe0@gmail.com> wrote:
> > The key that TextInputFormat generates is the byte offset in the
> > file. So, how can I find the global line offset in the mapper?
> This is not achievable unless you have fixed byte records (in which
> case you should be able to divide and find). You can try pre-building
> and maintaining an index otherwise, but looking up these forms of
> structure for every record may get slow.
> Sometimes it's also all right to process complete documents in mappers
> instead of letting them be split across mappers, as a solution (your
> task's input record counter could then be used as the line number).
> --
> Harsh J
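
The two lookups Harsh describes can be sketched roughly as follows. This is
only an illustration: the class and method names are hypothetical, the
fixed-length case assumes every record (including its newline) occupies
exactly the same number of bytes, and the index case assumes the line-start
offsets fit in memory, which may not hold for large files:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: maps a byte offset to a line number.
class LineNumberLookup {
    // Fixed-length records: the 0-based line number is offset / record length.
    static long fromFixedLength(long byteOffset, int recordLengthBytes) {
        if (recordLengthBytes <= 0 || byteOffset < 0
                || byteOffset % recordLengthBytes != 0) {
            throw new IllegalArgumentException("offset is not on a record boundary");
        }
        return byteOffset / recordLengthBytes;
    }

    // Variable-length records: pre-build a sorted index of line-start offsets.
    static long[] buildLineStartIndex(byte[] data) {
        List<Long> starts = new ArrayList<>();
        if (data.length > 0) {
            starts.add(0L);
        }
        for (int i = 0; i < data.length; i++) {
            // A new line starts after each newline, unless it ends the data.
            if (data[i] == '\n' && i + 1 < data.length) {
                starts.add((long) (i + 1));
            }
        }
        long[] index = new long[starts.size()];
        for (int i = 0; i < index.length; i++) {
            index[i] = starts.get(i);
        }
        return index;
    }

    // Binary search in the index; the position found is the 0-based line number.
    static long fromIndex(long[] lineStarts, long byteOffset) {
        int pos = Arrays.binarySearch(lineStarts, byteOffset);
        if (pos < 0) {
            throw new IllegalArgumentException("offset is not a line start");
        }
        return pos;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncde\nf".getBytes(StandardCharsets.US_ASCII);
        long[] index = buildLineStartIndex(data);
        System.out.println(fromIndex(index, 3L));
        System.out.println(fromFixedLength(160L, 80));
    }
}
```

The index lookup costs a binary search per record, which is the per-record
overhead Harsh warns about.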
