hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Get the actual line number from inputformat in the mapper
Date Thu, 28 Apr 2011 04:38:31 GMT
Hello Pei,

On Thu, Apr 28, 2011 at 6:58 AM, Pei HE <peihe0@gmail.com> wrote:
> The key, which TextInputFormat generates, is the bytes offset in the
> file. So, how can I find the global line offset in the mapper?

This is not achievable unless your records have a fixed byte length (in
which case you can divide the byte offset by the record length to get the
line number). Otherwise, you can try pre-building and maintaining an index
of line offsets, but looking up such a structure for every record may get
slow.
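
To make the two options concrete, here is a rough sketch in plain Java, with no Hadoop dependencies. The class and method names (LineNumberLookup, fixedLengthLineNumber, buildLineStartIndex, indexedLineNumber) are made up for illustration, line numbers are zero-based, and the index is assumed to be small enough to hold in memory as a sorted array:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineNumberLookup {

    // Fixed-length records: every record occupies exactly recordLength bytes
    // (terminator included), so the line number is a plain division.
    static long fixedLengthLineNumber(long byteOffset, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        if (byteOffset % recordLength != 0) {
            throw new IllegalArgumentException("offset is not on a record boundary");
        }
        return byteOffset / recordLength;
    }

    // Variable-length records: pre-build a sorted index holding the byte
    // offset at which each line starts. Assumes '\n' terminates a line.
    static long[] buildLineStartIndex(byte[] data) {
        List<Long> starts = new ArrayList<>();
        if (data.length > 0) {
            starts.add(0L);
        }
        for (int i = 0; i < data.length; i++) {
            // A new line starts right after each '\n', unless the file ends there.
            if (data[i] == '\n' && i + 1 < data.length) {
                starts.add((long) (i + 1));
            }
        }
        long[] index = new long[starts.size()];
        for (int i = 0; i < index.length; i++) {
            index[i] = starts.get(i);
        }
        return index;
    }

    // Binary search: the position of the offset in the index is the line number.
    static long indexedLineNumber(long[] lineStarts, long byteOffset) {
        int pos = Arrays.binarySearch(lineStarts, byteOffset);
        if (pos < 0) {
            throw new IllegalArgumentException("offset is not the start of a line");
        }
        return pos;
    }
}
```

The index lookup costs one binary search per record, and the index itself has to be built in a separate pass and shipped to every task, which is where the slowdown mentioned above would come from.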

Sometimes it's also alright, as a solution, to process complete documents
in mappers instead of letting them be split across tasks (your task's input
record counter could then be used as the line number).
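
As a rough illustration of the counter idea, the plain-Java sketch below stands in for a single task that sees every record of one unsplit file in order. The class and method names are hypothetical and no Hadoop classes are used. In an actual job this would presumably mean an input format whose isSplitable returns false, plus a counter field in the mapper:

```java
import java.util.ArrayList;
import java.util.List;

public class WholeDocumentLineCounter {

    // Tags each record with a zero-based line number. This is only valid
    // when the records of one whole file arrive in order within one task.
    static List<String> numberRecords(Iterable<String> records) {
        List<String> numbered = new ArrayList<>();
        long lineNumber = 0; // plays the role of the task's input record counter
        for (String record : records) {
            numbered.add(lineNumber + "\t" + record);
            lineNumber++;
        }
        return numbered;
    }
}
```

The trade-off is parallelism: a file that is not split is read by a single task, so this suits many moderately sized files better than one very large file.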

Harsh J
