flink-user mailing list archives

From Fridtjof Sander <fsan...@mailbox.tu-berlin.de>
Subject Re: Possibility to get the line numbers?
Date Thu, 04 Feb 2016 09:25:13 GMT
I had that problem/question some time ago, too.

The quick fix is to just put the line number in the line itself. Go for it.
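A minimal sketch of that quick fix, in plain Python rather than Flink. The function names and the tab separator are my own assumptions for illustration: the file is numbered once, sequentially, before the job runs, and the mapper then only has to split the index back off.

```python
def add_line_numbers(lines):
    """One-off sequential preprocessing: prefix each line with its
    0-based row index, separated by a tab (separator is an assumption)."""
    return [f"{i}\t{line}" for i, line in enumerate(lines)]


def parse_numbered_line(record):
    """What a mapper would do: split the row index off the record."""
    index, _, payload = record.partition("\t")
    return int(index), payload


matrix_rows = ["1 2 3", "4 5 6", "7 8 9"]
numbered_rows = add_line_numbers(matrix_rows)

# The records can now be processed in any order, on any worker,
# because every record carries its own row number.
parsed_rows = [parse_numbered_line(r) for r in reversed(numbered_rows)]
```

The row number survives any later repartitioning or reordering, which is why no ordering guarantee from the framework is needed.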

However, we worked out a solution for another distributed processing 
system that did the following: read each partition, count the lines, 
broadcast a map "partition->lineCount", then re-read the data and attach 
the line numbers.
This is basically how a distributed zipWithIndex works, which is 
available in Flink too.
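The two passes can be sketched roughly like this. It is a single-process simulation in plain Python, not Flink code: the partitions are just lists, the "broadcast" is an ordinary dict, and all function names are made up for illustration.

```python
from itertools import accumulate


def count_lines(partitions):
    """Pass 1: every partition reports how many lines it holds."""
    return {pid: len(lines) for pid, lines in enumerate(partitions)}


def start_offsets(line_counts):
    """Turn the broadcast partition->lineCount map into
    partition->number of the first line in that partition."""
    pids = sorted(line_counts)
    running = [0] + list(accumulate(line_counts[p] for p in pids))
    return dict(zip(pids, running))


def attach_line_numbers(partitions, offsets):
    """Pass 2: re-read each partition and number its lines,
    starting from that partition's offset."""
    return [
        [(offsets[pid] + i, line) for i, line in enumerate(lines)]
        for pid, lines in enumerate(partitions)
    ]


file_partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
line_counts = count_lines(file_partitions)
offsets = start_offsets(line_counts)
numbered_partitions = attach_line_numbers(file_partitions, offsets)
```

The sketch gets the ordering for free because the same lists are read twice; in a real job the second pass has to see the same partition contents in the same order as the first.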

But:

That only works if both mapPartitions passes read the data in the same 
order and if both see the same partition boundaries.
I don't know if you can get that guarantee in Flink without a 
range-partition and sortPartition on the byte offset.
Doing that would work (I think), but it would add significant overhead 
that can be completely avoided by adding the line numbers into the lines 
in the first place.
I think it's just not worth it.

On 4 February 2016 00:56:43 CET, Fabian Hueske <fhueske@gmail.com> wrote:

    Hi Anastasiia,

    this is difficult because the input is usually read in parallel,
    i.e., an input file is split into several blocks which are
    independently read and processed by different threads (possibly on
    different machines). So it is difficult to have a sequential row
    number.

    If all rows have the same length (number of bytes), you could
    compute the row number from the byte offset. If this is not given,
    you can only read the input sequentially.
    Flink does not provide InputFormats for this. So you would need to
    implement a custom InputFormat.

    You can also keep track of the number of elements that you processed
    in a Mapper, but this is probably not what you are looking for.

    Best,
    Fabian

    2016-02-04 0:37 GMT+01:00 Анастасія Баша <nastja.basha@mail.ru>:

        Is there a way to get the current line number (or generally the
        number of the element currently being processed) inside a mapper?
        The example is a matrix you read line by line from the file
        and need both the row and the column numbers. The column number is
        easy to get, but how to know the row number?
        Thanks a lot in advance,
        Anastasiia


