flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chesnay Schepler <ches...@apache.org>
Subject Re: Sequential/ordered map
Date Thu, 05 Jan 2017 11:39:09 GMT
So given an ordered list of texts, for each word find the earliest text 
it appears in?

As Kostas said, when splitting the text into words wrap them in a Tuple2 
containing the word
and text index and group them by the word.

As far as i can tell the next step would be a simple reduce that finds 
the smallest
index; for this there is a convenience minBy() transformation.

Regards,
Chesnay

On 05.01.2017 12:25, Kostas Kloudas wrote:
> Hi Sebastian,
>
> If T_1 must be processed before T_i, i>1, then you cannot parallelize the algorithm.
>
> If this is not a restriction, then you could;
> 1) split the text in words and also attach the id of the text they appear in,
> 2) do a groupBy that will send all the same words to the same node,
> 3) keep a “per-word” state with the word and the index of the text,
> 4) when a new word arrives you should check if the word already exists in the state.
>
> Regards,
> Kostas
>
>> On Jan 5, 2017, at 11:51 AM, Sebastian Neef <gehaxelt@mailbox.tu-berlin.de>
wrote:
>>
>> Hello,
>>
>> I'd like to implement an algorithm which doesn't really look
>> parallelizable to me, but maybe there's a way around it:
>>
>> In general the algorithm looks like this:
>>
>> 1. Take a list of texts T_1 ... T_n
>> 2. For every text T_i (i > 1) do
>> 2.1: Split text into a list of words W_1 ... W_m
>> 2.2: For every word W_j do:
>> 2.2.1.: Check if word already existed in a prior text T_k ( i > k )
>> 2.2.2.: If so, mark word W_j with k
>> 2.2.3.: Else mark word W_j with i
>> 3. Do something with texts based on marked words...
>>
>> I have a DataSet<Text> with all texts T_1...T_n.
>>
>> As far as I understand, I cannot simply .map(...) the DataSet, because
>> T_i's calculation is based on previous results (i.e. T_(i-1)).
>>
>> My current solution would be to set the  parallelism to 1.
>>
>> - Is there an elegant way to parallelize this algorithm?
>> - Does setting parallelism=1 guarantee a specific order of the DataSet?
>> - Is there a way to check if an element exists in a DataSet? E.g.
>> DataSet<>.contains(elem)?
>>
>> Best regards,
>> Sebastian
>


Mime
View raw message