flink-user mailing list archives

From Chesnay Schepler <ches...@apache.org>
Subject Re: Custom keyBy(), look for similaties
Date Wed, 08 Jun 2016 18:41:09 GMT
The idea behind key selectors is to extract a property on which you can 
do equality comparisons.

Let's get one question out of the way first:
is your scoring algorithm transitive? As in, if A==B and B==C, is it a 
given that A==C? Because if not, there's
just no way to group (= partition) the data, since B would belong to 2 
distinct groups.
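
To illustrate the problem, here is a minimal sketch in plain Java. The
score() function is a hypothetical stand-in (fraction of matching character
positions), not your actual algorithm, and the names and the 0.80 threshold
are only example values. It shows three strings where A~B and B~C hold but
A~C does not:

```java
class TransitivityCheck {
    // Example threshold taken from the discussion; an assumption, not a fixed rule.
    static final double THRESHOLD = 0.80;

    // Hypothetical stand-in score: fraction of positions holding the same character.
    // Only defined here for non-empty strings of equal length.
    static double score(String a, String b) {
        if (a.isEmpty() || a.length() != b.length()) {
            throw new IllegalArgumentException("expects non-empty strings of equal length");
        }
        int same = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / a.length();
    }

    // Two strings count as "the same" if their score is strictly above the threshold.
    static boolean similar(String a, String b) {
        return score(a, b) > THRESHOLD;
    }

    public static void main(String[] args) {
        String a = "Marten", b = "Martin", c = "Mertin";
        System.out.println("A~B: " + similar(a, b)); // one position differs
        System.out.println("B~C: " + similar(b, c)); // one position differs
        System.out.println("A~C: " + similar(a, c)); // two positions differ
    }
}
```

Here B is similar to both A and C, but A and C are not similar to each
other, so there is no single key that B could be assigned to.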

Even if it did work, one thing you have to realize is that this wouldn't 
scale at all. For every element that
comes in you would have to compare it to all other groups you have 
created so far.

What I would propose is the following: create a key-selector that allows 
a /rough/ grouping of your data.
Something like "John L" => "J L". On that group (that is hopefully 
relatively small) you can then fire up your
algorithm between all possible pairs to do whatever you wanna do.
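
A minimal sketch of such a rough key in plain Java, assuming that reducing
each name to its initials is good enough for your data. The class and method
names are made up for illustration. In a Flink job you would call this from
the getKey() method of your KeySelector:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RoughKey {
    // Reduces a name to the upper-cased initials of its whitespace-separated parts,
    // so that "John Locke" and "John L" end up with the same key.
    static String roughKey(String name) {
        List<String> initials = new ArrayList<>();
        for (String token : name.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                initials.add(token.substring(0, 1).toUpperCase(Locale.ROOT));
            }
        }
        return String.join(" ", initials);
    }

    public static void main(String[] args) {
        System.out.println(roughKey("John Locke"));
        System.out.println(roughKey("John L"));
    }
}
```

Both example names map to the same rough key, so they land in the same
group. Note that the key is deliberately coarse: "Jane Lee" ends up there
too, which is why the pairwise scoring still has to run inside each group.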

On 07.06.2016 10:48, iñaki williams wrote:
> Thanks for your answer Ufuk.
>
> However, I have been reading about KeySelector and I don't understand 
> completely how it works with my idea.
>
> I am using an algorithm that gives me a score between two different 
> strings. My idea is: if the score is higher than 0.80, for example, 
> then those two strings will be considered the same, and when I apply 
> keyBy("name") those similar strings will be keyed as if they had the 
> exact same name.
>
> On Monday, June 6, 2016, Ufuk Celebi <uce@apache.org> wrote:
>
>     Hey Iñaki,
>
>     you can use the KeySelector as described here:
>     https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-keys
>
>     But you only have a local view of the current element, e.g. the library
>     you use to determine the similarity has to know the similarities
>     upfront.
>
>     – Ufuk
>
>
>     On Mon, Jun 6, 2016 at 9:31 AM, iñaki williams
>     <juanramallo80@gmail.com> wrote:
>     > Hi guys,
>     >
>     > I am using Flink in my project and I have a question. (I am
>     > using Java.)
>     >
>     > Is it possible to modify the keyBy method in order to key by
>     > similarity and not by the exact name?
>     >
>     > Example: I receive 2 DataStreams. In the first one, the value of
>     > the field that I want to key by is "John Locke", while in the
>     > second DataStream the field value is "John L". Can I use some Java
>     > library to find similarities between strings and, if the
>     > similarity is high, key those elements together?
>

