flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefano Bortoli <s.bort...@gmail.com>
Subject Re: PartitionByHash and usage of KeySelector
Date Thu, 06 Nov 2014 09:51:47 GMT
Hi Aljoscha,

with "creating partitions of each of the partitions" I mean that I need to
further partition the sets of objects result of the preliminary
macro-filter.

My problem is that the objects I deal with are relatively complex, and
rich. Furthermore, schema-less when it comes to 'selection of keys'.
Namely, I have a sets of Attributes, and each attribute has a name and
value. What I would like to do is to create rich KEY objects processing
these sets of Attributes according to some logic. Practically, I would have
to extract keys from each objects, and then let the partition function to
group records around these keys. Hopefully, the partitions generated by
these keys will be small enough to make the cross produce manageable, and
therefore reduce the cost of expensive matching functions implemented as
filters.

In my head, what I am trying to do is to prune as much as possible the
search space for duplicates (performing some sort of distributed blocking),
and then work expensive cross product comparison on relatively small sets.

Do you think it is feasible? Does it make sense?

Meanwhile I am creating a global index I can query in a map function, but
then Flink looses a bit of its appeal on this task. :-)

saluti,
Stefano


2014-11-06 10:18 GMT+01:00 Aljoscha Krettek <aljoscha@apache.org>:

> Hi Stefano,
> what to you mean by "creating partitions of each of the partitions"?
>
> The expressions keys can be used to specify fields of objects that
> should be used for hashing. So if you had objects of this class in a
> DataSet:
>
> class Foo {
>   public String bar;
>   public Integer baz;
>   public Float bat;
> }
>
> You could use for example:
>
> input.partitionByHash("baz", "bat")
>
> to perform the partitioning only on those two fields of the objects.
>
> Regards,
> Aljoscha
>
> On Thu, Nov 6, 2014 at 9:19 AM, Stefano Bortoli <s.bortoli@gmail.com>
> wrote:
> > Hi all,
> >
> > I am moving my first steps into becoming an Apache Flink user! I have
> > configured and run some simple jobs on a small cluster, and everything
> > worked quite fine so far.
> >
> > What I am trying to do right now is to run a duplication detection task
> on
> > dataset of about 9.5M records. The records are well structured, and
> > therefore we can exploit the semantic of attributes to narrow down
> expensive
> > match executions.
> >
> > My idea is the following:
> > 1. partition the dataset according to a macro-parameter written in the
> > record. This allows me to get to 7 partitions of different sizes but also
> > certainly disjoint. I do that by filtering on a specific type.
> > 2. create partitions of each of the partitions created in step 1 based on
> > some simple similarity that would reduce the number of expensive
> function. I
> > would like to do that by using partitionByHash and KeySelector.
> > 3. compute Cross product for each of the partitions defined in step 2;
> > 4. filter each pair of the cross product by applying an expensive boolean
> > matching function. Only positive matching duplicates will be retained.
> >
> > Currently I am working on the step 2, and I have some problems
> understanding
> > how to use the partitionByHash function. The main problem is that I need
> to
> > have a 'rich key' to support partition, and I discovered the
> ExpressionKeys
> > that would allow me to define hash keys with sets of Strings I can
> collect
> > from the record. However, the partitionByHash function does not allow to
> use
> > these objects as the hash must implement comparable.
> >
> > So, here is my question: how can I partition considering hash keys of
> more
> > than one String?
> >
> > Is there a better strategy to implement a de-duplication using Flink?
> >
> >
> > thanks a lot for your support.
> >
> > kind regards,
> >
> > Stefano Bortoli, PhD
> > ENS Technical Director
> > _______________________________________________
> > OKKAMSrl - www.okkam.it
> >
> > Email: bortoli@okkam.it
> >
> > Phone nr: +39 0461 1823912
> >
> > Headquarters: Trento (Italy), Via Trener 8
> > Registered office: Trento (Italy), via Segantini 23
> >
> > Confidentially notice. This e-mail transmission may contain legally
> > privileged and/or confidential information. Please do not read it if you
> are
> > not the intended recipient(S). Any use, distribution, reproduction or
> > disclosure by any other person is strictly prohibited. If you have
> received
> > this e-mail in error, please notify the sender and destroy the original
> > transmission and its attachments without reading or saving it in any
> manner.
>

Mime
View raw message