flink-issues mailing list archives

From "Greg Hogan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3910) New self-join operator
Date Wed, 25 May 2016 12:18:12 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299963#comment-15299963 ]

Greg Hogan commented on FLINK-3910:

[~fhueske] thanks for looking at this idea. The current reduce-based implementations of {{selfJoin}}
only generate pairs from the "strictly upper triangular matrix", so we never generate {{(x, x)}}
and we generate each unordered pair once as {{(x, y)}} rather than as both {{(x, y)}} and {{(y, x)}}.
If {{selfJoin}} is a new operation then we can retain the same algorithm performance by outputting
{{(x, null)}} pairs and allowing the user to assume {{(y, x)}} when given {{(x, y)}}.
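
To make the "strictly upper triangular" enumeration concrete, here is a minimal plain-Java sketch for a single key group. It does not use any Flink API; {{SelfJoinSketch}}, {{Pair}} and {{upperTriangularPairs}} are hypothetical names chosen for illustration, and the actual reduce-based implementations may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelfJoinSketch {

    /** Simple immutable pair; a hypothetical helper, not a Flink type. */
    public static final class Pair<T> {
        public final T first;
        public final T second;

        public Pair(T first, T second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return "(" + first + ", " + second + ")";
        }
    }

    /**
     * Emits each unordered pair of distinct positions exactly once,
     * i.e. the strictly upper triangular part of the n x n pair matrix.
     * No (x, x) pairs, and never both (x, y) and (y, x).
     */
    public static <T> List<Pair<T>> upperTriangularPairs(List<T> group) {
        List<Pair<T>> out = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            // Start at i + 1 to skip the diagonal and the lower triangle.
            for (int j = i + 1; j < group.size(); j++) {
                out.add(new Pair<>(group.get(i), group.get(j)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [(a, b), (a, c), (b, c)]
        System.out.println(upperTriangularPairs(Arrays.asList("a", "b", "c")));
    }
}
```

A group of n elements yields n(n-1)/2 pairs this way instead of the n^2 pairs of a full self-cross, which is where the performance difference comes from.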

The second listed method, using a reduce, requires that types implement {{CopyableValue}}
in order to enable object reuse, whereas a custom driver has access to the serializer.

A third method for {{selfJoin}} is demonstrated in the recently committed {{JaccardIndex}}
using reduceGroup, flatMap, and reduceGroup to obviate data skew.

A {{SelfJoinFunction}} would be configured with one input type and key set rather than two
as in {{JoinFunction}}. Also, wouldn't {{SelfJoinHint}} be exclusive of {{JoinHint}}?
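
As a rough sketch of what "one input type" could look like, the interface below takes a single {{IN}} type parameter where {{JoinFunction}} takes two. This is purely hypothetical: {{SelfJoinFunction}} does not exist in Flink, the signature is an assumption, and {{applyToGroup}} is an illustrative stand-in for whatever driver would invoke it.

```java
import java.util.ArrayList;
import java.util.List;

public class SelfJoinFunctionSketch {

    /**
     * Hypothetical interface, not part of Flink: both arguments share the
     * single input type IN, unlike a two-input join function.
     */
    @FunctionalInterface
    public interface SelfJoinFunction<IN, OUT> {
        OUT join(IN first, IN second);
    }

    /**
     * Illustrative stand-in for a driver: applies the function once to each
     * unordered pair of distinct elements within one key group.
     */
    public static <IN, OUT> List<OUT> applyToGroup(List<IN> group,
                                                   SelfJoinFunction<IN, OUT> fn) {
        List<OUT> out = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            for (int j = i + 1; j < group.size(); j++) {
                out.add(fn.join(group.get(i), group.get(j)));
            }
        }
        return out;
    }
}
```

For example, {{applyToGroup(Arrays.asList(1, 2, 3), (a, b) -> a + b)}} would call the function on (1, 2), (1, 3) and (2, 3).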

> New self-join operator
> ----------------------
>                 Key: FLINK-3910
>                 URL: https://issues.apache.org/jira/browse/FLINK-3910
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataSet API, Java API, Scala API
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
> Flink currently provides inner- and outer-joins as well as cogroup and the non-keyed
> cross. {{JoinOperator}} hints at future support for semi- and anti-joins.
> Many Gelly algorithms perform a self-join [0]. Still pending reviews, FLINK-3768 performs
> a self-join on non-skewed data in TriangleListing.java and FLINK-3780 performs a self-join
> on skewed data in JaccardSimilarity.java. A {{SelfJoinHint}} will select between skewed and
> non-skewed implementations.
> The object-reuse-disabled case can be simply handled with a new {{Operator}}. The object-reuse-enabled
> case requires either {{CopyableValue}} types (as in the code above) or a custom driver which
> has access to the serializer (or making the serializer accessible to rich functions, and I
> think there be dragons).
> If the idea of a self-join is agreeable, I'd like to work out a rough implementation
> and go from there.
> [0] https://en.wikipedia.org/wiki/Join_%28SQL%29#Self-join

This message was sent by Atlassian JIRA
