kafka-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-3561) Auto create through topic for KStream aggregation and join
Date Sun, 05 Jun 2016 14:59:59 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315908#comment-15315908 ]

ASF GitHub Bot commented on KAFKA-3561:

GitHub user dguy opened a pull request:


    KAFKA-3561: Auto create through topic for KStream aggregation and join [WIP]

    @guozhangwang @enothereska @mjsax @miguno
    If you get a chance, could you please take a look at this? I've done the repartitioning in
the join, but it results in 2 internal topics for each join. This seems like overkill: sometimes
we wouldn't need to repartition at all, sometimes only one side (1 topic), and sometimes both,
but I'm not sure how we can know which case applies.
    I'd also need to implement something similar for leftJoin, but again, I'd like to see
if I'm heading down the right path or if anyone has any other bright ideas.
    For reference, the previous PR: https://github.com/apache/kafka/pull/1453
    Thanks for taking the time. I'm looking forward to getting some welcome advice :-)
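
    One possible way to tell the three cases apart (no topic, 1 topic, 2 topics) is to track
    whether a key-changing operation has been applied to each side of the join. The sketch
    below is only an illustration of that idea and is not code from this PR: StreamNode,
    keyChanged and repartitionTopicsForJoin are hypothetical names, and the "-repartition"
    topic suffix is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class RepartitionPlanner {

    // Hypothetical stand-in for a DSL stream node; this is not a Kafka Streams class.
    static final class StreamNode {
        final String name;
        // true once an operation that may change the key has been applied
        final boolean keyChanged;

        StreamNode(String name, boolean keyChanged) {
            this.name = name;
            this.keyChanged = keyChanged;
        }

        // map() may set a new key, so the result is flagged as possibly mis-partitioned.
        StreamNode map() {
            return new StreamNode(name + "-map", true);
        }

        // mapValues() cannot change the key, so the flag is carried over unchanged.
        StreamNode mapValues() {
            return new StreamNode(name + "-mapValues", keyChanged);
        }
    }

    // Returns the internal topics a join would need: one per side whose key may have changed.
    static List<String> repartitionTopicsForJoin(StreamNode left, StreamNode right) {
        List<String> topics = new ArrayList<>();
        if (left.keyChanged) {
            topics.add(left.name + "-repartition");
        }
        if (right.keyChanged) {
            topics.add(right.name + "-repartition");
        }
        return topics;
    }

    public static void main(String[] args) {
        StreamNode stream1 = new StreamNode("stream1", false);
        StreamNode stream2 = new StreamNode("stream2", false);

        // Neither side re-keyed: no internal topic.
        System.out.println(repartitionTopicsForJoin(stream1, stream2));
        // Only the left side re-keyed: one internal topic.
        System.out.println(repartitionTopicsForJoin(stream1.map(), stream2.mapValues()));
        // Both sides re-keyed: two internal topics.
        System.out.println(repartitionTopicsForJoin(stream1.map(), stream2.map()));
    }
}
```

    With a flag like this, the join would only pay for the internal topics it needs. It
    does not cover the case where the source topics themselves are not co-partitioned.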

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dguy/kafka KAFKA-3561

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1472
commit 053c809b5aa322327988909d53027d4682df6825
Author: Damian Guy <damian.guy@gmail.com>
Date:   2016-06-01T13:03:23Z

    repartition through internal topic on KStream.map()

commit ba1b50285f9eae028e531183500de90a2ae877c7
Author: Damian Guy <damian@continuum.local>
Date:   2016-06-04T15:17:51Z

    Merge remote-tracking branch 'upstream/trunk' into KAFKA-3561

commit a8c14f38914410bc3c7ff1e96c040bf1a1992cef
Author: Damian Guy <damian@continuum.local>
Date:   2016-06-05T14:41:12Z

    repartition on join

commit 1220c61464676881273e47d7e02ea8d502cd8fd4
Author: Damian Guy <damian.guy@gmail.com>
Date:   2016-06-05T14:51:17Z

    repartition on join


> Auto create through topic for KStream aggregation and join
> ----------------------------------------------------------
>                 Key: KAFKA-3561
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3561
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Guozhang Wang
>            Assignee: Damian Guy
>              Labels: api
>             Fix For:
> For KStream.join / aggregateByKey operations that require the streams to be partitioned
> on the record key, users today must repartition the streams themselves through the "through" call:
> {code}
> stream1 = builder.stream("topic1");
> stream2 = builder.stream("topic2");
> stream3 = stream1.map(/* set the right key for join*/).through("topic3");
> stream4 = stream2.map(/* set the right key for join*/).through("topic4");
> stream3.join(stream4, ..)
> {code}
> The Streams DSL can apply this pattern itself instead of requiring users to specify it,
> i.e. users can just set the right key (see KAFKA-3430) and then call join, which the DSL
> translates by adding an internal topic for repartitioning.
> Another issue is that today, if users do not call "through" after setting a new key, the
> aggregation result will not be correct: the aggregation is based on key B while the source
> partitions are partitioned by key A, so each task only gets a partial aggregation for each
> key. The DSL does not validate this today. We should do both the auto-translation and the
> validation.
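
The partial-aggregation problem described above can be reproduced without Kafka. The following is a minimal plain-Java sketch, not Kafka Streams code: Rec, partition and countPerTask are hypothetical names, and the hash-modulo partitioner is a stand-in assumption (Kafka's default partitioner hashes differently). Each "task" counts records per key B, but only sees the records of its own partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartialAggregationDemo {

    // A record with its original key (key A) and the new key set by map() (key B).
    static final class Rec {
        final String keyA;
        final String keyB;

        Rec(String keyA, String keyB) {
            this.keyA = keyA;
            this.keyB = keyB;
        }
    }

    // Stand-in partitioner (assumption): hash of the key modulo the partition count.
    static int partition(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts records per key B inside each task. Records are routed by key B when
    // repartitionByKeyB is true (the "through" step), otherwise by the original key A.
    static List<Map<String, Integer>> countPerTask(List<Rec> records,
                                                   int numPartitions,
                                                   boolean repartitionByKeyB) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            tasks.add(new HashMap<>());
        }
        for (Rec r : records) {
            String routingKey = repartitionByKeyB ? r.keyB : r.keyA;
            int p = partition(routingKey, numPartitions);
            tasks.get(p).merge(r.keyB, 1, Integer::sum);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>();
        records.add(new Rec("a", "x"));
        records.add(new Rec("b", "x"));
        records.add(new Rec("a", "x"));

        // Still partitioned by key A: the count for "x" is split across two tasks.
        System.out.println(countPerTask(records, 2, false)); // prints [{x=1}, {x=2}]
        // Repartitioned by key B: one task holds the complete count.
        System.out.println(countPerTask(records, 2, true));  // prints [{x=3}, {}]
    }
}
```

Without the repartition step no single task ever holds the full count of 3 for key "x", which is the incorrect result the issue asks the DSL to prevent or at least detect.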

This message was sent by Atlassian JIRA
