kafka-dev mailing list archives

From Jan Filipiak <Jan.Filip...@trivago.com>
Subject Re: [VOTE] KIP-159: Introducing Rich functions to Streams
Date Sat, 11 Nov 2017 06:02:59 GMT
Hi,

I think this is the better way. Naming is always tricky; "Source" is kind of
taken. I had TopicBackedK[Source|Table] in mind, but for the user it is
already way better IMHO.

Thank you for reconsidering.

Best Jan


On 10.11.2017 22:48, Matthias J. Sax wrote:
> I was thinking about the source stream/table idea once more and it seems
> it would not be too hard to implement:
>
> We add two new classes
>
>    SourceKStream extends KStream
>
> and
>
>    SourceKTable extends KTable
>
> and return both from StreamsBuilder#stream and StreamsBuilder#table
>
> As both are sub-classes, this change is backward compatible. We change
> the return type of any single-record transform to these new types, too,
> and use KStream/KTable as the return type for any multi-record operation.
>
> The new RecordContext API is added to both new classes. For old classes,
> we only implement KIP-149 to get access to the key.
>
>
> WDYT?
>
>
> -Matthias
>
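[Editor's sketch] The sub-classing idea above could look roughly like this. This is a toy model, not actual Kafka Streams code: the shapes of KStream, SourceKStream, and RecordContext are illustrative assumptions based on the mail, showing how single-record transforms preserve the source type while multi-record operations fall back to the plain type.

```java
import java.util.function.Function;

public class SourceStreamSketch {

    // Toy stand-in for the record metadata discussed in KIP-159.
    interface RecordContext {
        String topic();
        int partition();
        long offset();
    }

    // Minimal stand-in for KStream with one multi-record operation.
    static class KStream<V> {
        final V value;
        KStream(V value) { this.value = value; }

        // Multi-record operations return the plain KStream: after a merge
        // there is no single meaningful RecordContext any more.
        KStream<V> merge(KStream<V> other) { return new KStream<>(value); }
    }

    // The proposed sub-class returned by StreamsBuilder#stream: it carries
    // the RecordContext, and single-record transforms preserve the type.
    static class SourceKStream<V> extends KStream<V> {
        final RecordContext context;
        SourceKStream(V value, RecordContext context) {
            super(value);
            this.context = context;
        }

        // Single-record transform: stays a SourceKStream, context intact.
        <W> SourceKStream<W> mapValues(Function<V, W> f) {
            return new SourceKStream<>(f.apply(value), context);
        }

        RecordContext recordContext() { return context; }
    }

    // Toy stand-in for StreamsBuilder#stream with fixed dummy metadata.
    static SourceKStream<String> stream(String topic, String value) {
        RecordContext ctx = new RecordContext() {
            public String topic() { return topic; }
            public int partition() { return 0; }
            public long offset() { return 42L; }
        };
        return new SourceKStream<>(value, ctx);
    }

    public static void main(String[] args) {
        SourceKStream<String> s = stream("input", "hello");
        // mapValues (single-record) keeps the source type and its context:
        SourceKStream<Integer> mapped = s.mapValues(String::length);
        System.out.println(mapped.recordContext().topic()); // input
        // merge (multi-record) falls back to the plain KStream type,
        // where no RecordContext is available:
        KStream<Integer> merged = mapped.merge(mapped);
    }
}
```

Because SourceKStream is a sub-class, existing code typed against KStream keeps compiling, which is the backward-compatibility argument made above.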
> On 11/9/17 9:13 PM, Jan Filipiak wrote:
>> Okay,
>>
>> looks like it would _at least work_ for cached KTableSources.
>> But we make it easier for the user to make mistakes by putting
>> features into places where they don't make sense and don't
>> help anyone.
>>
>> I once again think that my suggestion is easier to implement and
>> more correct. I will use this email to express my disagreement with the
>> proposed KIP (-1, non-binding of course) and state that I am open to any
>> questions regarding this. I will also do the usual thing and point out
>> that the friends over at Hive got it right as well. One cannot use their
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
>> in any place where the data is not read from the sources.
>>
>> With KSQL in mind, it makes me sad how this is evolving here.
>>
>> Best Jan
>>
>>
>>
>>
>>
>> On 10.11.2017 01:06, Guozhang Wang wrote:
>>> Hello Jan,
>>>
>>> Regarding your question about caching: today we already keep the record
>>> context with the cached entry, so when we flush the cache, which may
>>> generate new records to forward, we set the record context appropriately;
>>> then after the flush is completed we reset the context to the record that
>>> was active before the flush. But I think when Jeyhun does the PR it is a
>>> good time to double-check these stages to make sure we are not
>>> introducing any regressions.
>>>
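[Editor's sketch] The flush behavior Guozhang describes can be illustrated with a toy model. Everything here (CacheEntry, ProcessorContext, the offset-only RecordContext) is an assumption for illustration, not the actual Streams internals: each cached entry remembers the context of the record that produced it, the flush forwards under each entry's own context, and the pre-flush context is restored afterwards.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CacheFlushSketch {

    // Toy record context: just the offset, for illustration.
    static final class RecordContext {
        final long offset;
        RecordContext(long offset) { this.offset = offset; }
    }

    // Each cached entry remembers the context of the record that produced it.
    static final class CacheEntry {
        final String value;
        final RecordContext context;
        CacheEntry(String value, RecordContext context) {
            this.value = value;
            this.context = context;
        }
    }

    // Toy processor context holding the "current" record's metadata.
    static final class ProcessorContext {
        RecordContext current;
    }

    final Deque<CacheEntry> cache = new ArrayDeque<>();
    final ProcessorContext processorContext = new ProcessorContext();
    long forwardedWithOffsetSum = 0;

    void put(String value) {
        cache.add(new CacheEntry(value, processorContext.current));
    }

    // On flush: temporarily switch the processor context to each cached
    // entry's own context while forwarding, then restore the context of
    // the record that was active before the flush.
    void flush() {
        RecordContext before = processorContext.current;
        for (CacheEntry e : cache) {
            processorContext.current = e.context;   // set appropriately
            forward(e.value);
        }
        cache.clear();
        processorContext.current = before;          // reset after flush
    }

    void forward(String value) {
        // Record which offset each forwarded record was attributed to.
        forwardedWithOffsetSum += processorContext.current.offset;
    }

    public static void main(String[] args) {
        CacheFlushSketch s = new CacheFlushSketch();
        s.processorContext.current = new RecordContext(10);
        s.put("a");
        s.processorContext.current = new RecordContext(20);
        s.put("b");
        s.flush();
        // Entries were forwarded with their own offsets (10 + 20):
        System.out.println(s.forwardedWithOffsetSum); // 30
        // ...and the context was restored to the pre-flush record:
        System.out.println(s.processorContext.current.offset); // 20
    }
}
```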
>>>
>>> Guozhang
>>>
>>>
>>> On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <Jan.Filipiak@trivago.com>
>>> wrote:
>>>
>>>> I agree completely.
>>>>
>>>> Exposing this information in a place where it has no _natural_ belonging
>>>> might really be a bad blocker in the long run.
>>>>
>>>> Concerning your first point: I would argue it's not too hard to have a
>>>> user keep track of these. If we still don't want the user to keep track
>>>> of these, I would argue that all >projection only< transformations on a
>>>> source-backed KTable/KStream could also return a KTable/KStream instance
>>>> of the type we return from the topology builder. Only after an operation
>>>> that goes beyond projection or filtering would one return a KTable that
>>>> no longer grants access to this.
>>>>
>>>> Even then it's already difficult: I never ran a topology with caching,
>>>> but I am not even 100% sure what the RecordContext means behind a
>>>> materialized KTable with caching. Topic and partition can probably be
>>>> justified with some reasoning, but the offset is probably only the
>>>> offset that caused the flush? So one might as well think about dropping
>>>> offsets from this RecordContext.
>>>>
>>>> Best Jan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 07.11.2017 03:18, Guozhang Wang wrote:
>>>>
>>>>> Regarding the API design (the proposed set of overloads vs. one
>>>>> overload on #map to enrich the record), I think what we have represents
>>>>> a good trade-off between API succinctness and user convenience: on one
>>>>> hand we definitely want to keep as few overloaded functions as
>>>>> possible. But on the other hand, if we only do that in, say, the #map()
>>>>> function, then this enrichment could be overkill: think of a topology
>>>>> that has 7 operators in a chain, where users want to access the record
>>>>> context on operators #2 and #6 only; with the "enrichment" approach
>>>>> they need to do the enrichment on operator #2 and keep it that way
>>>>> until #6. In addition, the RecordContext fields (topic, offset, etc.)
>>>>> are really orthogonal to the key-value payloads themselves, so I think
>>>>> separating them into this object is a cleaner way.
>>>>>
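[Editor's sketch] The "enrichment" alternative Guozhang argues against could look roughly like this. The Enriched wrapper and the operator chain are hypothetical, purely to illustrate the objection: every operator between #2 and #6 has to drag the (value, context) pair along even though it never uses the context.

```java
import java.util.function.Function;

public class EnrichmentSketch {

    // Toy record metadata, orthogonal to the key-value payload.
    static final class RecordContext {
        final String topic;
        final long offset;
        RecordContext(String topic, long offset) {
            this.topic = topic;
            this.offset = offset;
        }
    }

    // With only a single #map overload exposing the context, users must
    // wrap value + context and carry the pair through every intermediate
    // operator between the places where the context is actually needed.
    static final class Enriched<V> {
        final V value;
        final RecordContext context;
        Enriched(V value, RecordContext context) {
            this.value = value;
            this.context = context;
        }
        <W> Enriched<W> map(Function<V, W> f) {
            return new Enriched<>(f.apply(value), context);
        }
    }

    public static void main(String[] args) {
        // Operator #2 enriched the record with its context...
        Enriched<String> e = new Enriched<>("payload", new RecordContext("input", 7));
        // ...operators #3..#5 don't need the context, yet they still have
        // to pass the wrapper along unchanged...
        Enriched<Integer> viaChain = e
                .map(String::length)   // #3
                .map(n -> n * 2)       // #4
                .map(n -> n + 1);      // #5
        // ...so that operator #6 can finally read it again:
        System.out.println(viaChain.context.offset); // 7
        System.out.println(viaChain.value);          // 15
    }
}
```

The overload approach avoids this by letting operators #2 and #6 each ask for the context directly, leaving #3..#5 untouched.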
>>>>> Regarding the RecordContext inheritance, this is actually a good point
>>>>> that has not been discussed thoroughly before. Here are my two cents:
>>>>> one natural way would be to inherit the record context from the
>>>>> "triggering" record; for example, in a join operator, if the record
>>>>> from stream A triggers the join, then the record context is inherited
>>>>> from that record. This is also aligned with the lower-level PAPI
>>>>> interface. A counter-argument, though, would be that this sort of leaks
>>>>> the internal implementation of the DSL: if moving forward we did some
>>>>> refactoring of our join implementations so that the triggering record
>>>>> can change, the RecordContext would also be different. I do not know
>>>>> how much it would really affect end users, but I would like to hear
>>>>> your opinions.
>>>>>
>>>> Agreed 100% about exposing this information
>>>>
>>>>
>>>>> Guozhang
>>>>>
>>>>>
>>>>> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <je.karimov@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Jan,
>>>>>>
>>>>>> Sorry for the late reply.
>>>>>>
>>>>>>
>>>>>> The API Design doesn't look appealing
>>>>>>
>>>>>>
>>>>>> In terms of API design we tried to preserve the Java functional
>>>>>> interfaces. We applied the same set of rich methods to KTable to make
>>>>>> it compatible with the rest of the overloaded APIs.
>>>>>>
>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is
>>>>>>> directly fed from a topic with 1 additional overload for the #map()
>>>>>>> methods to cover every usecase while keeping the API in a way better
>>>>>>> state.
>>>>>>
>>>>>> - IMO this seems like a workaround rather than a direct solution.
>>>>>>
>>>>>> Perhaps we should continue this discussion in the DISCUSS thread.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Jeyhun
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak <Jan.Filipiak@trivago.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> I do understand that it might come in handy.
>>>>>>> From my POV, in any relational algebra this is only a projection.
>>>>>>> Currently we hide these "fields" that come with the input record.
>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is
>>>>>>> directly fed from a topic with 1 additional overload for the #map()
>>>>>>> methods to cover every usecase while keeping the API in a way better
>>>>>>> state.
>>>>>>>
>>>>>>> Best Jan
>>>>>>>
>>>>>>> On 06.11.2017 17:52, Matthias J. Sax wrote:
>>>>>>>
>>>>>>>> Jan,
>>>>>>>>
>>>>>>>> I understand what you are saying. However, having a RecordContext is
>>>>>>>> super useful for operations that are applied to an input topic. Many
>>>>>>>> users requested this feature -- it's much more convenient than
>>>>>>>> falling back to transform() to implement, for example, a filter()
>>>>>>>> that wants to access some metadata.
>>>>>>>>
>>>>>>>> Because we cannot distinguish different "origins" of a
>>>>>>>> KStream/KTable, I am not sure if there would be a better way to do
>>>>>>>> this. The only "workaround" I see is to have two KStream/KTable
>>>>>>>> interfaces each, and we would use the first one for a KStream/KTable
>>>>>>>> with a "proper" RecordContext. But this does not seem to be a good
>>>>>>>> solution either.
>>>>>>>>
>>>>>>>> Note, a KTable can also be read directly from a topic. I agree that
>>>>>>>> using RecordContext on a KTable that is the result of an aggregation
>>>>>>>> is questionable. But I don't see a reason to down-vote the KIP for
>>>>>>>> this reason.
>>>>>>>>
>>>>>>>> WDYT about this?
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
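[Editor's sketch] The transform() fallback Matthias mentions can be sketched with toy types. In the real API this would be KStream#transform with a ProcessorContext exposing topic/partition/offset; here everything is simplified and the names are illustrative assumptions. The point is that a plain filter() predicate cannot see the metadata, so the filter must be expressed as a transformer instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class TransformFilterSketch {

    // Toy processor context exposing record metadata, as transform() does.
    static final class ProcessorContext {
        final String topic;
        final long offset;
        ProcessorContext(String topic, long offset) {
            this.topic = topic;
            this.offset = offset;
        }
    }

    // A "filter" implemented as a transformer so it can see the metadata:
    // keep the record only if it comes from the expected topic.
    static Optional<String> metadataAwareFilter(ProcessorContext ctx, String value) {
        if (ctx.topic.equals("orders")) {
            return Optional.of(value); // pass through
        }
        return Optional.empty();       // drop
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        metadataAwareFilter(new ProcessorContext("orders", 1), "o1").ifPresent(out::add);
        metadataAwareFilter(new ProcessorContext("audit", 2), "a1").ifPresent(out::add);
        System.out.println(out); // [o1]
    }
}
```

With a rich filter() that receives the RecordContext directly, the detour through a transformer would not be needed, which is the convenience argument above.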
>>>>>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote:
>>>>>>>>
>>>>>>>>> -1 non-binding
>>>>>>>>>
>>>>>>>>> I don't get the motivation.
>>>>>>>>> In 80% of my DSL processors there is no such thing as a reasonable
>>>>>>>>> RecordContext.
>>>>>>>>> After a join, the record I am processing belongs to at least 2
>>>>>>>>> topics.
>>>>>>>>> After a group-by, the record I am processing was created from
>>>>>>>>> multiple offsets.
>>>>>>>>>
>>>>>>>>> The API Design doesn't look appealing
>>>>>>>>>
>>>>>>>>> Best Jan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote:
>>>>>>>>>
>>>>>>>>>> Dear community,
>>>>>>>>>>
>>>>>>>>>> It seems the discussion for KIP-159 [1] has finally converged. I
>>>>>>>>>> would like to initiate voting for this particular KIP.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-159%3A+Introducing+Rich+functions+to+Streams
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Jeyhun
>>>>>>>>>>
>>>>>>>>>>

