apex-dev mailing list archives

From "Siyuan Hua (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXMALHAR-2099) Identify overlap between Beam API and existing Apex Stream API
Date Thu, 02 Jun 2016 01:04:59 GMT

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311525#comment-15311525
] 

Siyuan Hua commented on APEXMALHAR-2099:
----------------------------------------

[~ilganeli] We do also have a plan for GroupBy; it's just not there yet :) We need to build
the operator first.
You can follow this ticket https://issues.apache.org/jira/browse/APEXMALHAR-2085 for details.
It would actually address more problems.
In terms of the high-level API, we categorize all transformations into 2 different types:
windowed operations and non-windowed operations. As you might have seen, there is a
WindowedStream interface as well (empty for now, but methods will be added later). For
operations like map and filter, a window is meaningless, but for operations like group, join,
etc. a window is a must-have (by default it might be one global window, but there is still a
window). Furthermore, we would divide windowed operations into 3 different types: group,
combine, and join. Group needs to retain the tuples in the window (for sorting, for example),
Combine needs to combine the tuples in the window into one or multiple tuples (max, sum,
reduce, topN, etc.), and Join needs to merge multiple upstream inputs into one. All in all,
we do have a detailed plan for that.
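
To make the three windowed categories concrete, here is a minimal Java sketch over a single
global window held as an in-memory list. Every name in it (WindowedOpsSketch, group, combine,
join) is hypothetical and made up for illustration; none of it is the actual Apex Stream API,
and the join here is the simplest possible merge (concatenation), not a keyed join.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical illustration only: these names are NOT part of the Apex Stream API.
// A "window" is modelled as a plain List holding the tuples of one global window.
final class WindowedOpsSketch {
    private WindowedOpsSketch() {}

    // Group: must retain every tuple of the window (here they are returned sorted).
    static <T> List<T> group(List<T> window, Comparator<? super T> order) {
        List<T> retained = new ArrayList<>(window);
        retained.sort(order);
        return retained;
    }

    // Combine: folds the tuples of the window into a single value (sum, max, ...).
    static <T> T combine(List<T> window, T identity, BinaryOperator<T> fn) {
        T acc = identity;
        for (T tuple : window) {
            acc = fn.apply(acc, tuple);
        }
        return acc;
    }

    // Join: merges the windows of several upstream inputs into one output.
    // Assumption: plain concatenation stands in for a real join strategy.
    @SafeVarargs
    static <T> List<T> join(List<T>... upstreams) {
        List<T> merged = new ArrayList<>();
        for (List<T> upstream : upstreams) {
            merged.addAll(upstream);
        }
        return merged;
    }
}
```

Note the difference in state: group has to keep the whole window around, while combine only
needs one accumulator, which is one reason the categories would be handled separately.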
Now let's look at the Beam API again. Personally I think it is quite messy. I also looked at
all the PTransforms there (the categorization above actually comes from my analysis of all
the PTransforms :) ). They don't have a clear hierarchy in their PTransform family. For
example, window is one of the transformations, which really goes against my interpretation of
the word "Transformation". A window alone is meaningless; what we actually want to do is Sum,
Max, or Average in a specific window. Like you said, it's just a strategy in our next cut of
the Stream API.
And then, you will eventually have abstractions at different layers; a simple PTransform
base class DOES NOT work. The reason we categorize operations into windowed and non-windowed,
and further break the windowed ones into Group/Combine/Join, is that operations in the same
category usually share similar internal mechanisms such as state checkpointing, trigger
behaviour, and so forth. You don't want to repeat that work for each of them. And Beam
actually does the same thing. For example, if you want to do Sum, there is no such thing as
a SumTransform; instead they have a number of Sum**Fn classes which implement the CombineFn
class, which is finally used by some PTransform. We will have a similar method in the
WindowedStream interface which takes a CombineFn instead of Sum, Reduce, etc. (or maybe we
will have them all).
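
As a rough sketch of that idea, the Java below defines a simplified combine-function
interface and a stand-in for a windowed stream whose single combine method accepts any such
function. All names here (CombineFn as declared below, SumLongsFn, AverageLongsFn,
WindowSketch) are hypothetical; the interface is only loosely modelled on the idea of Beam's
CombineFn and is neither Beam's actual class nor the eventual WindowedStream API.

```java
import java.util.List;

// Hypothetical, simplified combine function: I = input, A = accumulator, O = output.
// Assumption: this three-method shape is for illustration, not a real library type.
interface CombineFn<I, A, O> {
    A createAccumulator();
    A addInput(A accumulator, I input);
    O extractOutput(A accumulator);
}

// Sum needs only a running total as its accumulator.
final class SumLongsFn implements CombineFn<Long, Long, Long> {
    public Long createAccumulator() { return 0L; }
    public Long addInput(Long accumulator, Long input) { return accumulator + input; }
    public Long extractOutput(Long accumulator) { return accumulator; }
}

// Average keeps {sum, count} as its accumulator and emits a Double.
final class AverageLongsFn implements CombineFn<Long, long[], Double> {
    public long[] createAccumulator() { return new long[] {0L, 0L}; }
    public long[] addInput(long[] accumulator, Long input) {
        accumulator[0] += input;
        accumulator[1] += 1;
        return accumulator;
    }
    public Double extractOutput(long[] accumulator) {
        // An empty window has no average; NaN marks that case.
        return accumulator[1] == 0 ? Double.NaN : (double) accumulator[0] / accumulator[1];
    }
}

// Stand-in for a windowed stream: it holds the tuples of one global window.
final class WindowSketch<T> {
    private final List<T> tuples;

    WindowSketch(List<T> tuples) { this.tuples = tuples; }

    // One generic method serves Sum, Average, Max, ... via the function argument.
    <A, O> O combine(CombineFn<? super T, A, O> fn) {
        A accumulator = fn.createAccumulator();
        for (T tuple : tuples) {
            accumulator = fn.addInput(accumulator, tuple);
        }
        return fn.extractOutput(accumulator);
    }
}
```

With this shape, new WindowSketch<>(Arrays.asList(1L, 2L, 3L)).combine(new SumLongsFn())
folds the window with the sum function, and swapping in new AverageLongsFn() changes the
aggregation without adding a new method to the stream type.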

Thanks for your opinion; let's make our API rock together!

> Identify overlap between Beam API and existing Apex Stream API
> --------------------------------------------------------------
>
>                 Key: APEXMALHAR-2099
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2099
>             Project: Apache Apex Malhar
>          Issue Type: Sub-task
>            Reporter: Ilya Ganelin
>
> There should be some overlap between the Beam API and the recently released Apex Stream
API. This task captures the need to understand and document this overlap.
> AC:
> * A document or JIRA comment identifying which components of the Beam API are implemented,
similar, or absent within the Apex Stream API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
