flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2549) Add topK operator for DataSet
Date Wed, 23 Sep 2015 17:31:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904869#comment-14904869 ]

ASF GitHub Bot commented on FLINK-2549:

Github user StephanEwen commented on the pull request:

    This looks super impressive and very well tested.
    The way that the operator is integrated into the system needs some improvement, though.
The problem is mainly how the managed memory is obtained.
    The MemoryManager's memory is shared among all concurrently running tasks. This implementation
takes up to half the total memory, which will crash programs that have other memory
consumers in the same pipeline. The tests here pass only because the operator is executed in
isolation, with no other memory-consuming operators in the test program.
    Memory consumers need to be known to the Optimizer (during program generation) so that it
can compute the maximal fraction of memory that a given consumer may request. That value is
part of the Task's configuration and is used by the memory consumer to obtain the right maximum
amount.
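
    As a rough illustration of the fraction idea, here is a minimal Java sketch. It does not use
Flink's actual classes: `SharedMemoryPool`, `MemoryBudgetSketch`, the even split in
`fractionPerConsumer`, and the page counts are all hypothetical names and assumptions chosen
for the example. It only shows why a consumer that grabs a fixed half of the pool breaks once
other consumers share the pipeline, while a planner-assigned fraction does not.

```java
// Hypothetical sketch: none of these types are Flink classes.
final class SharedMemoryPool {
    private final int totalPages;
    private int freePages;

    SharedMemoryPool(int totalPages) {
        this.totalPages = totalPages;
        this.freePages = totalPages;
    }

    /** Converts a fraction of the whole pool into a number of pages (rounded down). */
    int pagesForFraction(double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]: " + fraction);
        }
        return (int) (totalPages * fraction);
    }

    /** Reserves pages, or fails if the pool cannot satisfy the request. */
    void allocate(int pages) {
        if (pages > freePages) {
            throw new IllegalStateException(
                "requested " + pages + " pages, only " + freePages + " free");
        }
        freePages -= pages;
    }

    int freePages() {
        return freePages;
    }
}

final class MemoryBudgetSketch {
    /** Planner side (assumed policy): split the pool evenly among the known consumers. */
    static double fractionPerConsumer(int consumers) {
        if (consumers < 1) {
            throw new IllegalArgumentException("need at least one consumer");
        }
        return 1.0 / consumers;
    }

    public static void main(String[] args) {
        int consumers = 4;

        // Each consumer requests only the fraction it was assigned: all four succeed.
        SharedMemoryPool planned = new SharedMemoryPool(1000);
        double fraction = fractionPerConsumer(consumers);
        for (int i = 0; i < consumers; i++) {
            planned.allocate(planned.pagesForFraction(fraction));
        }
        System.out.println("planned: free pages left = " + planned.freePages());

        // Each consumer grabs half of the total: the third request cannot be satisfied.
        SharedMemoryPool greedy = new SharedMemoryPool(1000);
        try {
            for (int i = 0; i < consumers; i++) {
                greedy.allocate(greedy.pagesForFraction(0.5));
            }
        } catch (IllegalStateException e) {
            System.out.println("greedy: " + e.getMessage());
        }
    }
}
```

    The point of the sketch is that the fraction is decided where all consumers are visible
(the planner) and handed to each consumer through its configuration, so no consumer has to
guess how much of the shared pool it may take.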
    Integrating operators into the optimizer's planning is a bit tedious and not as easy as
it could be (unfortunately, we have not gotten around to refactoring this yet). Maybe we can
add some tooling that would mark a UDF as MemoryConsuming and would in that case expose a
MemoryAllocator that returns the right amount of memory.
    What we could do is the following: I will try to get to refactoring some of the Managed
Memory Allocation abstractions (we need this anyway for more components) and then expose
a MemoryAllocator in the runtime context, which is accessible if a user-defined function has
been annotated as a memory consumer.
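
    To make the proposal concrete, the following Java sketch shows one way such an opt-in
could look. Everything in it is hypothetical: the `@MemoryConsumer` annotation, the
`MemoryAllocator` interface, and `SketchRuntimeContext` are invented for illustration and are
not existing Flink APIs, and the page budget is assumed to come from the planner.

```java
// Hypothetical sketch: these are illustrative types, not Flink APIs.

/** Marks a user-defined function as a consumer of managed memory. */
@java.lang.annotation.Retention(java.lang.annotation.RetentionPolicy.RUNTIME)
@java.lang.annotation.Target(java.lang.annotation.ElementType.TYPE)
@interface MemoryConsumer {}

/** Hands out the budget that the planner assigned to this consumer. */
interface MemoryAllocator {
    int maxPages();
}

final class SketchRuntimeContext {
    private final Object function;
    private final int budgetPages;

    SketchRuntimeContext(Object function, int budgetPages) {
        this.function = function;
        this.budgetPages = budgetPages;
    }

    /** Only functions annotated as memory consumers may obtain an allocator. */
    MemoryAllocator getMemoryAllocator() {
        if (!function.getClass().isAnnotationPresent(MemoryConsumer.class)) {
            throw new IllegalStateException(
                function.getClass().getSimpleName() + " is not marked as a memory consumer");
        }
        return () -> budgetPages;
    }
}

/** A function that declares its need for managed memory. */
@MemoryConsumer
final class AnnotatedTopKFunction {}

/** A function without the annotation gets no allocator. */
final class PlainFunction {}

final class AllocatorAccessSketch {
    public static void main(String[] args) {
        SketchRuntimeContext ok = new SketchRuntimeContext(new AnnotatedTopKFunction(), 250);
        System.out.println("budget pages: " + ok.getMemoryAllocator().maxPages());

        SketchRuntimeContext denied = new SketchRuntimeContext(new PlainFunction(), 250);
        try {
            denied.getMemoryAllocator();
        } catch (IllegalStateException e) {
            System.out.println("denied: " + e.getMessage());
        }
    }
}
```

    Tying access to a declaration on the function keeps the set of memory consumers visible
before the job runs, which is what the planner needs in order to assign fractions.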
    This may take me two weeks (I am currently in the midst of working on the streaming windows),
but if you don't mind letting this rest for some days, I think that is the cleanest approach.
    The other parts of the code look good, so after I finish my part, it should be a simple
rebase of the TopKMapPartition function and the TopKReducer, and then this is good to merge.
    What do you think?

> Add topK operator for DataSet
> -----------------------------
>                 Key: FLINK-2549
>                 URL: https://issues.apache.org/jira/browse/FLINK-2549
>             Project: Flink
>          Issue Type: New Feature
>          Components: Core, Java API, Scala API
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>            Priority: Minor
> topK is a common operation for users; it would be great to have it in Flink.

This message was sent by Atlassian JIRA
