flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vasiliki Kalavri <vasilikikala...@gmail.com>
Subject Re: Question about limitations of iterative algorithms
Date Wed, 04 Nov 2015 15:32:36 GMT
Hi Andre,

On 4 November 2015 at 16:04, André Petermann <
petermann@informatik.uni-leipzig.de> wrote:

> Hi Fabian,
>
> thanks for your fast reply!
>
> I created a gist to explain the while-not-empty loop in more detail:
> https://gist.github.com/p3et/9f6e56cf0b68213e3e2b
>
> It is an approach to create a minimal example of the kind of algorithm
> corresponding to page 1 of the PDF, in particular frequent substrings.
> Please don't care about the algorithm itself, the actual algorithm is much
> more complex. However, even this simple case runs only for 7 iterations on
> my machine before "too few memory segments" is raising.
>
> If I understood delta iteration right, it's, necessary to provide a 1:1
> mapping between working and solution set items. The empty-case you referred
> to is the no-more-updates case, but not no-more-items, right?
>


I think it's possible to do what you want with a delta iteration.
​
The solution set and workset *don't* need to be of the same type. When you
define the delta iteration,
e.g. DeltaIteration<ST, WT> iteration = ...,
ST is the type of the solution set and WT is the type of the workset.

​

>
> In my example, the working set is completely replaced for each iteration
> (line 52), with only a parent-child mapping. The solution set is initially
> empty (line 33) and stores results of all iterations (line 48). I hope this
> shows the difference to the delta iteration and while-not-empty. Further
> on, you see the different data types of working and solution sets.


> I will provide a further Gist for the "second choice" solution using delta
> iteration. However, what we actually would prefer is to replace the
> for-loop by something like while(embeddings.isNotEmpty()) including a truly
> iterative execution.
>


​The default convergence criterion for a delta iteration is an empty
workset. Thus, if you set embeddings as the workset, you have your
"while(embeddings.isNotEmpty())" logic.​
​ ​Also, as far as I know, there is no problem with appending new elements
to the solution set. So, using allFrequentPatterns as the solution set
should be fine.



>
> But please let me know if I missed some Flink features already enabling
> such loops.
>
> Thanks,
> Andre



Does this clear things out a bit? Let me know if I misunderstood what you
want to do.

Cheers,
-Vasia.



>
>
> On 03.11.2015 15:47, Fabian Hueske wrote:
>
>> Hi Andre,
>>
>> Thanks for reaching out to the Flink community!
>>
>> I am not sure your analysis is based on correct assumptions about Flink's
>> delta iterations.
>> Flink's delta iterations do support
>> - working and solution sets of different types
>> - worksets that grow and shrink or are completely replaced in each
>> iteration
>> - solution sets that grow (up to the point of the too few memory segments
>> error)
>> - termination via a custom aggregator. So it is max number of iterations,
>> empty workset, and a custom termination criterion.
>>
>> The problem of the "too few memory segments" occurs because the solution
>> set is held in an in-memory hash table.
>> Spilling that table would result in a significant iteration slowdown. One
>> solution is to throw more resources at the problem (more nodes, more RAM).
>> Another work-around is to reduce the amount of managed memory and switch
>> the solution set hash table to "unmanaged". This will keep the result set
>> in a regular Java HashMap but this might cause a OOMError and kill the JVM
>> if it grows too large.
>> AFAIK, there is no way to shrink the solution set at the moment. It might
>> be worth to investigate into that direction.
>>
>> Regarding your questions about the while-not-empty loop. What exactly is
>> the difference to the default delta iteration. It does stop when the
>> workset is empty (in addition to the custom termination criterion).
>> I am not sure if I got all your points. Please let me know, if I got
>> something wrong.
>>
>> Thanks,
>> Fabian
>>
>> 2015-11-03 13:50 GMT+01:00 André Petermann <
>> petermann@informatik.uni-leipzig.de>:
>>
>> Sorry for redundant post ...
>>>
>>>
>>> Hi all,
>>>
>>> We are working on an implementation of Frequent Subgraph Mining using
>>> Flink. At that, the "too few memory segments error" prevents the most
>>> promising solution. The problem is not specific for graphs, but all
>>> iterative problems where
>>>
>>> - working and solution sets contain data of different types
>>> - the working set may grow, shrink or is replaced for each iteration
>>> - the solution set grows for each iteration
>>> - the termination criterion is based on data set metrics,
>>>    e.g. while working set not empty
>>>
>>> An illustration of our problem workflow, generalized to graph unspecific
>>> frequent pattern mining, can be found here:
>>>
>>>
>>> https://github.com/dbs-leipzig/gradoop/blob/master/dev-support/loopWithIntermediateResultMemory.pdf
>>>
>>> Page 1 shows the most promising solution. We started implementing it
>>> using a for loop. However, the "too few memory segments error" makes it
>>> untestable. As the iteration body itself is a complex workflow and the
>>> number of iterations is arbitrary, unrolling it while reserving operator
>>> memory will be a permanent limitation. Increasing limits or physical
>>> memory would only delay the problem. The resulting question is:
>>>
>>> Would it be possible to implement a while-not-empty or at least a for
>>> loop, that isn't unrolled and can be executed more memory efficient?
>>>
>>> Page 2 shows an alternative solution to our problem using the concept of
>>> delta iteration. However, Flink delta iteration does neither support
>>> broadcasting nor working-set independent intermediate results. Page 3
>>> shows our working solution using two workarounds for those restrictions.
>>> However, these workarounds lead to unnecessary memory consumption and
>>> redundant expensive computations. So, in the case the answer to the
>>> first question is no, a second question:
>>>
>>> Would it be possible to extend the delta iteration by the support of
>>> rich map functions with broadcast sets and the memory of intermediate
>>> results?
>>>
>>> We think, that a while-not-empty loop might be useful for other
>>> algorithms too, e.g. variable length path search in graphs. Did we miss
>>> Flink features meeting our requirements? Do you think it's worth to
>>> create an improvement issue? At that, we'd of course be willing to
>>> contribute in the form of development.
>>>
>>> Best Regards
>>> Andre
>>>
>>> --
>>> -------------------------------------------
>>> PhD Student
>>> University of Leipzig
>>> Department of Computer Science
>>> Database Research Group
>>>
>>> email: petermann@informatik.uni-leipzig.de
>>> web:   dbs.uni-leipzig.de
>>> -------------------------------------------
>>>
>>>
>>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message