flink-dev mailing list archives

From Vasiliki Kalavri <vasilikikala...@gmail.com>
Subject Re: Question about limitations of iterative algorithms
Date Thu, 05 Nov 2015 14:04:47 GMT
Hi Andre,

I'm happy you were able to solve your problem :)

Improvements to the documentation are always welcome!
To me ST != WT is straightforward from the javadocs, but I guess it
wouldn't hurt to stress it in the docs.
Do you think you could simplify your implementation a bit to make for a
nice example? It might be a bit too complicated to follow as it is right now.

In any case, if you would like to improve the delta iteration docs, please
go ahead and open a JIRA. We can discuss the details of what improvements
to make over there.


On 5 November 2015 at 11:41, André Petermann <
petermann@informatik.uni-leipzig.de> wrote:

> Hi Vasia!
> Many thanks for your reply! Your hints finally enabled me to implement the
> problem's first-choice solution using delta iteration:
> https://gist.github.com/p3et/12deb7d6321b48e9efab
> Do you think this would be worth contributing as an example within
> the Flink documentation? The examples I found so far didn't help me
> understand how to use delta iteration for this kind of loop
> (ST != WT, start from empty solution set, ...).
> Cheers,
> Andre
> On 04.11.2015 16:32, Vasiliki Kalavri wrote:
>> Hi Andre,
>> On 4 November 2015 at 16:04, André Petermann <
>> petermann@informatik.uni-leipzig.de> wrote:
>>> Hi Fabian,
>>> thanks for your fast reply!
>>> I created a gist to explain the while-not-empty loop in more detail:
>>> https://gist.github.com/p3et/9f6e56cf0b68213e3e2b
>>> It is an approach to create a minimal example of the kind of algorithm
>>> corresponding to page 1 of the PDF, in particular frequent substrings.
>>> Please don't mind the algorithm itself; the actual algorithm is much
>>> more complex. However, even this simple case runs for only 7 iterations
>>> on my machine before the "too few memory segments" error is raised.
>>> If I understood delta iteration right, it's necessary to provide a 1:1
>>> mapping between working and solution set items. The empty case you
>>> referred to is the no-more-updates case, but not the no-more-items case,
>>> right?
>> I think it's possible to do what you want with a delta iteration.
>> The solution set and workset *don't* need to be of the same type. When you
>> define the delta iteration,
>> e.g. DeltaIteration<ST, WT> iteration = ...,
>> ST is the type of the solution set and WT is the type of the workset.
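To make that concrete, a hypothetical skeleton (placeholder types and transformations, not taken from Andre's gists) could look roughly like this:

```java
// Sketch only: ST = Tuple2<Long, String> (solution set) and
// WT = Tuple2<Long, Integer> (workset) are deliberately different types.
DataSet<Tuple2<Long, String>>  initialSolution = /* e.g. env.fromElements(...) */;
DataSet<Tuple2<Long, Integer>> initialWorkset  = /* e.g. env.fromElements(...) */;

// iterateDelta(initialWorkset, maxIterations, solutionSetKeyField)
DeltaIteration<Tuple2<Long, String>, Tuple2<Long, Integer>> iteration =
    initialSolution.iterateDelta(initialWorkset, 100, 0);

// Build the solution-set delta (type ST) and the next workset (type WT)
// from iteration.getWorkset() / iteration.getSolutionSet().
DataSet<Tuple2<Long, String>>  delta       = /* ... */;
DataSet<Tuple2<Long, Integer>> nextWorkset = /* ... */;

// Terminates when nextWorkset becomes empty (default) or after 100 supersteps.
DataSet<Tuple2<Long, String>> result = iteration.closeWith(delta, nextWorkset);
```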
>>> In my example, the working set is completely replaced in each iteration
>>> (line 52), with only a parent-child mapping. The solution set is initially
>>> empty (line 33) and stores the results of all iterations (line 48). I hope
>>> this shows the difference to the delta iteration and while-not-empty.
>>> Furthermore, you can see the different data types of the working and
>>> solution sets.
>>> I will provide a further Gist for the "second choice" solution using delta
>>> iteration. However, what we would actually prefer is to replace the
>>> for-loop with something like while(embeddings.isNotEmpty()), including a
>>> truly iterative execution.
>> The default convergence criterion for a delta iteration is an empty
>> workset. Thus, if you set embeddings as the workset, you have your
>> "while(embeddings.isNotEmpty())" logic.
>> Also, as far as I know, there is no problem with appending new elements
>> to the solution set. So, using allFrequentPatterns as the solution set
>> should be fine.
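The semantics Vasia describes (workset and solution set of different types, a solution set that starts empty and only grows, termination when the workset runs dry) can be sketched outside Flink as a plain Java loop; the toy data here is hypothetical, not the actual pattern-mining logic:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WhileNotEmptySketch {
    public static void main(String[] args) {
        // Solution set (ST): pattern -> superstep it was found in; starts
        // empty and only grows. Workset (WT): plain candidate strings, a
        // different type than the solution-set entries.
        Map<String, Integer> solutionSet = new HashMap<>();
        List<String> workset = new ArrayList<>(List.of("a", "b"));

        int superstep = 0;
        // Default delta-iteration convergence: stop when the workset is empty.
        while (!workset.isEmpty()) {
            superstep++;
            List<String> nextWorkset = new ArrayList<>();
            for (String candidate : workset) {
                // The "delta": append new results to the solution set.
                solutionSet.put(candidate, superstep);
                // Grow candidates up to length 3, then stop producing work.
                if (candidate.length() < 3) {
                    nextWorkset.add(candidate + "x");
                }
            }
            workset = nextWorkset; // workset is completely replaced each round
        }

        System.out.println(superstep);          // 3
        System.out.println(solutionSet.size()); // 6
    }
}
```

In Flink, that while loop is what closeWith(delta, nextWorkset) gives you implicitly.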
>>> But please let me know if I missed some Flink features already enabling
>>> such loops.
>>> Thanks,
>>> Andre
>> Does this clear things up a bit? Let me know if I misunderstood what you
>> want to do.
>> Cheers,
>> -Vasia.
>>> On 03.11.2015 15:47, Fabian Hueske wrote:
>>>> Hi Andre,
>>>> Thanks for reaching out to the Flink community!
>>>> I am not sure your analysis is based on correct assumptions about
>>>> Flink's
>>>> delta iterations.
>>>> Flink's delta iterations do support
>>>> - working and solution sets of different types
>>>> - worksets that grow and shrink or are completely replaced in each
>>>> iteration
>>>> - solution sets that grow (up to the point of the too few memory
>>>> segments
>>>> error)
>>>> - termination via a custom aggregator, so you can terminate on a maximum
>>>> number of iterations, an empty workset, or a custom termination
>>>> criterion.
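The custom-aggregator termination Fabian mentions checks an aggregated value at the end of each superstep; its effect can be sketched in plain Java (toy numbers, not the real Flink aggregator API):

```java
import java.util.ArrayList;
import java.util.List;

public class AggregatorConvergenceSketch {
    public static void main(String[] args) {
        List<Integer> workset = new ArrayList<>(List.of(8, 8));
        int superstep = 0;
        long aggregate;
        do {
            superstep++;
            aggregate = 0; // the aggregator is reset at each superstep start
            List<Integer> nextWorkset = new ArrayList<>();
            for (int v : workset) {
                int halved = v / 2;
                if (halved > 0) {
                    nextWorkset.add(halved);
                    aggregate += halved; // stand-in for aggregator.aggregate(v)
                }
            }
            workset = nextWorkset;
            // Custom criterion: stop once the per-superstep sum drops below 4,
            // even though the workset itself is not yet empty.
        } while (!workset.isEmpty() && aggregate >= 4);

        System.out.println(superstep);      // 3
        System.out.println(workset.size()); // 2 -- stopped before empty
    }
}
```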
>>>> The "too few memory segments" problem occurs because the solution set
>>>> is held in an in-memory hash table.
>>>> Spilling that table would result in a significant iteration slowdown. One
>>>> solution is to throw more resources at the problem (more nodes, more RAM).
>>>> Another workaround is to reduce the amount of managed memory and switch
>>>> the solution set hash table to "unmanaged". This will keep the solution
>>>> set in a regular Java HashMap, but it might cause an OOMError and kill
>>>> the JVM if it grows too large.
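For reference, the "unmanaged" switch should be a single setting on the delta iteration (worth double-checking against the DeltaIteration javadocs):

```java
// Keep the solution set in a regular on-heap Java hash map instead of
// Flink's managed memory. This avoids "too few memory segments", but
// risks an OutOfMemoryError if the solution set outgrows the heap.
iteration.setSolutionSetUnManaged(true);
```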
>>>> AFAIK, there is no way to shrink the solution set at the moment. It
>>>> might be worth investigating that direction.
>>>> Regarding your question about the while-not-empty loop: what exactly is
>>>> the difference to the default delta iteration? It does stop when the
>>>> workset is empty (in addition to the custom termination criterion).
>>>> I am not sure if I got all your points. Please let me know if I got
>>>> something wrong.
>>>> Thanks,
>>>> Fabian
>>>> 2015-11-03 13:50 GMT+01:00 André Petermann <
>>>> petermann@informatik.uni-leipzig.de>:
>>>>> Sorry for redundant post ...
>>>>> Hi all,
>>>>> We are working on an implementation of Frequent Subgraph Mining using
>>>>> Flink. Here, the "too few memory segments" error prevents the most
>>>>> promising solution. The problem is not specific to graphs, but applies
>>>>> to all iterative problems where
>>>>> - working and solution sets contain data of different types
>>>>> - the working set may grow, shrink, or be replaced in each iteration
>>>>> - the solution set grows with each iteration
>>>>> - the termination criterion is based on data set metrics,
>>>>>     e.g. while working set not empty
>>>>> An illustration of our problem workflow, generalized to graph-unspecific
>>>>> frequent pattern mining, can be found here:
>>>>> https://github.com/dbs-leipzig/gradoop/blob/master/dev-support/loopWithIntermediateResultMemory.pdf
>>>>> Page 1 shows the most promising solution. We started implementing it
>>>>> using a for loop. However, the "too few memory segments" error makes it
>>>>> untestable. As the iteration body itself is a complex workflow and the
>>>>> number of iterations is arbitrary, unrolling it while reserving operator
>>>>> memory will be a permanent limitation. Increasing limits or physical
>>>>> memory would only delay the problem. The resulting question is:
>>>>> Would it be possible to implement a while-not-empty or at least a for
>>>>> loop that isn't unrolled and can be executed more memory-efficiently?
>>>>> Page 2 shows an alternative solution to our problem using the concept
>>>>> of delta iteration. However, Flink's delta iteration supports neither
>>>>> broadcasting nor working-set-independent intermediate results. Page 3
>>>>> shows our working solution using two workarounds for those restrictions.
>>>>> However, these workarounds lead to unnecessary memory consumption and
>>>>> redundant expensive computations. So, in case the answer to the first
>>>>> question is no, a second question:
>>>>> Would it be possible to extend the delta iteration to support rich map
>>>>> functions with broadcast sets and the retention of intermediate
>>>>> results?
>>>>> We think that a while-not-empty loop might be useful for other
>>>>> algorithms too, e.g. variable-length path search in graphs. Did we miss
>>>>> Flink features meeting our requirements? Do you think it's worth
>>>>> creating an improvement issue? If so, we'd of course be willing to
>>>>> contribute in the form of development.
>>>>> Best Regards
>>>>> Andre
>>>>> --
>>>>> -------------------------------------------
>>>>> PhD Student
>>>>> University of Leipzig
>>>>> Department of Computer Science
>>>>> Database Research Group
>>>>> email: petermann@informatik.uni-leipzig.de
>>>>> web:   dbs.uni-leipzig.de
>>>>> -------------------------------------------
