Date: Wed, 4 Nov 2015 16:32:36 +0100
Subject: Re: Question about limitations of iterative algorithms
From: Vasiliki Kalavri
To: dev@flink.apache.org

Hi Andre,

On 4 November 2015 at 16:04, André Petermann <
petermann@informatik.uni-leipzig.de> wrote:

> Hi Fabian,
>
> thanks for your fast reply!
>
> I created a gist to explain the while-not-empty loop in more detail:
> https://gist.github.com/p3et/9f6e56cf0b68213e3e2b
>
> It is an approach to create a minimal example of the kind of algorithm
> corresponding to page 1 of the PDF, in particular frequent substrings.
> Please don't worry about the algorithm itself; the actual algorithm is much
> more complex. However, even this simple case runs for only 7 iterations on
> my machine before the "too few memory segments" error is raised.
> If I understood delta iteration right, it's necessary to provide a 1:1
> mapping between working and solution set items. The empty-case you referred
> to is the no-more-updates case, but not no-more-items, right?
>

I think it's possible to do what you want with a delta iteration. The
solution set and workset *don't* need to be of the same type. When you
define the delta iteration, e.g. DeltaIteration<ST, WT> iteration = ...,
ST is the type of the solution set and WT is the type of the workset.

> In my example, the working set is completely replaced in each iteration
> (line 52), with only a parent-child mapping. The solution set is initially
> empty (line 33) and stores the results of all iterations (line 48). I hope
> this shows the difference to the delta iteration and while-not-empty.
> Further on, you see the different data types of working and solution sets.
> I will provide a further gist for the "second choice" solution using delta
> iteration. However, what we would actually prefer is to replace the
> for-loop by something like while(embeddings.isNotEmpty()), including a
> truly iterative execution.
>

The default convergence criterion for a delta iteration is an empty
workset. Thus, if you set embeddings as the workset, you have your
"while(embeddings.isNotEmpty())" logic.

Also, as far as I know, there is no problem with appending new elements to
the solution set. So, using allFrequentPatterns as the solution set should
be fine.

> But please let me know if I missed some Flink features already enabling
> such loops.
>
> Thanks,
> Andre

Does this clear things up a bit? Let me know if I misunderstood what you
want to do.

Cheers,
-Vasia.

> On 03.11.2015 15:47, Fabian Hueske wrote:
>
>> Hi Andre,
>>
>> Thanks for reaching out to the Flink community!
>>
>> I am not sure your analysis is based on correct assumptions about Flink's
>> delta iterations.
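To make these semantics concrete outside of Flink, here is a minimal plain-Java sketch of what was just described (element types and the growth rule are invented for the example): the solution set and the workset have different types, the workset is completely replaced each round, and the loop stops when the workset becomes empty, mirroring the default delta-iteration convergence criterion.

```java
import java.util.*;

public class DeltaIterationSketch {
    // Hypothetical element types: the solution set maps pattern -> support,
    // the workset holds raw candidate strings -- two different types,
    // just as ST and WT may differ in DeltaIteration<ST, WT>.
    public static Map<String, Integer> run(List<String> initialWorkset) {
        Map<String, Integer> solutionSet = new HashMap<>(); // "ST"
        List<String> workset = new ArrayList<>(initialWorkset); // "WT"

        // Default delta-iteration convergence: stop when the workset is empty.
        while (!workset.isEmpty()) {
            List<String> nextWorkset = new ArrayList<>();
            for (String candidate : workset) {
                // "Step function": append the candidate to the solution set
                // and emit a grown candidate until a length bound is reached.
                solutionSet.merge(candidate, 1, Integer::sum);
                if (candidate.length() < 3) {
                    nextWorkset.add(candidate + "x");
                }
            }
            workset = nextWorkset; // workset completely replaced each round
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        // Three rounds: {a, b} -> {ax, bx} -> {axx, bxx} -> empty workset.
        System.out.println(run(Arrays.asList("a", "b")));
    }
}
```

The solution set only ever grows here, while the workset shrinks to empty, which is exactly the "while(embeddings.isNotEmpty())" behaviour under discussion.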
>> Flink's delta iterations do support
>> - working and solution sets of different types
>> - worksets that grow and shrink or are completely replaced in each
>> iteration
>> - solution sets that grow (up to the point of the "too few memory
>> segments" error)
>> - termination via a custom aggregator. So it is max number of iterations,
>> empty workset, and a custom termination criterion.
>>
>> The "too few memory segments" problem occurs because the solution
>> set is held in an in-memory hash table.
>> Spilling that table would result in a significant iteration slowdown. One
>> solution is to throw more resources at the problem (more nodes, more RAM).
>> Another work-around is to reduce the amount of managed memory and switch
>> the solution set hash table to "unmanaged". This will keep the result set
>> in a regular Java HashMap, but it might cause an OOMError and kill the JVM
>> if it grows too large.
>> AFAIK, there is no way to shrink the solution set at the moment. It might
>> be worth investigating that direction.
>>
>> Regarding your question about the while-not-empty loop: what exactly is
>> the difference to the default delta iteration? It does stop when the
>> workset is empty (in addition to the custom termination criterion).
>> I am not sure if I got all your points. Please let me know if I got
>> something wrong.
>>
>> Thanks,
>> Fabian
>>
>> 2015-11-03 13:50 GMT+01:00 André Petermann <
>> petermann@informatik.uni-leipzig.de>:
>>
>>> Sorry for the redundant post ...
>>>
>>> Hi all,
>>>
>>> We are working on an implementation of Frequent Subgraph Mining using
>>> Flink. At that, the "too few memory segments" error prevents the most
>>> promising solution.
The problem is not specific to graphs, but to all
>>> iterative problems where
>>>
>>> - working and solution sets contain data of different types
>>> - the working set may grow, shrink, or be replaced in each iteration
>>> - the solution set grows in each iteration
>>> - the termination criterion is based on data set metrics,
>>>   e.g. while working set not empty
>>>
>>> An illustration of our problem workflow, generalized to graph-unspecific
>>> frequent pattern mining, can be found here:
>>>
>>> https://github.com/dbs-leipzig/gradoop/blob/master/dev-support/loopWithIntermediateResultMemory.pdf
>>>
>>> Page 1 shows the most promising solution. We started implementing it
>>> using a for loop. However, the "too few memory segments" error makes it
>>> untestable. As the iteration body itself is a complex workflow and the
>>> number of iterations is arbitrary, unrolling it while reserving operator
>>> memory will be a permanent limitation. Increasing limits or physical
>>> memory would only delay the problem. The resulting question is:
>>>
>>> Would it be possible to implement a while-not-empty loop, or at least a
>>> for loop, that isn't unrolled and can be executed more memory-efficiently?
>>>
>>> Page 2 shows an alternative solution to our problem using the concept of
>>> delta iteration. However, Flink's delta iteration supports neither
>>> broadcasting nor working-set-independent intermediate results. Page 3
>>> shows our working solution using two workarounds for those restrictions.
>>> However, these workarounds lead to unnecessary memory consumption and
>>> redundant expensive computations. So, in case the answer to the
>>> first question is no, a second question:
>>>
>>> Would it be possible to extend the delta iteration with support for
>>> rich map functions with broadcast sets and the memory of intermediate
>>> results?
>>>
>>> We think that a while-not-empty loop might be useful for other
>>> algorithms too, e.g.
variable-length path search in graphs. Did we miss
>>> Flink features meeting our requirements? Do you think it's worth
>>> creating an improvement issue? If so, we'd of course be willing to
>>> contribute in the form of development.
>>>
>>> Best regards
>>> Andre
>>>
>>> --
>>> -------------------------------------------
>>> PhD Student
>>> University of Leipzig
>>> Department of Computer Science
>>> Database Research Group
>>>
>>> email: petermann@informatik.uni-leipzig.de
>>> web: dbs.uni-leipzig.de
>>> -------------------------------------------
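The while-not-empty loop discussed in this thread can be sketched, independently of Flink, as a plain-Java driver loop whose iteration count depends on the data rather than on a fixed bound. The example below is only an illustration of the desired semantics, not Flink code; the tiny corpus, the min-support threshold, and the per-occurrence support counting are invented for the sketch.

```java
import java.util.*;

public class WhileNotEmptySketch {
    // Frequent-substring mining in miniature: grow substrings length by
    // length, keep only those reaching minSupport (counted per occurrence),
    // and stop as soon as no candidate survives -- a data-dependent
    // termination criterion rather than a fixed iteration count.
    public static Set<String> frequentSubstrings(List<String> corpus, int minSupport) {
        Set<String> results = new HashSet<>(); // solution set: grows every round
        int k = 1;
        while (true) {
            // Count all substrings of length k across the corpus.
            Map<String, Integer> support = new HashMap<>();
            for (String s : corpus) {
                for (int i = 0; i + k <= s.length(); i++) {
                    support.merge(s.substring(i, i + k), 1, Integer::sum);
                }
            }
            // Workset for the next round: frequent substrings of length k.
            List<String> frequent = new ArrayList<>();
            for (Map.Entry<String, Integer> e : support.entrySet()) {
                if (e.getValue() >= minSupport) {
                    frequent.add(e.getKey());
                }
            }
            if (frequent.isEmpty()) break; // while-not-empty termination
            results.addAll(frequent);
            k++;
        }
        return results;
    }

    public static void main(String[] args) {
        // Stops after k=2 because no length-3 substring reaches support 2.
        System.out.println(frequentSubstrings(Arrays.asList("abab", "abba"), 2));
    }
}
```

Note how the number of loop passes is only known once the workset runs empty, which is precisely what makes unrolling the loop into a fixed-size dataflow plan awkward.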