Date: Wed, 4 Nov 2015 16:32:36 +0100
Subject: Re: Question about limitations of iterative algorithms
From: Vasiliki Kalavri
To: dev@flink.apache.org

Hi Andre,

On 4 November 2015 at 16:04, André Petermann <
petermann@informatik.uni-leipzig.de> wrote:

> Hi Fabian,
>
> thanks for your fast reply!
>
> I created a gist to explain the while-not-empty loop in more detail:
> https://gist.github.com/p3et/9f6e56cf0b68213e3e2b
>
> It is an approach to create a minimal example of the kind of algorithm
> corresponding to page 1 of the PDF, in particular frequent substrings.
> Please don't worry about the algorithm itself; the actual algorithm is much
> more complex. However, even this simple case runs for only 7 iterations on
> my machine before the "too few memory segments" error is raised.
> If I understood delta iteration right, it's necessary to provide a 1:1
> mapping between working and solution set items. The empty-case you referred
> to is the no-more-updates case, but not no-more-items, right?
>

I think it's possible to do what you want with a delta iteration. The
solution set and workset *don't* need to be of the same type. When you
define the delta iteration, e.g. DeltaIteration<ST, WT> iteration = ...,
ST is the type of the solution set and WT is the type of the workset.

> In my example, the working set is completely replaced in each iteration
> (line 52), with only a parent-child mapping. The solution set is initially
> empty (line 33) and stores the results of all iterations (line 48). I hope
> this shows the difference to the delta iteration and while-not-empty.
> Further on, you see the different data types of working and solution sets.
> I will provide a further gist for the "second choice" solution using delta
> iteration. However, what we would actually prefer is to replace the
> for-loop by something like while(embeddings.isNotEmpty()), including a
> truly iterative execution.
>

The default convergence criterion for a delta iteration is an empty
workset. Thus, if you set embeddings as the workset, you have your
"while(embeddings.isNotEmpty())" logic.

Also, as far as I know, there is no problem with appending new elements to
the solution set. So, using allFrequentPatterns as the solution set should
be fine.

> But please let me know if I missed some Flink features already enabling
> such loops.
>
> Thanks,
> Andre

Does this clear things up a bit? Let me know if I misunderstood what you
want to do.

Cheers,
-Vasia.

> On 03.11.2015 15:47, Fabian Hueske wrote:
>
>> Hi Andre,
>>
>> Thanks for reaching out to the Flink community!
>>
>> I am not sure your analysis is based on correct assumptions about Flink's
>> delta iterations.
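To make these semantics concrete outside of Flink, here is a minimal plain-Java sketch of what was just described (element types and the growth rule are invented for the example): the solution set and the workset have different types, the workset is completely replaced each round, and the loop stops when the workset becomes empty, mirroring the default delta-iteration convergence criterion.

```java
import java.util.*;

public class DeltaIterationSketch {
    // Hypothetical element types: the solution set maps pattern -> support,
    // the workset holds raw candidate strings -- two different types,
    // just as ST and WT may differ in DeltaIteration<ST, WT>.
    public static Map<String, Integer> run(List<String> initialWorkset) {
        Map<String, Integer> solutionSet = new HashMap<>(); // "ST"
        List<String> workset = new ArrayList<>(initialWorkset); // "WT"

        // Default delta-iteration convergence: stop when the workset is empty.
        while (!workset.isEmpty()) {
            List<String> nextWorkset = new ArrayList<>();
            for (String candidate : workset) {
                // "Step function": append the candidate to the solution set
                // and emit a grown candidate until a length bound is reached.
                solutionSet.merge(candidate, 1, Integer::sum);
                if (candidate.length() < 3) {
                    nextWorkset.add(candidate + "x");
                }
            }
            workset = nextWorkset; // workset completely replaced each round
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        // Three rounds: {a, b} -> {ax, bx} -> {axx, bxx} -> empty workset.
        System.out.println(run(Arrays.asList("a", "b")));
    }
}
```

The solution set only ever grows here, while the workset shrinks to empty, which is exactly the "while(embeddings.isNotEmpty())" behaviour under discussion.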
>> Flink's delta iterations do support
>> - working and solution sets of different types
>> - worksets that grow and shrink or are completely replaced in each
>> iteration
>> - solution sets that grow (up to the point of the "too few memory
>> segments" error)
>> - termination via a custom aggregator. So it is max number of iterations,
>> empty workset, and a custom termination criterion.
>>
>> The "too few memory segments" problem occurs because the solution
>> set is held in an in-memory hash table.
>> Spilling that table would result in a significant iteration slowdown. One
>> solution is to throw more resources at the problem (more nodes, more RAM).
>> Another work-around is to reduce the amount of managed memory and switch
>> the solution set hash table to "unmanaged". This will keep the result set
>> in a regular Java HashMap, but it might cause an OOMError and kill the JVM
>> if it grows too large.
>> AFAIK, there is no way to shrink the solution set at the moment. It might
>> be worth investigating that direction.
>>
>> Regarding your question about the while-not-empty loop: what exactly is
>> the difference to the default delta iteration? It does stop when the
>> workset is empty (in addition to the custom termination criterion).
>> I am not sure if I got all your points. Please let me know if I got
>> something wrong.
>>
>> Thanks,
>> Fabian
>>
>> 2015-11-03 13:50 GMT+01:00 André Petermann <
>> petermann@informatik.uni-leipzig.de>:
>>
>>> Sorry for the redundant post ...
>>>
>>> Hi all,
>>>
>>> We are working on an implementation of Frequent Subgraph Mining using
>>> Flink. At that, the "too few memory segments" error prevents the most
>>> promising solution.
The problem is not specific to graphs, but to all
>>> iterative problems where
>>>
>>> - working and solution sets contain data of different types
>>> - the working set may grow, shrink, or be replaced in each iteration
>>> - the solution set grows in each iteration
>>> - the termination criterion is based on data set metrics,
>>>   e.g. while working set not empty
>>>
>>> An illustration of our problem workflow, generalized to graph-unspecific
>>> frequent pattern mining, can be found here:
>>>
>>> https://github.com/dbs-leipzig/gradoop/blob/master/dev-support/loopWithIntermediateResultMemory.pdf
>>>
>>> Page 1 shows the most promising solution. We started implementing it
>>> using a for loop. However, the "too few memory segments" error makes it
>>> untestable. As the iteration body itself is a complex workflow and the
>>> number of iterations is arbitrary, unrolling it while reserving operator
>>> memory will be a permanent limitation. Increasing limits or physical
>>> memory would only delay the problem. The resulting question is:
>>>
>>> Would it be possible to implement a while-not-empty loop, or at least a
>>> for loop, that isn't unrolled and can be executed more memory-efficiently?
>>>
>>> Page 2 shows an alternative solution to our problem using the concept of
>>> delta iteration. However, Flink's delta iteration supports neither
>>> broadcasting nor working-set-independent intermediate results. Page 3
>>> shows our working solution using two workarounds for those restrictions.
>>> However, these workarounds lead to unnecessary memory consumption and
>>> redundant expensive computations. So, in case the answer to the
>>> first question is no, a second question:
>>>
>>> Would it be possible to extend the delta iteration with support for
>>> rich map functions with broadcast sets and the memory of intermediate
>>> results?
>>>
>>> We think that a while-not-empty loop might be useful for other
>>> algorithms too, e.g.
variable-length path search in graphs. Did we miss
>>> Flink features meeting our requirements? Do you think it's worth
>>> creating an improvement issue? If so, we'd of course be willing to
>>> contribute in the form of development.
>>>
>>> Best regards
>>> Andre
>>>
>>> --
>>> -------------------------------------------
>>> PhD Student
>>> University of Leipzig
>>> Department of Computer Science
>>> Database Research Group
>>>
>>> email: petermann@informatik.uni-leipzig.de
>>> web: dbs.uni-leipzig.de
>>> -------------------------------------------
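The while-not-empty loop discussed in this thread can be sketched, independently of Flink, as a plain-Java driver loop whose iteration count depends on the data rather than on a fixed bound. The example below is only an illustration of the desired semantics, not Flink code; the tiny corpus, the min-support threshold, and the per-occurrence support counting are invented for the sketch.

```java
import java.util.*;

public class WhileNotEmptySketch {
    // Frequent-substring mining in miniature: grow substrings length by
    // length, keep only those reaching minSupport (counted per occurrence),
    // and stop as soon as no candidate survives -- a data-dependent
    // termination criterion rather than a fixed iteration count.
    public static Set<String> frequentSubstrings(List<String> corpus, int minSupport) {
        Set<String> results = new HashSet<>(); // solution set: grows every round
        int k = 1;
        while (true) {
            // Count all substrings of length k across the corpus.
            Map<String, Integer> support = new HashMap<>();
            for (String s : corpus) {
                for (int i = 0; i + k <= s.length(); i++) {
                    support.merge(s.substring(i, i + k), 1, Integer::sum);
                }
            }
            // Workset for the next round: frequent substrings of length k.
            List<String> frequent = new ArrayList<>();
            for (Map.Entry<String, Integer> e : support.entrySet()) {
                if (e.getValue() >= minSupport) {
                    frequent.add(e.getKey());
                }
            }
            if (frequent.isEmpty()) break; // while-not-empty termination
            results.addAll(frequent);
            k++;
        }
        return results;
    }

    public static void main(String[] args) {
        // Stops after k=2 because no length-3 substring reaches support 2.
        System.out.println(frequentSubstrings(Arrays.asList("abab", "abba"), 2));
    }
}
```

Note how the number of loop passes is only known once the workset runs empty, which is precisely what makes unrolling the loop into a fixed-size dataflow plan awkward.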