Subject: Re: buffering in operators, implementing statistics
From: Stavros Kontopoulos
To: dev@flink.apache.org
Date: Tue, 31 May 2016 12:15:37 +0300

Hi Stephan,

An external project would be possible, and we could maybe merge it into Flink
in the future if that makes sense. I just wanted to point out that in general
there is a need, but I understand the priorities and may also try to work on
those.

Best,
Stavros

On Thu, May 26, 2016 at 10:00 PM, Stephan Ewen wrote:

> Hi Stavros!
>
> I think what Aljoscha wants to say is that the community is a bit
> hard-pressed to review new and complex things right now.
> There are a lot of threads going on already.
>
> If you want to work on this, why not make your own GitHub project,
> "Approximate algorithms on Apache Flink" or so?
>
> Greetings,
> Stephan
>
>
> On Wed, May 25, 2016 at 3:02 PM, Aljoscha Krettek wrote:
>
> > Hi,
> > that link was interesting, thanks! As I said though, it's probably not a
> > good fit for Flink right now.
> >
> > The things that I feel are important right now are:
> >
> > - dynamic scaling: the ability of a streaming pipeline to adapt to
> > changes in the amount of incoming data. This is tricky with stateful
> > operations and long-running pipelines. For Spark this is easier to do
> > because every mini-batch is individually scheduled and can therefore be
> > scheduled on a different number of machines.
> >
> > - an API for joining static (or slowly evolving) data with streaming
> > data: this has been coming up in different forms on the mailing lists
> > and when talking with people. Apache Beam solves this with "side
> > inputs". In Flink we want to add something as well, maybe along the
> > lines of side inputs or maybe something more specific for the case of
> > pure joins.
> >
> > - working on managed memory: in Flink we were always very conscious
> > about how memory was used; we were using our own abstractions for
> > dealing with memory and efficient serialization. We call this the
> > "managed memory" abstraction. Spark recently also started going in this
> > direction with Project Tungsten. For the streaming API there are still
> > some places where we could make things more efficient by working on
> > managed memory more; for example, there is no state backend that uses
> > managed memory. We are either completely on the Java heap or use RocksDB
> > there.
> >
> > - stream SQL: this is obvious and everybody wants it.
> >
> > - a generic cross-runner API: this is what Apache Beam (née Google
> > Dataflow) does. It is very interesting to write a program once and then
> > be able to run it on different runners. This brings more flexibility for
> > users. It's not clear how this will play out in the long run, but it's
> > very interesting to keep an eye on.
> >
> > For most of these the Flink community is in various stages of
> > implementing them, so that's good. :-)
> >
> > Cheers,
> > Aljoscha
> >
> > On Mon, 23 May 2016 at 17:48 Stavros Kontopoulos
> > <st.kontopoulos@gmail.com> wrote:
> >
> > > Hey Aljoscha,
> > >
> > > Thanks for the useful comments. I have recently looked at Spark
> > > sketches:
> > >
> > > http://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
> > >
> > > So there must be value in this effort.
> > > In my experience, counting in general is a common need for large data
> > > sets. For example, people often use Redis in a non-streaming setting
> > > for its HyperLogLog algorithm.
> > >
> > > What other areas do you find more important or of higher priority for
> > > the time being?
> > >
> > > Best,
> > > Stavros
> > >
> > > On Mon, May 23, 2016 at 6:18 PM, Aljoscha Krettek wrote:
> > >
> > > > Hi,
> > > > no such changes are planned right now. The separation between the
> > > > keys is very strict in order to make the windowing state
> > > > re-partitionable, so that we can implement dynamic rescaling of the
> > > > parallelism of a program.
> > > >
> > > > WindowAll is only used for specific cases where you need a Trigger
> > > > that sees all elements of the stream. I personally don't think it is
> > > > very useful because it is not scalable. In theory, for time windows
> > > > this can be parallelized, but it is not currently done in Flink.
> > > >
> > > > Do you have a specific use case for the count-min sketch in mind? If
> > > > not, maybe our energy is better spent on producing examples with
> > > > real-world applicability. I'm not against having an example for a
> > > > count-min sketch, I'm just worried that you might put your energy
> > > > into something that is not useful to a lot of people.
> > > >
> > > > Cheers,
> > > > Aljoscha
> > > >
> > > > On Fri, 20 May 2016 at 20:13 Stavros Kontopoulos
> > > > <st.kontopoulos@gmail.com> wrote:
> > > >
> > > > > Hi, thanks for the feedback.
> > > > >
> > > > > So there is a limitation due to the parallel windows
> > > > > implementation. Are there no intentions to change that somehow to
> > > > > accommodate similar estimations?
> > > > >
> > > > > Is WindowAll used in practice as a step in the pipeline? I mean,
> > > > > since it is inherently not parallel, it cannot scale, correct?
> > > > > Although there is an exception: "Only for special cases, such as
> > > > > aligned time windows is it possible to perform this operation in
> > > > > parallel." I am probably missing something...
> > > > >
> > > > > I could try to do the example stuff (and open a new feature on
> > > > > JIRA for that). I will also vote for closing the old issue, since
> > > > > there is no other way, at least for the time being...
> > > > >
> > > > > Thanks,
> > > > > Stavros
> > > > >
> > > > > On Fri, May 20, 2016 at 7:02 PM, Aljoscha Krettek
> > > > > <aljoscha@apache.org> wrote:
> > > > >
> > > > > > Hi,
> > > > > > with how the window API currently works, this can only be done
> > > > > > for non-parallel windows. For keyed windows everything that
> > > > > > happens is scoped to the key of the elements: window contents
> > > > > > are kept in per-key state, and triggers fire on a per-key basis.
> > > > > > Therefore a count-min sketch cannot be used, because it would
> > > > > > require keeping state across keys.
> > > > > >
> > > > > > For non-parallel windows a user could do this:
> > > > > >
> > > > > > DataStream input = ...
> > > > > > input
> > > > > >   .windowAll()
> > > > > >   .fold(new MySketch(), new MySketchFoldFunction())
> > > > > >
> > > > > > with sketch data types and a fold function that are tailored to
> > > > > > the user types. Therefore, I would prefer not to add a special
> > > > > > API for this and vote to close
> > > > > > https://issues.apache.org/jira/browse/FLINK-2147. I already
> > > > > > commented on https://issues.apache.org/jira/browse/FLINK-2144,
> > > > > > saying a similar thing.
> > > > > >
> > > > > > What I would welcome very much is adding some well-documented
> > > > > > examples to Flink that showcase how some of these operations can
> > > > > > be written.
> > > > > >
> > > > > > Cheers,
> > > > > > Aljoscha
> > > > > >
> > > > > > On Thu, 19 May 2016 at 16:38 Stavros Kontopoulos
> > > > > > <st.kontopoulos@gmail.com> wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > I would like to push forward the work here:
> > > > > > > https://issues.apache.org/jira/browse/FLINK-2147
> > > > > > >
> > > > > > > Can anyone more familiar with the streaming API verify whether
> > > > > > > this could be a viable task?
> > > > > > > The intention is to summarize data over a window, like in the
> > > > > > > case of StreamGroupedFold; specifically, to implement
> > > > > > > count-min in a window.
> > > > > > >
> > > > > > > Best,
> > > > > > > Stavros
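
As a starting point for such an example, here is a minimal sketch of what the
MySketch / MySketchFoldFunction pair from the snippet above could look like as
a count-min sketch folded over a non-parallel window. The class names, the
hash scheme, the width/depth parameters, and the window assigner in the usage
comment are illustrative assumptions, not something prescribed in the thread
or by the Flink API.

import org.apache.flink.api.common.functions.FoldFunction;

// Count-min sketch accumulator: depth hash rows, width counters per row.
// Serializable so it can be used as the initial value of a window fold.
public class CountMinSketch implements java.io.Serializable {

    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] counts;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
    }

    // Increment the counter for this item in every row.
    public void add(String item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    // The estimate is the minimum over all rows; it is an upper bound on
    // the true count.
    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }

    // Simple seeded hash for illustration only; a real implementation would
    // use pairwise-independent hash functions.
    private int bucket(String item, int seed) {
        int h = item.hashCode() * (31 + seed) + seed;
        return Math.floorMod(h, width);
    }
}

// Fold function that feeds each element of the window into the sketch.
class SketchFoldFunction implements FoldFunction<String, CountMinSketch> {
    @Override
    public CountMinSketch fold(CountMinSketch sketch, String value) {
        sketch.add(value);
        return sketch;
    }
}

// Usage on a DataStream<String> named "input" (the window assigner and the
// sketch dimensions below are placeholders):
//
// input
//     .timeWindowAll(Time.seconds(10))
//     .fold(new CountMinSketch(4, 2048), new SketchFoldFunction());

The fold keeps a single sketch per window, which is exactly the non-parallel
WindowAll case discussed above; applied to keyed windows, the same fold would
instead produce one sketch per key.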