Return-Path: X-Original-To: apmail-streams-dev-archive@minotaur.apache.org Delivered-To: apmail-streams-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1271111FB5 for ; Tue, 6 May 2014 13:27:56 +0000 (UTC) Received: (qmail 45815 invoked by uid 500); 6 May 2014 13:24:48 -0000 Delivered-To: apmail-streams-dev-archive@streams.apache.org Received: (qmail 45755 invoked by uid 500); 6 May 2014 13:24:46 -0000 Mailing-List: contact dev-help@streams.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@streams.incubator.apache.org Delivered-To: mailing list dev@streams.incubator.apache.org Received: (qmail 45737 invoked by uid 99); 6 May 2014 13:24:46 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 May 2014 13:24:46 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of m.ben.franklin@gmail.com designates 209.85.216.175 as permitted sender) Received: from [209.85.216.175] (HELO mail-qc0-f175.google.com) (209.85.216.175) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 May 2014 13:24:42 +0000 Received: by mail-qc0-f175.google.com with SMTP id w7so7792244qcr.20 for ; Tue, 06 May 2014 06:24:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=AyLJxM7A3fslR9ol7Srxq6iVHwS5K+0Z8wCOvVoeAe4=; b=p3P/sdX+RmU901jqH/UYRvWsiwJNRMKoz2BGH+XfIpxOXeRNlyu5nCjA29HS+S4Aj+ NIPTjQUzqeOEvRMqoIak2OOmLLe+9HKb1jL39xm15JsLPahTM4yDcT09O/nLNyvx8r+V xJAK1onCwT75ygr5nT4zNqHaQxlX0b8RSAs02a4vvi1Og9GGQr4Q7reUALmfhYb/qA+A W/JMkSleDpd2Ty2ZSFJsimZc+bBF4H//EoYZkrwHXpeTFfO0NHamo7B/GdXZ9cNsnxlj oh2pHTGdaGV63Gu4Fbxv9LmzABOvt+fBZbknUuyI7/a2aPDGSrH4yOjRymT4ZY3Wg5g+ YJaQ== MIME-Version: 1.0 X-Received: by 10.140.92.131 with SMTP id b3mr49641762qge.41.1399382659091; Tue, 06 May 2014 06:24:19 -0700 (PDT) Received: by 10.140.32.97 with HTTP; Tue, 6 May 2014 06:24:19 -0700 (PDT) In-Reply-To: References: Date: Tue, 6 May 2014 09:24:19 -0400 Message-ID: Subject: Re: Proposing Changes to the StreamsProvider interface and StreamsProviderTask From: Matt Franklin To: "dev@streams.incubator.apache.org" Content-Type: multipart/alternative; boundary=001a1139bdee4cd27c04f8bb2ac2 X-Virus-Checked: Checked by ClamAV on apache.org --001a1139bdee4cd27c04f8bb2ac2 Content-Type: text/plain; charset=UTF-8 On Mon, May 5, 2014 at 1:15 PM, Steve Blackmon wrote: > What I meant to say re #1 below is that batch-level metadata could be > useful for modules downstream of the StreamsProvider / > StreamsPersistReader, and the StreamsResultSet gives us a class to > which we can add new metadata in core as the project evolves, or > supplement on a per-module or per-implementation basis via > subclassing. Within a provider there's no need to modify or extend > StreamsResultSet to maintain and utilize state from a third-party API. > I agree that in batch mode, metadata might be important. In conversations with other people, I think what might be missing is a completely reactive, event-driven mode where a provider pushes to the rest of the stream rather than gets polled. > > I think I would support making StreamsResultSet an interface rather > than a class. > +1 on interface > > Steve Blackmon > sblackmon@apache.org > > On Mon, May 5, 2014 at 12:07 PM, Steve Blackmon > wrote: > > Comments on this in-line below. > > > > On Thu, May 1, 2014 at 4:38 PM, Ryan Ebanks > wrote: > >> The use and implementations of the StreamsProviders seems to have > drifted > >> away from what it was originally designed for. I recommend that we > change > >> the StreamsProvider interface and StreamsProvider task to reflect the > >> current usage patterns and to be more efficient. > >> > >> Current Problems: > >> > >> 1.) newPerpetualStream in LocalStream builder is not perpetual. The > >> StreamProvider task will shut down after a certain amount of empty > returns > >> from the provider. A perpetual stream implies that it will run in > >> perpetuity. If I open a Twitter Gardenhose that is returning tweets > with > >> obscure key words, I don't want my stream shutting down if it is just > quiet > >> for a few time periods. > >> > >> 2.) StreamsProviderTasks assumes that a single read*, will return all > the > >> data for that request. This means that if I do a readRange for a year, > the > >> provider has to hold all of that data in memory and return it as one > >> StreamsResultSet. I believe the readPerpetual was designed to get > around > >> this problem. > >> > >> Proposed Fixes/Changes: > >> > >> Fix 1.) Remove the StreamsResultSet. No implementations in the project > >> currently use it for anything other than a wrapper around a Queue that > is > >> then iterated over. StreamsProvider will now return a > Queue > >> instead of a StreamsResultSet. This will allow providers to queue data > as > >> they receive it, and the StreamsProviderTask can pop them off as soon as > >> they are available. It will help fix problem #2, as well as help to > lower > >> memory usage. > >> > > > > I'm not convinced this is a good idea. StreamsResultSet is a useful > > abstraction even if no modules are using it as more than a wrapper for > > Queue at the moment. For example read* in a provider or persistReader > > could return batch-level (as opposed to datum-level) metadata from the > > underlying API which would be useful state for the provider. > > Switching to Queue would eliminate our ability to add those > > capabilities at the core level or at the module level. > > > >> Fix 2.) Add a method, public boolean isRunning(), to the StreamsProvider > >> interface. The StreamsProviderTask can call this function to see if the > >> provider is still operating. This will help fix problems #1 and #2. This > >> will allow the provider to run mulitthreaded, queue data as it's > available, > >> and notify the task when it's done so that it can be closed down > properly. > >> It will also allow the stream to be run in perpetuity as the StreamTask > >> won't shut down providers that have not been producing data for a while. > >> > > > > I think this is a good idea. +1 > > > >> Right now the StreamsProvider and StreamsProviderTask seem to be full of > >> short term fixes that need to be redesigned into long term solutions. > With > >> enough positive feedback, I will create Jira tasks, a feature branch, > and > >> begin work. > >> > >> Sincerely, > >> Ryan Ebanks > --001a1139bdee4cd27c04f8bb2ac2--