streams-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthew Hager [W2O Digital]" <mha...@w2odigital.com>
Subject Re: Proposing Changes to the StreamsProvider interface and StreamsProviderTask
Date Fri, 16 May 2014 02:29:29 GMT
Steve,

Thanks for the reply!

I think that both can be accommodated, using different interfaces, maybe
there is room for 2 types of processor? While, this paradigm might be
great for Pig and MR1, it falls short on what can be done with Storm. It
also falls short of many complicated ETL problems that many major
companies face. I think having more flexibility would allow for Streams to
appeal to a wider audience.

Thanks!
Smashew



On 5/14/14, 10:16 PM, "Steve Blackmon" <steve@blackmon.org> wrote:

>Fundamentally processors as initially conceived do not maintain fire
>events autonomously or maintain state between messages.  Changing that
>paradigm would mean Pig/MR1 would not longer be capable of serving as a
>full-featured processor runtime.  Agree this is limiting, but only in
>terms of what a processor can do, not what streams can do.
>
>Steve Blackmon
>
>> On May 14, 2014, at 2:02 PM, Ryan Ebanks <ryanebanks@gmail.com> wrote:
>> 
>> I think a processor should solely be responsible for processing data.  I
>> think the interface describes exactly what a processor should do, get a
>> piece of data and produce out put data.  Having it do more than that is
>> expanding the functionality of a processor beyond its intent.
>> 
>> I do agree that being able to fire off events for lack of data is
>> necessary, but that should be the responsibility of the provider.  The
>> provider is the connection to the outside world and  should be in
>>charge of
>> producing new data and firing off events when new data is not produced
>>or
>> missing.  The provider should know exactly when the last piece of new
>>data
>> entered the system, and so it will know when to send alerts.  Everything
>> downstream of the provider should be a closed flow and have guarantees
>>of
>> processing.  Which most runtimes do, the local runtime needs a lot of
>>love
>> though.
>> 
>> Without a doubt the runtime environments need to be improved and
>>rigorously
>> tested to make sure that once data enters a stream from a provider,
>>that no
>> data is dropped and completes the stream.  However, I do not believe
>>that
>> changing the processor is the answer that problem.
>> 
>> Ryan Ebanks
>> 
>> 
>> On Wed, May 14, 2014 at 3:12 PM, Matthew Hager [W2O Digital] <
>> mhager@w2odigital.com> wrote:
>> 
>>> Matt,
>>> 
>>> 
>>> As always thanks for your feedback and mentorship as I work to
>>>contribute
>>> to this project.
>>> 
>>> I feel the current pattern for processor is extremely limiting given
>>>the
>>> constraint of the return statement rather than the alternative of
>>>receive
>>> -> process -> write. It seems that if we revise the current interfaces
>>>we
>>> could handle many more use-cases with streams that cannot currently be
>>> accomplished.
>>> 
>>> Current "StreamsProcessor":
>>> 
>>> -- Contract --
>>> List<StreamsDatum> process(StreamsDatum datum);
>>> 
>>> -- Drawbacks --
>>> For the processor to access the 'writer' he must have an item to be
>>> processed. If he doesn't have an item, then he can't write anything
>>> downstream.
>>> 
>>> -- Use Cases Not Satisfied --
>>> In many cases the absence of data on a 'stream' is an event in itself.
>>>For
>>> instance, if I want to have a listener at various portions of the
>>>process
>>> to ensure my processing data providers are working appropriately. If I
>>> want to send out a message, log an event to Oracle, and alert
>>>PagerDuty,
>>> for stale data. I can write a quick processor that checks for stale
>>>data,
>>> then upon recognition of stale data (timer expiring), fire the proper
>>> events to it's writers.
>>> 
>>> Other use cases involve... debugging, stream summarization metrics,
>>> trending algorithms, processing  stats, and many others.
>> 
>> 
>> 
>> 
>> 
>> 
>>> 
>>> +++ Proposed Interfaces +++
>>> 
>>> // Analogous to Provider
>>> public interface Provider {
>>>        push(StreamsDatum);
>>>        registerListener(Listener);
>>> }
>>> 
>>> 
>>> // Analogous to Writer
>>> public interface Listener {
>>>        receive(StreamsDatum);
>>> }
>>> 
>>> 
>>> // Analogous to Processor
>>> public interface Processor implements Prover, Listener {
>>> 
>>> }
>>> 
>>> +++ Conclusion +++
>>> I feel the event driven model is more flexible than the current model.
>>>We
>>> can write an interface for the legacy 'contract' to ease
>>>implementation /
>>> transition.
>>> 
>>> IE:
>>> 
>>> interface LegacyProcessor {
>>>        List<StreamsDatum> process(StreamsDatum datum);
>>> 
>>> }
>>> 
>>> class LegacyProcessorImpl {
>>> 
>>>        listen(StreamsDatum datum) {?
>>>                forEach(StreamsDatum d : this.transform(datum) {
>>>                        ?this.write(datum);?
>>>                }
>>>        ?}
>>> 
>>> }
>>> 
>>> 
>>> I really love streams, I think the concept is great, I think adding
>>>this
>>> would take streams from "great" --> "OMG, the whole world should use
>>>this
>>> all the time for everything ever, it is flipping amazing and better
>>>than
>>> ice cream on a hot sunny day."
>>> 
>>> 
>>> Thoughts?
>>> 
>>> 
>>> 
>>> 
>>> Thanks!
>>> Smashew
>>> 
>>> 
>>> 
>>>> On 5/13/14, 2:38 PM, "Matt Franklin" <m.ben.franklin@gmail.com> wrote:
>>>> 
>>>> On Tue, May 6, 2014 at 10:53 PM, Matthew Hager [W2O Digital] <
>>>> mhager@w2odigital.com> wrote:
>>>> 
>>>> <snip />
>>>> 
>>>> 
>>>>> StreamsResultSet - I actually found this to be quite useful
>>>>>paradigm. A
>>>>> queue prevents a buffer overflow, an iterator makes it fun and easy
>>>>>to
>>>>> read (I love iterators), and it is simple and succinct. I do,
>>>>>however,
>>>>> feel it is best expressed as an interface instead of a class.
>>>> 
>>>> 
>>>> I agree with an interface.  IMO, anything that is not a utility style
>>>> helper should be interacted with via its interface.
>>>> 
>>>> 
>>>>> The thing missing from this, as an interface, would be the notion of
>>>>> "isRunning" which could easily
>>>>> satisfy both of the aforementioned modalities.
>>>> 
>>>> Reasonable.
>>>> 
>>>> 
>>>>> Event Driven - I concur with Matt 100% on this. As currently
>>>>> implemented,
>>>>> LocalStreamsBuilder is exceedingly inefficient from a memory
>>>>>perspective
>>>>> and time execution perspective. To me, it seems, that we could almost
>>>>> abstract out 2 common interfaces to make this happen.
>>>>> 
>>>>>        * Listener { receive(StreamsDatum); }
>>>>>        * Producer { push(StreamsDatum); registerListener(Listener); }
>>>>> 
>>>>> Where the following implementations would place:
>>>>> 
>>>>>        * Reader implements Producer
>>>>>        * Processor implements Producer, Listener
>>>>>        * Writer implements Listener
>>>> 
>>>> Seems logical.  I would like to see the two possible operating modes
>>>> represented as distinct interfaces.
>>>> 
>>> 
>>> 


Mime
View raw message