streams-dev mailing list archives

From Jason Letourneau <jletournea...@gmail.com>
Subject Re: Streams Subscriptions
Date Fri, 01 Feb 2013 22:32:27 GMT
I definitely think we can support the OSGi wish based on our current component breakdown.

Sent from my iPhone

On Feb 1, 2013, at 4:23 PM, "Steve Blackmon [W2O Digital]" <sblackmon@w2odigital.com>
wrote:

> One nice thing Lucene offers is support for nested conditional logic right
> in the query - so subscribers can request very complicated filters with a
> single filter tag in the JSON request.  Lucene is also the basis for
> querying Elasticsearch and some of the largest data providers such as
> Sysomos/Marketwire.  Within W2O we have a large library of Lucene queries,
> and it would be great to use those with minimal modification to configure
> streams.
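Lucene's query syntax keeps the nested conditional logic Steve describes inside a single filter string; a query along these lines would be one example (the field names here are hypothetical, loosely modeled on Activity Streams properties):

```text
(verb:post OR verb:share)
  AND (object.objectType:blogpost OR object.objectType:article)
  AND NOT provider.id:testprovider
```

The parentheses, `AND`/`OR`/`NOT` operators, and `field:value` terms are standard Lucene query syntax, so an expression like this could be passed through as-is to anything Lucene-backed, including Elasticsearch.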
> 
> But this brings up a wider topic regarding adoption - many users will be
> migrating or integrating solutions where they filter based on Lucene, or
> Solr, or Hamcrest, or regex, etc.  So a plug-in architecture that would
> let users who can compile Java embed whatever filtering logic works best
> for them into streams, without having to commit to master, would be
> advisable.  Bonus points if those plugins can bring their own classpath
> via OSGi or a similar approach.
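One possible shape for that plug-in seam, sketched in JavaScript: a registry that maps a filter type to a factory producing a predicate over activities. All names here are hypothetical - the thread only asks that users be able to embed their own filtering logic.

```javascript
// Hypothetical plug-in registry: each plug-in turns a filter spec into a
// predicate over activity objects.  Lucene-, Hamcrest-, or regex-backed
// implementations could all register here without touching core code.
const filterPlugins = new Map();

function registerFilterPlugin(name, factory) {
  filterPlugins.set(name, factory);
}

function buildPredicate(type, spec) {
  const factory = filterPlugins.get(type);
  if (!factory) throw new Error("no filter plugin registered for: " + type);
  return factory(spec);
}

// A simple regex plug-in as one example implementation.
registerFilterPlugin("regex", (spec) =>
  (activity) => new RegExp(spec.pattern).test(activity[spec.field] || ""));

const pred = buildPredicate("regex", { field: "content", pattern: "apache" });
console.log(pred({ content: "apache streams" })); // true
console.log(pred({ content: "other" }));          // false
```

A Lucene plug-in would register the same way but delegate to a query parser; the point is only that the core never needs to know which DSL a subscriber brought along.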
> 
> Steve Blackmon
> Director, Data Sciences
> 
> 101 W. 6th Street
> Austin, Texas 78701
> cell 512.965.0451 | work 512.402.6366
> twitter @steveblackmon
> 
> 
> On 2/1/13 3:08 PM, "Jason Letourneau" <jletourneau80@gmail.com> wrote:
> 
>> that seems like a great place to go - I'm not personally familiar with
>> the DSL syntax of Lucene, but I am familiar with the project
>> 
>> Jason
>> 
>> On Fri, Feb 1, 2013 at 2:34 PM, Steve Blackmon [W2O Digital]
>> <sblackmon@w2odigital.com> wrote:
>>> What do you think about standardizing on Lucene (or at least supporting
>>> it natively) as a DSL to describe textual filters?
>>> 
>>> Steve Blackmon
>>> Director, Data Sciences
>>> 
>>> 101 W. 6th Street
>>> Austin, Texas 78701
>>> cell 512.965.0451 | work 512.402.6366
>>> twitter @steveblackmon
>>> 
>>> 
>>> On 2/1/13 1:31 PM, "Jason Letourneau" <jletourneau80@gmail.com> wrote:
>>> 
>>>> slight iteration for clarity:
>>>> 
>>>> {
>>>>   "auth_token": "token",
>>>>   "filters": [
>>>>       {
>>>>           "field": "fieldname",
>>>>           "comparison_operator": "operator",
>>>>           "value_set": [
>>>>               "val1",
>>>>               "val2"
>>>>           ]
>>>>       }
>>>>   ],
>>>>   "outputs": [
>>>>       {
>>>>           "output_type": "http",
>>>>           "method": "post",
>>>>           "url": "http.example.com:8888",
>>>>           "delivery_frequency": "60",
>>>>           "max_size": "10485760",
>>>>           "auth_type": "none",
>>>>           "username": "username",
>>>>           "password": "password"
>>>>       }
>>>>   ]
>>>> }
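To make the filter semantics concrete, here is one possible reading of the "filters" block above, sketched in JavaScript. The operator names are assumptions - the thread leaves "comparison_operator" open - and the AND/OR combination rule is likewise only a guess at the intent.

```javascript
// Evaluate one filter clause against an activity object.  The supported
// operators here ("equals", "contains") are illustrative, not a settled spec.
function matchesFilter(activity, filter) {
  const actual = activity[filter.field];
  return filter.value_set.some((value) => {
    switch (filter.comparison_operator) {
      case "equals":
        return actual === value;
      case "contains":
        return typeof actual === "string" && actual.includes(value);
      default:
        throw new Error("unknown operator: " + filter.comparison_operator);
    }
  });
}

// One plausible combination rule: AND across filter clauses,
// OR across each clause's value_set.
function matchesSubscription(activity, filters) {
  return filters.every((f) => matchesFilter(activity, f));
}

const filters = [
  { field: "verb", comparison_operator: "equals", value_set: ["post", "share"] },
];
console.log(matchesSubscription({ verb: "post" }, filters)); // true
console.log(matchesSubscription({ verb: "like" }, filters)); // false
```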
>>>> 
>>>> On Fri, Feb 1, 2013 at 12:51 PM, Jason Letourneau
>>>> <jletourneau80@gmail.com> wrote:
>>>>> So a subscription URL (the result of setting up a subscription) is,
>>>>> for all intents and purposes, representative of a set of filters.
>>>>> That subscription can be told to do a variety of things for delivery
>>>>> to the subscriber, but the identity of the subscription is rooted in
>>>>> its filters.  Posting additional filters or output configurations to
>>>>> the subscription URL affects the behavior of that subscription by
>>>>> adding more filters or more outputs (removal works the same way).
>>>>> 
>>>>> On Fri, Feb 1, 2013 at 12:17 PM, Craig McClanahan <craigmcc@gmail.com>
>>>>> wrote:
>>>>>> A couple of thoughts.
>>>>>> 
>>>>>> * On "outputs" you list "username" and "password" as possible fields.
>>>>>>   I presume that specifying these would imply using HTTP Basic auth?
>>>>>>   We might want to consider different options as well.
>>>>>> 
>>>>>> * From my (possibly myopic :-) viewpoint, the filtering and delivery
>>>>>>   decisions are different object types.  I'd like to be able to
>>>>>>   register my set of filters and get a unique identifier for them,
>>>>>>   and then separately be able to say "send the results of
>>>>>>   subscription 123 to this webhook URL every 60 minutes".
>>>>>> 
>>>>>> * Regarding query syntax, pretty much any sort of simple patterns
>>>>>>   are probably not going to be sufficient for some use cases.  Maybe
>>>>>>   we should offer that as simple defaults, but also support falling
>>>>>>   back to some sort of SQL-like syntax (i.e. what JIRA does on the
>>>>>>   advanced search).
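Craig's second bullet - registering filters separately from delivery - suggests a two-resource flow. The request and response shapes below are purely illustrative, not anything the thread has agreed on:

```text
POST /subscriptions
{ "filters": [ { "field": "verb", "comparison_operator": "operator",
                 "value_set": ["val1", "val2"] } ] }
  -> { "subscription_id": "123" }

POST /subscriptions/123/outputs
{ "output_type": "http", "method": "post", "url": "http.example.com:8888",
  "delivery_frequency": "60" }
```

Under this split, the subscription id is minted from the filters alone, and any number of delivery configurations can be attached or detached later without changing the subscription's identity.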
>>>>>> 
>>>>>> Craig
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 1, 2013 at 8:55 AM, Jason Letourneau
>>>>>> <jletourneau80@gmail.com> wrote:
>>>>>>> 
>>>>>>> Based on Steve and Craig's feedback, I've come up with something
>>>>>>> that I think can work.  Below it specifies that:
>>>>>>> 1) you can set up more than one subscription at a time
>>>>>>> 2) each subscription can have many outputs
>>>>>>> 3) each subscription can have many filters
>>>>>>> 
>>>>>>> The details of the config would do things like determine the
>>>>>>> behavior of the stream delivery (is it posted back or is the
>>>>>>> subscriber polling, for instance).  Also, all subscriptions created
>>>>>>> in this way would be accessed through a single URL.
>>>>>>> 
>>>>>>> {
>>>>>>>    "auth_token": "token",
>>>>>>>    "subscriptions": [
>>>>>>>        {
>>>>>>>            "outputs": [
>>>>>>>                {
>>>>>>>                    "output_type": "http",
>>>>>>>                    "method": "post",
>>>>>>>                    "url": "http.example.com:8888",
>>>>>>>                    "delivery_frequency": "60",
>>>>>>>                    "max_size": "10485760",
>>>>>>>                    "auth_type": "none",
>>>>>>>                    "username": "username",
>>>>>>>                    "password": "password"
>>>>>>>                }
>>>>>>>            ],
>>>>>>>            "filters": [
>>>>>>>                {
>>>>>>>                    "field": "fieldname",
>>>>>>>                    "comparison_operator": "operator",
>>>>>>>                    "value_set": [
>>>>>>>                        "val1",
>>>>>>>                        "val2"
>>>>>>>                    ]
>>>>>>>                }
>>>>>>>            ]
>>>>>>>        }
>>>>>>>    ]
>>>>>>> }
>>>>>>> 
>>>>>>> Thoughts?
>>>>>>> 
>>>>>>> Jason
>>>>>>> 
>>>>>>> On Thu, Jan 31, 2013 at 7:53 PM, Craig McClanahan
>>>>>>> <craigmcc@gmail.com> wrote:
>>>>>>>> Welcome Steve!
>>>>>>>> 
>>>>>>>> DataSift's UI to set these things up is indeed pretty cool.  I
>>>>>>>> think what we're talking about here is more what the internal REST
>>>>>>>> APIs between the UI and the back end might look like.
>>>>>>>> 
>>>>>>>> I also think we should deliberately separate the filter definition
>>>>>>>> of a "subscription" from the instructions on how the data gets
>>>>>>>> delivered.  I could see use cases for any or all of:
>>>>>>>> * Polling with a filter on oldest date of interest
>>>>>>>> * Webhook that gets updated at some specified interval
>>>>>>>> * URL to which the Streams server would periodically POST
>>>>>>>>   new activities (in case I don't have webhooks set up)
>>>>>>>> 
>>>>>>>> Separately, looking at DataSift is a reminder that we will want to
>>>>>>>> be able to filter on words inside an activity stream value like
>>>>>>>> "subject" or "content", not just on the entire value.
>>>>>>>> 
>>>>>>>> Craig
>>>>>>>> 
>>>>>>>> On Thu, Jan 31, 2013 at 4:29 PM, Jason Letourneau
>>>>>>>> <jletourneau80@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Steve - thanks for the input and congrats on your first post -
>>>>>>>>> I think what you are describing is where Craig and I are circling
>>>>>>>>> around (or something similar anyway) - the details on that POST
>>>>>>>>> request are really helpful in particular.  I'll try and put
>>>>>>>>> something together tomorrow that would be a start for the "setup"
>>>>>>>>> request (and subsequent additional configuration after the
>>>>>>>>> subscription is initialized) and post back to the group.
>>>>>>>>> 
>>>>>>>>> Jason
>>>>>>>>> 
>>>>>>>>> On Thu, Jan 31, 2013 at 7:00 PM, Steve Blackmon [W2O Digital]
>>>>>>>>> <sblackmon@w2odigital.com> wrote:
>>>>>>>>>> First post from me (btw I am Steve, stoked about this project
>>>>>>>>>> and meeting everyone eventually.)
>>>>>>>>>> 
>>>>>>>>>> Sorry if I missed the point of the thread, but I think this is
>>>>>>>>>> related and might be educational for some in the group.
>>>>>>>>>> 
>>>>>>>>>> I like the way DataSift's API lets you establish streams - you
>>>>>>>>>> POST a definition, it returns a hash, and thereafter their
>>>>>>>>>> service follows the instructions you gave it as new messages
>>>>>>>>>> meet the filter you defined.  In addition, once a stream exists,
>>>>>>>>>> you can set up listeners on that specific hash via web sockets.
>>>>>>>>>> 
>>>>>>>>>> For example, here is how you instruct DataSift to push new
>>>>>>>>>> messages meeting your criteria to a WebHooks end-point.
>>>>>>>>>> 
>>>>>>>>>> curl -X POST 'https://api.datasift.com/push/create' \
>>>>>>>>>> -d 'name=connectorhttp' \
>>>>>>>>>> -d 'hash=dce320ce31a8919784e6e85aecbd040e' \
>>>>>>>>>> -d 'output_type=http' \
>>>>>>>>>> -d 'output_params.method=post' \
>>>>>>>>>> -d 'output_params.url=http.example.com:8888' \
>>>>>>>>>> -d 'output_params.use_gzip' \
>>>>>>>>>> -d 'output_params.delivery_frequency=60' \
>>>>>>>>>> -d 'output_params.max_size=10485760' \
>>>>>>>>>> -d 'output_params.verify_ssl=false' \
>>>>>>>>>> -d 'output_params.auth.type=none' \
>>>>>>>>>> -d 'output_params.auth.username=YourHTTPServerUsername' \
>>>>>>>>>> -d 'output_params.auth.password=YourHTTPServerPassword' \
>>>>>>>>>> -H 'Auth: datasift-user:your-datasift-api-key'
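The same push/create parameters can also be assembled programmatically; a sketch using `URLSearchParams` (built into Node and browsers), with the endpoint and field names taken straight from the curl example above:

```javascript
// Build the form-encoded body for the DataSift push/create call shown above.
// URLSearchParams handles the application/x-www-form-urlencoded escaping;
// dotted keys like "output_params.method" pass through unescaped.
const params = new URLSearchParams({
  name: "connectorhttp",
  hash: "dce320ce31a8919784e6e85aecbd040e",
  output_type: "http",
  "output_params.method": "post",
  "output_params.url": "http.example.com:8888",
  "output_params.delivery_frequency": "60",
  "output_params.max_size": "10485760",
  "output_params.auth.type": "none",
});

const body = params.toString();
console.log(body.startsWith("name=connectorhttp")); // true

// The body would then be POSTed to https://api.datasift.com/push/create
// with the 'Auth: datasift-user:your-datasift-api-key' header.
```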
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Now new messages get pushed to me every 60 seconds, and I can
>>>>>>>>>> get the feed in real-time like this:
>>>>>>>>>> 
>>>>>>>>>> var websocketsUser = 'datasift-user';
>>>>>>>>>> var websocketsHost = 'websocket.datasift.com';
>>>>>>>>>> var streamHash = 'dce320ce31a8919784e6e85aecbd040e';
>>>>>>>>>> var apiKey = 'your-datasift-api-key';
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> var ws = new WebSocket('ws://' + websocketsHost + '/' +
>>>>>>>>>>     streamHash + '?username=' + websocketsUser +
>>>>>>>>>>     '&api_key=' + apiKey);
>>>>>>>>>> 
>>>>>>>>>> ws.onopen = function(evt) {
>>>>>>>>>>    // connection event
>>>>>>>>>>        $("#stream").append('open: '+evt.data+'<br/>');
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> ws.onmessage = function(evt) {
>>>>>>>>>>    // parse received message
>>>>>>>>>>        $("#stream").append('message: '+evt.data+'<br/>');
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> ws.onclose = function(evt) {
>>>>>>>>>>    // parse event
>>>>>>>>>>        $("#stream").append('close: '+evt.data+'<br/>');
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> // No event object is passed to the error callback, so no
>>>>>>>>>> // useful debugging can be done
>>>>>>>>>> ws.onerror = function() {
>>>>>>>>>>    // Some error occurred
>>>>>>>>>>        $("#stream").append('error<br/>');
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> At W2OGroup we have built utility libraries for receiving and
>>>>>>>>>> processing JSON object streams from DataSift in Storm/Kafka that
>>>>>>>>>> I'm interested in extending to work with Streams, and can
>>>>>>>>>> probably commit to the project if the community would find them
>>>>>>>>>> useful.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Steve Blackmon
>>>>>>>>>> Director, Data Sciences
>>>>>>>>>> 
>>>>>>>>>> 101 W. 6th Street
>>>>>>>>>> Austin, Texas 78701
>>>>>>>>>> cell 512.965.0451 | work 512.402.6366
>>>>>>>>>> twitter @steveblackmon
>>>>>>>>>> 
>>>>>>>>>> On 1/31/13 5:45 PM, "Craig McClanahan" <craigmcc@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> We'll probably want some way to do the equivalent of ">", ">=",
>>>>>>>>>>> "<", "<=", and "!=" in addition to the implicit "equal" that I
>>>>>>>>>>> assume you mean in this example.
>>>>>>>>>>> 
>>>>>>>>>>> Craig
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jan 31, 2013 at 3:39 PM, Jason Letourneau
>>>>>>>>>>> <jletourneau80@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I really like this - this is somewhat what I was getting at
>>>>>>>>>>>> with the JSON object, i.e. POST:
>>>>>>>>>>>> {
>>>>>>>>>>>>   "subscriptions": [
>>>>>>>>>>>>     {"activityField": "value"},
>>>>>>>>>>>>     {"activityField": "value",
>>>>>>>>>>>>      "anotherActivityField": "value"}
>>>>>>>>>>>>   ]
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Jan 31, 2013 at 4:32 PM, Craig McClanahan
>>>>>>>>>>>> <craigmcc@gmail.com> wrote:
>>>>>>>>>>>>> On Thu, Jan 31, 2013 at 12:00 PM, Jason Letourneau
>>>>>>>>>>>>> <jletourneau80@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am curious on the group's thinking about subscriptions to
>>>>>>>>>>>>>> activity streams.  As I am stubbing out the end-to-end
>>>>>>>>>>>>>> heartbeat on my proposed architecture, I've just been working
>>>>>>>>>>>>>> with URL sources as the subscription mode.  Obviously this is
>>>>>>>>>>>>>> a way over-simplification.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I know for Shindig the social graph can be used, but we don't
>>>>>>>>>>>>>> necessarily have that.  Considering the mechanism for
>>>>>>>>>>>>>> establishing a new subscription stream (defined as aggregated
>>>>>>>>>>>>>> individual activities pulled from a varying array of sources)
>>>>>>>>>>>>>> is POSTing to the Activity Streams server to establish the
>>>>>>>>>>>>>> channel (currently just subscriptions=url1,url2,url3 is the
>>>>>>>>>>>>>> over-simplified mechanism)...what would people see as a
>>>>>>>>>>>>>> reasonable way to establish subscriptions?  A list of
>>>>>>>>>>>>>> userIds?  Subjects?  How should these be represented?  I was
>>>>>>>>>>>>>> thinking of a JSON object, but does anyone have other
>>>>>>>>>>>>>> thoughts?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jason
>>>>>>>>>>>>> 
>>>>>>>>>>>>> One idea would be to take some inspiration from how JIRA lets
>>>>>>>>>>>>> you (in effect) create a WHERE clause that looks at any fields
>>>>>>>>>>>>> (in all the activities flowing through the server) that you
>>>>>>>>>>>>> want.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Example filter criteria:
>>>>>>>>>>>>> * provider.id = 'xxx' // Filter on a particular provider
>>>>>>>>>>>>> * verb = 'yyy'
>>>>>>>>>>>>> * object.type = 'blogpost'
>>>>>>>>>>>>> and you'd want to accept more than one value (effectively
>>>>>>>>>>>>> creating OR or IN type clauses).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For completeness, I'd want to be able to specify more than
>>>>>>>>>>>>> one filter expression in the same subscription.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Craig
> 
