streams-dev mailing list archives

From "Steve Blackmon [W2O Digital]" <sblack...@w2odigital.com>
Subject Re: Streams Subscriptions
Date Fri, 01 Feb 2013 19:34:46 GMT
What do you think about standardizing on Lucene query syntax (or at least
supporting it natively) as a DSL to describe textual filters?
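
To make the idea concrete, here is a minimal sketch rather than a proposal
for the final API. It assumes the filter shape from the JSON quoted below
("field" plus "value_set"). The helper names quoteValue, filterToLucene and
filtersToLucene are hypothetical, and ANDing separate filters together is an
assumption, not something the thread has agreed on.

```javascript
// Hypothetical helpers: render filter objects from the thread's JSON
// proposal as a Lucene-style query string. Not part of any Streams API.
function quoteValue(value) {
  // Inside a quoted phrase, backslash and double quote must be escaped.
  return '"' + String(value).replace(/(["\\])/g, '\\$1') + '"';
}

function filterToLucene(filter) {
  const values = filter.value_set.map(quoteValue);
  // A multi-valued filter becomes an OR group scoped to the field.
  const body = values.length === 1 ? values[0] : '(' + values.join(' OR ') + ')';
  return filter.field + ':' + body;
}

function filtersToLucene(filters) {
  // Assumption: separate filters must all match, so they are ANDed.
  return filters.map(filterToLucene).join(' AND ');
}

console.log(filtersToLucene([
  { field: 'verb', value_set: ['post', 'share'] },
  { field: 'content', value_set: ['apache streams'] }
]));
// verb:("post" OR "share") AND content:"apache streams"
```

Every value is emitted as a quoted phrase, which keeps the escaping rules
small, at the cost of not exposing wildcards or ranges.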

Steve Blackmon
Director, Data Sciences

101 W. 6th Street
Austin, Texas 78701
cell 512.965.0451 | work 512.402.6366
twitter @steveblackmon







On 2/1/13 1:31 PM, "Jason Letourneau" <jletourneau80@gmail.com> wrote:

>slight iteration for clarity:
>
>{
>    "auth_token": "token",
>    "filters": [
>        {
>            "field": "fieldname",
>            "comparison_operator": "operator",
>            "value_set": [
>                "val1",
>                "val2"
>            ]
>        }
>    ],
>    "outputs": [
>        {
>            "output_type": "http",
>            "method": "post",
>            "url": "http.example.com:8888",
>            "delivery_frequency": "60",
>            "max_size": "10485760",
>            "auth_type": "none",
>            "username": "username",
>            "password": "password"
>        }
>    ]
>}
>
>On Fri, Feb 1, 2013 at 12:51 PM, Jason Letourneau
><jletourneau80@gmail.com> wrote:
>> So a subscription URL (result of setting up a subscription) is for all
>> intents and purposes representative of a set of filters.  That
>> subscription can be told to do a variety of things for delivery to the
>> subscriber, but the identity of the subscription is rooted in its
>> filters.  Posting additional filters or additional output
>> configurations to the subscription URL changes the behavior of that
>> subscription by adding (or removing) filters or outputs.
>>
>> On Fri, Feb 1, 2013 at 12:17 PM, Craig McClanahan <craigmcc@gmail.com>
>>wrote:
>>> A couple of thoughts.
>>>
>>> * On "outputs" you list "username" and "password" as possible fields.
>>>   I presume that specifying these would imply using HTTP Basic auth?
>>>   We might want to consider different options as well.
>>>
>>> * From my (possibly myopic :-) viewpoint, the filtering and delivery
>>>   decisions are different object types.  I'd like to be able to
>>>   register my set of filters and get a unique identifier for them,
>>>   and then separately be able to say "send the results of
>>>   subscription 123 to this webhook URL every 60 minutes".
>>>
>>> * Regarding query syntax, pretty much any sort of simple patterns
>>>   are probably not going to be sufficient for some use cases.  Maybe
>>>   we should offer that as simple defaults, but also support falling
>>>   back to some sort of SQL-like syntax (i.e. what JIRA does on the
>>>   advanced search).
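
The second point above (filters and delivery as separate objects) could be
sketched roughly as follows. This is an in-memory illustration only:
createRegistry, registerFilters, addDelivery and describe are hypothetical
names, the "eq" operator is an assumed value for the "operator" placeholder,
and a real server would persist this state and authenticate callers.

```javascript
// Hypothetical in-memory registry; names and shapes are illustrative only.
function createRegistry() {
  const filterSets = new Map();   // id -> array of filter objects
  const deliveries = new Map();   // id -> array of output configs
  let nextId = 1;

  return {
    // Register a set of filters and get back a unique identifier.
    registerFilters(filters) {
      const id = String(nextId++);
      filterSets.set(id, filters);
      deliveries.set(id, []);
      return id;
    },
    // Separately attach a delivery instruction to an existing filter set.
    addDelivery(id, output) {
      if (!filterSets.has(id)) {
        throw new Error('unknown subscription: ' + id);
      }
      deliveries.get(id).push(output);
    },
    // Return the filters and outputs currently stored for an identifier.
    describe(id) {
      return { filters: filterSets.get(id), outputs: deliveries.get(id) };
    }
  };
}

const registry = createRegistry();
const subscriptionId = registry.registerFilters([
  { field: 'verb', comparison_operator: 'eq', value_set: ['post'] }
]);
registry.addDelivery(subscriptionId, {
  output_type: 'http',
  method: 'post',
  url: 'http.example.com:8888',
  delivery_frequency: '60'
});
console.log(registry.describe(subscriptionId).outputs.length); // 1
```

Keeping the two maps apart means several outputs can point at one filter
set without redefining the filters.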
>>>
>>> Craig
>>>
>>>
>>> On Fri, Feb 1, 2013 at 8:55 AM, Jason Letourneau
>>><jletourneau80@gmail.com>
>>> wrote:
>>>>
>>>> Based on Steve and Craig's feedback, I've come up with something that
>>>> I think can work.  Below it specifies that:
>>>> 1) you can set up more than one subscription at a time
>>>> 2) each subscription can have many outputs
>>>> 3) each subscription can have many filters
>>>>
>>>> The details of the config would do things like determine the behavior
>>>> of the stream delivery (is it posted back or is the subscriber polling
>>>> for instance).  Also, all subscriptions created in this way would be
>>>> accessed through a single URL.
>>>>
>>>> {
>>>>     "auth_token": "token",
>>>>     "subscriptions": [
>>>>         {
>>>>             "outputs": [
>>>>                 {
>>>>                     "output_type": "http",
>>>>                     "method": "post",
>>>>                     "url": "http.example.com:8888",
>>>>                     "delivery_frequency": "60",
>>>>                     "max_size": "10485760",
>>>>                     "auth_type": "none",
>>>>                     "username": "username",
>>>>                     "password": "password"
>>>>                 }
>>>>             ]
>>>>         },
>>>>         {
>>>>             "filters": [
>>>>                 {
>>>>                     "field": "fieldname",
>>>>                     "comparison_operator": "operator",
>>>>                     "value_set": [
>>>>                         "val1",
>>>>                         "val2"
>>>>                     ]
>>>>                 }
>>>>             ]
>>>>         }
>>>>     ]
>>>> }
>>>>
>>>> Thoughts?
>>>>
>>>> Jason
>>>>
>>>> On Thu, Jan 31, 2013 at 7:53 PM, Craig McClanahan <craigmcc@gmail.com>
>>>> wrote:
>>>> > Welcome Steve!
>>>> >
>>>> > DataSift's UI to set these things up is indeed pretty cool.  I think
>>>> > what we're talking about here is more what the internal REST APIs
>>>> > between the UI and the back end might look like.
>>>> >
>>>> > I also think we should deliberately separate the filter definition
>>>> > of a "subscription" from the instructions on how the data gets
>>>> > delivered.  I could see use cases for any or all of:
>>>> > * Polling with a filter on oldest date of interest
>>>> > * Webhook that gets updated at some specified interval
>>>> > * URL to which the Streams server would periodically POST
>>>> >   new activities (in case I don't have webhooks set up)
>>>> >
>>>> > Separately, looking at DataSift is a reminder we will want to be
>>>> > able to filter on words inside an activity stream value like
>>>> > "subject" or "content", not just on the entire value.
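
A word-level match of the kind described above might look like this. It is
a sketch under simple assumptions: the activity is a flat object,
containsWord is a hypothetical helper, and a "word" is any run of letters or
digits compared case-insensitively. A real implementation would more likely
delegate to a text analyzer.

```javascript
// Hypothetical helper: true when `word` occurs as a whole word in the field.
function containsWord(activity, field, word) {
  const text = activity[field];
  if (typeof text !== 'string') {
    return false;
  }
  // Split on anything that is not a letter or digit; compare in lower case.
  const tokens = text.toLowerCase().split(/[^\p{L}\p{N}]+/u);
  return tokens.includes(word.toLowerCase());
}

const sample = { subject: 'Release notes', content: 'Streams 0.1 is out!' };
console.log(containsWord(sample, 'content', 'streams')); // true
console.log(containsWord(sample, 'content', 'stream'));  // false
```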
>>>> >
>>>> > Craig
>>>> >
>>>> > On Thu, Jan 31, 2013 at 4:29 PM, Jason Letourneau
>>>> > <jletourneau80@gmail.com>wrote:
>>>> >
>>>> >> Hi Steve - thanks for the input and congrats on your first post - I
>>>> >> think what you are describing is where Craig and I are circling
>>>> >> around (or something similar anyways) - the details on that POST
>>>> >> request are really helpful in particular.  I'll try and put
>>>> >> something together tomorrow that would be a start for the "setup"
>>>> >> request (and subsequent additional configuration after the
>>>> >> subscription is initialized) and post back to the group.
>>>> >>
>>>> >> Jason
>>>> >>
>>>> >> On Thu, Jan 31, 2013 at 7:00 PM, Steve Blackmon [W2O Digital]
>>>> >> <sblackmon@w2odigital.com> wrote:
>>>> >> > First post from me (btw I am Steve, stoked about this project and
>>>> >> > meeting everyone eventually.)
>>>> >> >
>>>> >> > Sorry if I missed the point of the thread, but I think this is
>>>> >> > related and might be educational for some in the group.
>>>> >> >
>>>> >> > I like the way DataSift's API lets you establish streams - you
>>>> >> > POST a definition, it returns a hash, and thereafter their service
>>>> >> > follows the instructions you gave it as new messages meet the
>>>> >> > filter you defined.  In addition, once a stream exists, you can
>>>> >> > set up listeners on that specific hash via web sockets.
>>>> >> >
>>>> >> > For example, here is how you instruct DataSift to push new
>>>> >> > messages meeting your criteria to a WebHooks end-point.
>>>> >> >
>>>> >> > curl -X POST 'https://api.datasift.com/push/create' \
>>>> >> > -d 'name=connectorhttp' \
>>>> >> > -d 'hash=dce320ce31a8919784e6e85aecbd040e' \
>>>> >> > -d 'output_type=http' \
>>>> >> > -d 'output_params.method=post' \
>>>> >> > -d 'output_params.url=http.example.com:8888' \
>>>> >> > -d 'output_params.use_gzip' \
>>>> >> > -d 'output_params.delivery_frequency=60' \
>>>> >> > -d 'output_params.max_size=10485760' \
>>>> >> > -d 'output_params.verify_ssl=false' \
>>>> >> > -d 'output_params.auth.type=none' \
>>>> >> > -d 'output_params.auth.username=YourHTTPServerUsername' \
>>>> >> > -d 'output_params.auth.password=YourHTTPServerPassword' \
>>>> >> > -H 'Auth: datasift-user:your-datasift-api-key'
>>>> >> >
>>>> >> >
>>>> >> > Now new messages get pushed to me every 60 seconds, and I can
>>>> >> > get the feed in real-time like this:
>>>> >> >
>>>> >> > var websocketsUser = 'datasift-user';
>>>> >> > var websocketsHost = 'websocket.datasift.com';
>>>> >> > var streamHash = 'dce320ce31a8919784e6e85aecbd040e';
>>>> >> > var apiKey = 'your-datasift-api-key';
>>>> >> >
>>>> >> >
>>>> >> > var ws = new WebSocket('ws://' + websocketsHost + '/' + streamHash +
>>>> >> >     '?username=' + websocketsUser + '&api_key=' + apiKey);
>>>> >> >
>>>> >> > ws.onopen = function(evt) {
>>>> >> >     // connection event
>>>> >> >         $("#stream").append('open: '+evt.data+'<br/>');
>>>> >> > }
>>>> >> >
>>>> >> > ws.onmessage = function(evt) {
>>>> >> >     // parse received message
>>>> >> >         $("#stream").append('message: '+evt.data+'<br/>');
>>>> >> > }
>>>> >> >
>>>> >> > ws.onclose = function(evt) {
>>>> >> >     // parse event
>>>> >> >         $("#stream").append('close: '+evt.data+'<br/>');
>>>> >> > }
>>>> >> >
>>>> >> > // The error event carries no detail about the failure, so no
>>>> >> > // useful debugging can be done
>>>> >> > ws.onerror = function() {
>>>> >> >     // Some error occurred
>>>> >> >         $("#stream").append('error<br/>');
>>>> >> > }
>>>> >> >
>>>> >> >
>>>> >> > At W2OGroup we have built utility libraries for receiving and
>>>> >> > processing JSON object streams from DataSift in Storm/Kafka that
>>>> >> > I'm interested in extending to work with Streams, and can probably
>>>> >> > commit to the project if the community would find them useful.
>>>> >> >
>>>> >> >
>>>> >> > Steve Blackmon
>>>> >> > Director, Data Sciences
>>>> >> >
>>>> >> > 101 W. 6th Street
>>>> >> > Austin, Texas 78701
>>>> >> > cell 512.965.0451 | work 512.402.6366
>>>> >> > twitter @steveblackmon
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > On 1/31/13 5:45 PM, "Craig McClanahan" <craigmcc@gmail.com>
>>>>wrote:
>>>> >> >
>>>> >> >>We'll probably want some way to do the equivalent of ">", ">=",
>>>> >> >>"<", "<=", and "!=" in addition to the implicit "equal" that I
>>>> >> >>assume you mean in this example.
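
One way to support those comparisons is a small operator table. The sketch
below uses the field / comparison_operator / value_set shape from the later
proposal in this thread. The operator names (eq, ne, gt, gte, lt, lte), the
matchesFilter helper and the "score" field are all assumptions made for
illustration; the thread only uses the placeholder "operator".

```javascript
// Hypothetical operator table; the operator names are assumptions.
const OPERATORS = new Map([
  ['eq', (a, b) => a === b],
  ['ne', (a, b) => a !== b],
  ['gt', (a, b) => a > b],
  ['gte', (a, b) => a >= b],
  ['lt', (a, b) => a < b],
  ['lte', (a, b) => a <= b]
]);

// Returns true when the activity's field satisfies the operator for at
// least one value in value_set.
function matchesFilter(activity, filter) {
  const test = OPERATORS.get(filter.comparison_operator);
  if (!test) {
    throw new Error('unsupported operator: ' + filter.comparison_operator);
  }
  const actual = activity[filter.field];
  return filter.value_set.some((expected) => test(actual, expected));
}

const scored = { verb: 'post', score: 7 };
console.log(matchesFilter(scored, {
  field: 'score', comparison_operator: 'gte', value_set: [5]
})); // true
console.log(matchesFilter(scored, {
  field: 'verb', comparison_operator: 'ne', value_set: ['post']
})); // false
```

Rejecting unknown operators up front keeps a typo in a subscription from
silently matching nothing.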
>>>> >> >>
>>>> >> >>Craig
>>>> >> >>
>>>> >> >>On Thu, Jan 31, 2013 at 3:39 PM, Jason Letourneau
>>>> >> >><jletourneau80@gmail.com>wrote:
>>>> >> >>
>>>> >> >>> I really like this - this is somewhat what I was getting at
>>>> >> >>> with the JSON object i.e. POST:
>>>> >> >>> {
>>>> >> >>> "subscriptions":
>>>> >> >>> [{"activityField":"value"},
>>>> >> >>> {"activityField":"value",
>>>> >> >>>  "anotherActivityField":"value" }
>>>> >> >>> ]
>>>> >> >>> }
>>>> >> >>>
>>>> >> >>> On Thu, Jan 31, 2013 at 4:32 PM, Craig McClanahan
>>>> >> >>> <craigmcc@gmail.com>
>>>> >> >>> wrote:
>>>> >> >>> > On Thu, Jan 31, 2013 at 12:00 PM, Jason Letourneau
>>>> >> >>> > <jletourneau80@gmail.com>wrote:
>>>> >> >>> >
>>>> >> >>> >> I am curious on the group's thinking about subscriptions to
>>>> >> >>> >> activity streams.  As I am stubbing out the end-to-end
>>>> >> >>> >> heartbeat on my proposed architecture, I've just been working
>>>> >> >>> >> with URL sources as the subscription mode.  Obviously this is
>>>> >> >>> >> a way over-simplification.
>>>> >> >>> >>
>>>> >> >>> >> I know for Shindig the social graph can be used, but we don't
>>>> >> >>> >> necessarily have that.  Considering the mechanism for
>>>> >> >>> >> establishing a new subscription stream (defined as aggregated
>>>> >> >>> >> individual activities pulled from a varying array of sources)
>>>> >> >>> >> is POSTing to the Activity Streams server to establish the
>>>> >> >>> >> channel (currently just a subscriptions=url1,url2,url3 is the
>>>> >> >>> >> over simplified mechanism)...what would people see as a
>>>> >> >>> >> reasonable way to establish subscriptions?  List of userIds?
>>>> >> >>> >> Subjects?  How should these be represented?  I was thinking
>>>> >> >>> >> of a JSON object, but anyone have other thoughts?
>>>> >> >>> >>
>>>> >> >>> >> Jason
>>>> >> >>> >>
>>>> >> >>> >
>>>> >> >>> > One idea would be to take some inspiration from how JIRA lets
>>>> >> >>> > you (in effect) create a WHERE clause that looks at any fields
>>>> >> >>> > (in all the activities flowing through the server) that you
>>>> >> >>> > want.
>>>> >> >>> >
>>>> >> >>> > Example filter criteria
>>>> >> >>> > * provider.id = 'xxx' // Filter on a particular provider
>>>> >> >>> > * verb = 'yyy'
>>>> >> >>> > * object.type = 'blogpost'
>>>> >> >>> > and you'd want to accept more than one value (effectively
>>>> >> >>> > creating OR or IN type clauses).
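
The criteria above can be sketched as dotted paths resolved against a
nested activity, with each path accepting a list of values (the IN case)
and all paths having to match. The criteria shape ({ path: [allowed
values] }) and the helper names getPath and matchesAll are assumptions made
for this example, not an agreed format.

```javascript
// Hypothetical helper: resolve a dotted path such as "provider.id".
function getPath(obj, path) {
  return path.split('.').reduce(
    (node, key) => (node == null ? undefined : node[key]),
    obj
  );
}

// True when every path's value is one of the values allowed for it.
function matchesAll(activity, criteria) {
  return Object.entries(criteria).every(([path, allowed]) =>
    allowed.includes(getPath(activity, path))
  );
}

const blogPost = {
  provider: { id: 'xxx' },
  verb: 'post',
  object: { type: 'blogpost' }
};

console.log(matchesAll(blogPost, {
  'provider.id': ['xxx'],
  verb: ['post', 'share'],       // IN-style: either value is accepted
  'object.type': ['blogpost']
})); // true

console.log(matchesAll(blogPost, { verb: ['like'] })); // false
```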
>>>> >> >>> >
>>>> >> >>> > For completeness, I'd want to be able to specify more than one
>>>> >> >>> > filter expression in the same subscription.
>>>> >> >>> >
>>>> >> >>> > Craig
>>>> >> >>>
>>>> >> >
>>>> >>
>>>
>>>

