streams-dev mailing list archives

From Jason Letourneau <jletournea...@gmail.com>
Subject Re: Streams Subscriptions
Date Fri, 01 Feb 2013 21:08:52 GMT
that seems like a great place to go - I'm not personally familiar with
the DSL syntax of lucene, but I am familiar with the project

Jason
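
As a rough illustration of how the filter objects proposed earlier in this thread might render as a Lucene-style query string (a sketch only -- `filtersToLucene`, the AND/OR combination rules, and the field names are hypothetical, not anything Lucene or Streams defines):

```javascript
// Hypothetical sketch: render the proposed filter objects as a
// Lucene-style query string. Values in a value_set OR together;
// separate filters AND together.
function filtersToLucene(filters) {
  return filters
    .map(function (f) {
      var values = f.value_set
        .map(function (v) { return '"' + v + '"'; })
        .join(' OR ');
      return f.field + ':(' + values + ')';
    })
    .join(' AND ');
}

// e.g. two filters, over verb and object.type:
var q = filtersToLucene([
  { field: 'verb', comparison_operator: 'in', value_set: ['post', 'share'] },
  { field: 'object.type', comparison_operator: 'in', value_set: ['blogpost'] }
]);
// q: verb:("post" OR "share") AND object.type:("blogpost")
```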

On Fri, Feb 1, 2013 at 2:34 PM, Steve Blackmon [W2O Digital]
<sblackmon@w2odigital.com> wrote:
> What do you think about standardizing on lucene (or at least supporting it
> natively) as a DSL to describe textual filters?
>
> Steve Blackmon
> Director, Data Sciences
>
> 101 W. 6th Street
> Austin, Texas 78701
> cell 512.965.0451 | work 512.402.6366
> twitter @steveblackmon
>
> On 2/1/13 1:31 PM, "Jason Letourneau" <jletourneau80@gmail.com> wrote:
>
>>slight iteration for clarity:
>>
>>{
>>    "auth_token": "token",
>>    "filters": [
>>        {
>>            "field": "fieldname",
>>            "comparison_operator": "operator",
>>            "value_set": [
>>                "val1",
>>                "val2"
>>            ]
>>        }
>>    ],
>>    "outputs": [
>>        {
>>            "output_type": "http",
>>            "method": "post",
>>            "url": "http.example.com:8888",
>>            "delivery_frequency": "60",
>>            "max_size": "10485760",
>>            "auth_type": "none",
>>            "username": "username",
>>            "password": "password"
>>        }
>>    ]
>>}
>>
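To make the filter semantics above concrete, here is one possible reading of a single filter object, sketched in JavaScript (the operator names and the any-value-in-the-set-matches rule are assumptions for illustration, not part of the proposal):

```javascript
// Hypothetical sketch: evaluate one filter object against an activity.
// An activity matches if ANY value in value_set satisfies the operator.
function matchesFilter(activity, filter) {
  var actual = activity[filter.field];
  return filter.value_set.some(function (v) {
    switch (filter.comparison_operator) {
      case 'eq':  return actual === v;
      case 'neq': return actual !== v;
      case 'gt':  return actual > v;
      case 'gte': return actual >= v;
      case 'lt':  return actual < v;
      case 'lte': return actual <= v;
      default: throw new Error('unknown operator: ' + filter.comparison_operator);
    }
  });
}

// e.g. matchesFilter({ verb: 'post' },
//        { field: 'verb', comparison_operator: 'eq',
//          value_set: ['post', 'share'] })  -> true
```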
>>On Fri, Feb 1, 2013 at 12:51 PM, Jason Letourneau
>><jletourneau80@gmail.com> wrote:
>>> So a subscription URL (result of setting up a subscription) is for all
>>> intents and purposes representative of a set of filters.  That
>>> subscription can be told to do a variety of things for delivery to the
>>> subscriber, but the identity of the subscription is rooted in its
>>> filters.  Posting additional filters to the subscription URL or
>>> additional output configurations affect the behavior of that
>>> subscription by either adding more filters or more outputs (removal as
>>> well).
>>>
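The add/remove behavior described above might look something like this on the server side (a sketch under assumptions -- `applyUpdate`, the `add_*`/`remove_*` payload keys, and removal-by-field are all invented for illustration):

```javascript
// Hypothetical sketch: a subscription's identity is its filters;
// POSTing an update to its URL adds or removes filters and outputs
// without replacing the subscription itself.
function applyUpdate(subscription, update) {
  var next = {
    filters: subscription.filters.slice(),  // copy; leave original intact
    outputs: subscription.outputs.slice()
  };
  (update.add_filters || []).forEach(function (f) { next.filters.push(f); });
  (update.add_outputs || []).forEach(function (o) { next.outputs.push(o); });
  // Naive removal: drop any filter whose field appears in remove_filters.
  next.filters = next.filters.filter(function (f) {
    return !(update.remove_filters || []).some(function (r) {
      return r.field === f.field;
    });
  });
  return next;
}
```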
>>> On Fri, Feb 1, 2013 at 12:17 PM, Craig McClanahan <craigmcc@gmail.com>
>>>wrote:
>>>> A couple of thoughts.
>>>>
>>>> * On "outputs" you list "username" and "password" as possible fields.
>>>>   I presume that specifying these would imply using HTTP Basic auth?
>>>>   We might want to consider different options as well.
>>>>
>>>> * From my (possibly myopic :-) viewpoint, the filtering and delivery
>>>>   decisions are different object types.  I'd like to be able to
>>>>   register my set of filters and get a unique identifier for them,
>>>>   and then separately be able to say "send the results of
>>>>   subscription 123 to this webhook URL every 60 minutes".
>>>>
>>>> * Regarding query syntax, pretty much any sort of simple patterns
>>>>   are probably not going to be sufficient for some use cases.  Maybe
>>>>   we should offer those as simple defaults, but also support falling
>>>>   back to some sort of SQL-like syntax (i.e. what JIRA does on the
>>>>   advanced search).
>>>>
>>>> Craig
>>>>
>>>>
>>>> On Fri, Feb 1, 2013 at 8:55 AM, Jason Letourneau
>>>><jletourneau80@gmail.com>
>>>> wrote:
>>>>>
>>>>> Based on Steve and Craig's feedback, I've come up with something that
>>>>> I think can work.  Below it specifies that:
>>>>> 1) you can set up more than one subscription at a time
>>>>> 2) each subscription can have many outputs
>>>>> 3) each subscription can have many filters
>>>>>
>>>>> The details of the config would do things like determine the behavior
>>>>> of the stream delivery (is it posted back or is the subscriber polling
>>>>> for instance).  Also, all subscriptions created in this way would be
>>>>> accessed through a single URL.
>>>>>
>>>>> {
>>>>>     "auth_token": "token",
>>>>>     "subscriptions": [
>>>>>         {
>>>>>             "outputs": [
>>>>>                 {
>>>>>                     "output_type": "http",
>>>>>                     "method": "post",
>>>>>                     "url": "http.example.com:8888",
>>>>>                     "delivery_frequency": "60",
>>>>>                     "max_size": "10485760",
>>>>>                     "auth_type": "none",
>>>>>                     "username": "username",
>>>>>                     "password": "password"
>>>>>                 }
>>>>>             ]
>>>>>         },
>>>>>         {
>>>>>             "filters": [
>>>>>                 {
>>>>>                     "field": "fieldname",
>>>>>                     "comparison_operator": "operator",
>>>>>                     "value_set": [
>>>>>                         "val1",
>>>>>                         "val2"
>>>>>                     ]
>>>>>                 }
>>>>>             ]
>>>>>         }
>>>>>     ]
>>>>> }
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Jason
>>>>>
>>>>> On Thu, Jan 31, 2013 at 7:53 PM, Craig McClanahan <craigmcc@gmail.com>
>>>>> wrote:
>>>>> > Welcome Steve!
>>>>> >
>>>>> > DataSift's UI to set these things up is indeed pretty cool.  I think
>>>>> > what we're talking about here is more what the internal REST APIs
>>>>> > between the UI and the back end might look like.
>>>>> >
>>>>> > I also think we should deliberately separate the filter definition
>>>>> > of a "subscription" from the instructions on how the data gets
>>>>> > delivered.  I could see use cases for any or all of:
>>>>> > * Polling with a filter on oldest date of interest
>>>>> > * Webhook that gets updated at some specified interval
>>>>> > * URL to which the Streams server would periodically POST
>>>>> >   new activities (in case I don't have webhooks set up)
>>>>> >
>>>>> > Separately, looking at DataSift is a reminder we will want to be
>>>>> > able to filter on words inside an activity stream value like
>>>>> > "subject" or "content", not just on the entire value.
>>>>> >
>>>>> > Craig
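Craig's point about matching words inside a value, rather than the whole value, might be sketched like this (the naive split-on-non-word-characters tokenizer is an assumption; a real implementation would more likely lean on an analyzer such as Lucene's):

```javascript
// Hypothetical sketch: word-level matching inside a field value,
// using a naive tokenizer (lowercase, split on non-word characters).
function containsWord(value, term) {
  return value
    .toLowerCase()
    .split(/\W+/)
    .indexOf(term.toLowerCase()) !== -1;
}

// containsWord('Apache Streams ships activities', 'streams') -> true
// containsWord('Apache Streams ships activities', 'stream')  -> false (whole words only)
```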
>>>>> >
>>>>> > On Thu, Jan 31, 2013 at 4:29 PM, Jason Letourneau
>>>>> > <jletourneau80@gmail.com>wrote:
>>>>> >
>>>>> >> Hi Steve - thanks for the input and congrats on your first post - I
>>>>> >> think what you are describing is where Craig and I are circling around
>>>>> >> (or something similar anyways) - the details on that POST request are
>>>>> >> really helpful in particular.  I'll try and put something together
>>>>> >> tomorrow that would be a start for the "setup" request (and subsequent
>>>>> >> additional configuration after the subscription is initialized) and
>>>>> >> post back to the group.
>>>>> >>
>>>>> >> Jason
>>>>> >>
>>>>> >> On Thu, Jan 31, 2013 at 7:00 PM, Steve Blackmon [W2O Digital]
>>>>> >> <sblackmon@w2odigital.com> wrote:
>>>>> >> > First post from me (btw I am Steve, stoked about this project and
>>>>> >> > meeting everyone eventually.)
>>>>> >> >
>>>>> >> > Sorry if I missed the point of the thread, but I think this is
>>>>> >> > related and might be educational for some in the group.
>>>>> >> >
>>>>> >> > I like the way DataSift's API lets you establish streams - you POST
>>>>> >> > a definition, it returns a hash, and thereafter their service
>>>>> >> > follows the instructions you gave it as new messages meet the
>>>>> >> > filter you defined.  In addition, once a stream exists, you can set
>>>>> >> > up listeners on that specific hash via web sockets.
>>>>> >> >
>>>>> >> > For example, here is how you instruct DataSift to push new messages
>>>>> >> > meeting your criteria to a WebHooks end-point.
>>>>> >> >
>>>>> >> > curl -X POST 'https://api.datasift.com/push/create' \
>>>>> >> > -d 'name=connectorhttp' \
>>>>> >> > -d 'hash=dce320ce31a8919784e6e85aecbd040e' \
>>>>> >> > -d 'output_type=http' \
>>>>> >> > -d 'output_params.method=post' \
>>>>> >> > -d 'output_params.url=http.example.com:8888' \
>>>>> >> > -d 'output_params.use_gzip' \
>>>>> >> > -d 'output_params.delivery_frequency=60' \
>>>>> >> > -d 'output_params.max_size=10485760' \
>>>>> >> > -d 'output_params.verify_ssl=false' \
>>>>> >> > -d 'output_params.auth.type=none' \
>>>>> >> > -d 'output_params.auth.username=YourHTTPServerUsername' \
>>>>> >> > -d 'output_params.auth.password=YourHTTPServerPassword' \
>>>>> >> > -H 'Auth: datasift-user:your-datasift-api-key'
>>>>> >> >
>>>>> >> >
>>>>> >> > Now new messages get pushed to me every 60 seconds, and I can get
>>>>> >> > the feed in real-time like this:
>>>>> >> >
>>>>> >> > var websocketsUser = 'datasift-user';
>>>>> >> > var websocketsHost = 'websocket.datasift.com';
>>>>> >> > var streamHash = 'dce320ce31a8919784e6e85aecbd040e';
>>>>> >> > var apiKey = 'your-datasift-api-key';
>>>>> >> >
>>>>> >> >
>>>>> >> > var ws = new WebSocket('ws://'+websocketsHost+'/'+streamHash+
>>>>> >> >     '?username='+websocketsUser+'&api_key='+apiKey);
>>>>> >> >
>>>>> >> > ws.onopen = function(evt) {
>>>>> >> >     // connection event
>>>>> >> >         $("#stream").append('open: '+evt.data+'<br/>');
>>>>> >> > }
>>>>> >> >
>>>>> >> > ws.onmessage = function(evt) {
>>>>> >> >     // parse received message
>>>>> >> >         $("#stream").append('message: '+evt.data+'<br/>');
>>>>> >> > }
>>>>> >> >
>>>>> >> > ws.onclose = function(evt) {
>>>>> >> >     // parse event
>>>>> >> >         $("#stream").append('close: '+evt.data+'<br/>');
>>>>> >> > }
>>>>> >> >
>>>>> >> > // No event object is passed to the error callback, so no useful
>>>>> >> > // debugging can be done
>>>>> >> > ws.onerror = function() {
>>>>> >> >     // Some error occurred
>>>>> >> >     $("#stream").append('error<br/>');
>>>>> >> > }
>>>>> >> >
>>>>> >> >
>>>>> >> > At W2OGroup we have built utility libraries for receiving and
>>>>> >> > processing JSON object streams from DataSift in Storm/Kafka that
>>>>> >> > I'm interested in extending to work with Streams, and can probably
>>>>> >> > commit them to the project if the community would find them useful.
>>>>> >> >
>>>>> >> >
>>>>> >> > Steve Blackmon
>>>>> >> > Director, Data Sciences
>>>>> >> >
>>>>> >> > 101 W. 6th Street
>>>>> >> > Austin, Texas 78701
>>>>> >> > cell 512.965.0451 | work 512.402.6366
>>>>> >> > twitter @steveblackmon
>>>>> >> >
>>>>> >> > On 1/31/13 5:45 PM, "Craig McClanahan" <craigmcc@gmail.com>
>>>>>wrote:
>>>>> >> >
>>>>> >> >>We'll probably want some way to do the equivalent of ">", ">=",
>>>>> >> >>"<", "<=", and "!=" in addition to the implicit "equal" that I
>>>>> >> >>assume you mean in this example.
>>>>> >> >>
>>>>> >> >>Craig
>>>>> >> >>
>>>>> >> >>On Thu, Jan 31, 2013 at 3:39 PM, Jason Letourneau
>>>>> >> >><jletourneau80@gmail.com>wrote:
>>>>> >> >>
>>>>> >> >>> I really like this - this is somewhat what I was getting at with
>>>>> >> >>> the JSON object, i.e. POST:
>>>>> >> >>> {
>>>>> >> >>> "subscriptions":
>>>>> >> >>> [{"activityField":"value"},
>>>>> >> >>> {"activityField":"value",
>>>>> >> >>>  "anotherActivityField":"value" }
>>>>> >> >>> ]
>>>>> >> >>> }
>>>>> >> >>>
>>>>> >> >>> On Thu, Jan 31, 2013 at 4:32 PM, Craig McClanahan
>>>>> >> >>> <craigmcc@gmail.com>
>>>>> >> >>> wrote:
>>>>> >> >>> > On Thu, Jan 31, 2013 at 12:00 PM, Jason Letourneau
>>>>> >> >>> > <jletourneau80@gmail.com>wrote:
>>>>> >> >>> >
>>>>> >> >>> >> I am curious on the group's thinking about subscriptions to
>>>>> >> >>> >> activity streams.  As I am stubbing out the end-to-end
>>>>> >> >>> >> heartbeat on my proposed architecture, I've just been working
>>>>> >> >>> >> with URL sources as the subscription mode.  Obviously this is
>>>>> >> >>> >> way over-simplified.
>>>>> >> >>> >>
>>>>> >> >>> >> I know for Shindig the social graph can be used, but we don't
>>>>> >> >>> >> necessarily have that.  Considering the mechanism for
>>>>> >> >>> >> establishing a new subscription stream (defined as aggregated
>>>>> >> >>> >> individual activities pulled from a varying array of sources)
>>>>> >> >>> >> is POSTing to the Activity Streams server to establish the
>>>>> >> >>> >> channel (currently subscriptions=url1,url2,url3 is the
>>>>> >> >>> >> over-simplified mechanism)...what would people see as a
>>>>> >> >>> >> reasonable way to establish subscriptions?  A list of userIds?
>>>>> >> >>> >> Subjects?  How should these be represented?  I was thinking of
>>>>> >> >>> >> a JSON object, but does anyone have other thoughts?
>>>>> >> >>> >>
>>>>> >> >>> >> Jason
>>>>> >> >>> >>
>>>>> >> >>> >
>>>>> >> >>> > One idea would be to take some inspiration from how JIRA lets
>>>>> >> >>> > you (in effect) create a WHERE clause that looks at any fields
>>>>> >> >>> > (in all the activities flowing through the server) that you want.
>>>>> >> >>> >
>>>>> >> >>> > Example filter criteria:
>>>>> >> >>> > * provider.id = 'xxx' // Filter on a particular provider
>>>>> >> >>> > * verb = 'yyy'
>>>>> >> >>> > * object.type = 'blogpost'
>>>>> >> >>> > and you'd want to accept more than one value (effectively
>>>>> >> >>> > creating OR or IN type clauses).
>>>>> >> >>> >
>>>>> >> >>> > For completeness, I'd want to be able to specify more than one
>>>>> >> >>> > filter expression in the same subscription.
>>>>> >> >>> >
>>>>> >> >>> > Craig
>>>>> >> >>>
>>>>> >> >
>>>>> >>
>>>>
>>>>
>
