Return-Path: X-Original-To: apmail-streams-dev-archive@minotaur.apache.org Delivered-To: apmail-streams-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 805CFE7E2 for ; Fri, 1 Feb 2013 21:09:19 +0000 (UTC) Received: (qmail 31923 invoked by uid 500); 1 Feb 2013 21:09:19 -0000 Delivered-To: apmail-streams-dev-archive@streams.apache.org Received: (qmail 31888 invoked by uid 500); 1 Feb 2013 21:09:19 -0000 Mailing-List: contact dev-help@streams.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@streams.incubator.apache.org Delivered-To: mailing list dev@streams.incubator.apache.org Received: (qmail 31879 invoked by uid 99); 1 Feb 2013 21:09:19 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 01 Feb 2013 21:09:19 +0000 X-ASF-Spam-Status: No, hits=-0.5 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jletourneau80@gmail.com designates 74.125.82.41 as permitted sender) Received: from [74.125.82.41] (HELO mail-wg0-f41.google.com) (74.125.82.41) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 01 Feb 2013 21:09:12 +0000 Received: by mail-wg0-f41.google.com with SMTP id ds1so1619095wgb.4 for ; Fri, 01 Feb 2013 13:08:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=9B3lhYiTPJFRUJyjyWfWxcBMdugwITs2RfzG0AfLu1I=; b=HnHlzKE3uR+Jk5oNXpb/swklNMs10I5NGRk/HaGr4cK1TOxGKn73uPeAPk8eW/rUgP 5jwsFYXMaRyVHej+kWZhmD21xMoeqWF3KmtUBc/sAgsunKBye0J8wrpGOYD16q45C4Lk dQXHpoMcuWr6RgVrdLEz8pWqC7FpVQaGEEa7wlXRPnB7U9R2tv+nq2mxZht8IQ+k+wsH AZ2SyCYlMJ/iqRRMDQ2QTegC3BUw/YOBVHVehNS2vIkBGa9iDacZEpr0s/u+Vwa4J9Tq bN9r4TeoPUogXBiWUEc2Hye2+d5fzy+oN0AOESyG62HvYtmU3HF+2a4dk6goU0V0bgCV T9YQ== MIME-Version: 1.0 X-Received: by 10.194.76.7 with SMTP id g7mr24212655wjw.50.1359752932765; Fri, 01 Feb 2013 13:08:52 -0800 (PST) Received: by 10.194.57.164 with HTTP; Fri, 1 Feb 2013 13:08:52 -0800 (PST) In-Reply-To: References: Date: Fri, 1 Feb 2013 16:08:52 -0500 Message-ID: Subject: Re: Streams Subscriptions From: Jason Letourneau To: dev@streams.incubator.apache.org Cc: Craig McClanahan Content-Type: text/plain; charset=ISO-8859-1 X-Virus-Checked: Checked by ClamAV on apache.org that seems like a great place to go - I'm not personally familiar with the DSL syntax of lucene, but I am familiar with the project Jason On Fri, Feb 1, 2013 at 2:34 PM, Steve Blackmon [W2O Digital] wrote: > What do you think about standardizing on lucene (or at least supporting it > natively) as a DSL to describe textual filters? > > Steve Blackmon > Director, Data Sciences > > 101 W. 6th Street > Austin, Texas 78701 > cell 512.965.0451 | work 512.402.6366 > twitter @steveblackmon > > > > > > > > On 2/1/13 1:31 PM, "Jason Letourneau" wrote: > >>slight iteration for clarity: >> >>{ >> "auth_token": "token", >> "filters": [ >> { >> "field": "fieldname", >> "comparison_operator": "operator", >> "value_set": [ >> "val1", >> "val2" >> ] >> } >> ], >> "outputs": [ >> { >> "output_type": "http", >> "method": "post", >> "url": "http.example.com:8888", >> "delivery_frequency": "60", >> "max_size": "10485760", >> "auth_type": "none", >> "username": "username", >> "password": "password" >> } >> ] >>} >> >>On Fri, Feb 1, 2013 at 12:51 PM, Jason Letourneau >> wrote: >>> So a subscription URL (result of setting up a subscription) is for all >>> intents and purposes representative of a set of filters. That >>> subscription can be told to do a variety of things for delivery to the >>> subscriber, but the identity of the subscription is rooted in its >>> filters. Posting additional filters to the subscription URL or >>> additional output configurations affect the behavior of that >>> subscription by either adding more filters or more outputs (removal as >>> well). >>> >>> On Fri, Feb 1, 2013 at 12:17 PM, Craig McClanahan >>>wrote: >>>> A couple of thoughts. >>>> >>>> * On "outputs" you list "username" and "password" as possible fields. >>>> I presume that specifying these would imply using HTTP Basic auth? >>>> We might want to consider different options as well. >>>> >>>> * From my (possibly myopic :-) viewpoint, the filtering and delivery >>>> decisions are different object types. I'd like to be able to >>>>register >>>> my set of filters and get a unique identifier for them, and then >>>> separately be able to say "send the results of subscription 123 >>>> to this webhook URL every 60 minutes". >>>> >>>> * Regarding query syntax, pretty much any sort of simple patterns >>>> are probably not going to be sufficient for some use cases. Maybe >>>> we should offer that as simple defaults, but also support falling >>>>back >>>> to some sort of SQL-like syntax (i.e. what JIRA does on the >>>> advanced search). >>>> >>>> Craig >>>> >>>> >>>> On Fri, Feb 1, 2013 at 8:55 AM, Jason Letourneau >>>> >>>> wrote: >>>>> >>>>> Based on Steve and Craig's feedback, I've come up with something that >>>>> I think can work. Below it specifies that: >>>>> 1) you can set up more than one subscription at a time >>>>> 2) each subscription can have many outputs >>>>> 3) each subscription can have many filters >>>>> >>>>> The details of the config would do things like determine the behavior >>>>> of the stream delivery (is it posted back or is the subscriber polling >>>>> for instance). Also, all subscriptions created in this way would be >>>>> accessed through a single URL. >>>>> >>>>> { >>>>> "auth_token": "token", >>>>> "subscriptions": [ >>>>> { >>>>> "outputs": [ >>>>> { >>>>> "output_type": "http", >>>>> "method": "post", >>>>> "url": "http.example.com:8888", >>>>> "delivery_frequency": "60", >>>>> "max_size": "10485760", >>>>> "auth_type": "none", >>>>> "username": "username", >>>>> "password": "password" >>>>> } >>>>> ] >>>>> }, >>>>> { >>>>> "filters": [ >>>>> { >>>>> "field": "fieldname", >>>>> "comparison_operator": "operator", >>>>> "value_set": [ >>>>> "val1", >>>>> "val2" >>>>> ] >>>>> } >>>>> ] >>>>> } >>>>> ] >>>>> } >>>>> >>>>> Thoughts? >>>>> >>>>> Jason >>>>> >>>>> On Thu, Jan 31, 2013 at 7:53 PM, Craig McClanahan >>>>> wrote: >>>>> > Welcome Steve! >>>>> > >>>>> > DataSift's UI to set these things up is indeed pretty cool. I think >>>>> > what >>>>> > we're talking about here is more what the internal REST APIs >>>>>between the >>>>> > UI >>>>> > and the back end might look like. >>>>> > >>>>> > I also think we should deliberately separate the filter definition >>>>>of a >>>>> > "subscription" from the instructions on how the data gets >>>>>delivered. I >>>>> > could see use cases for any or all of: >>>>> > * Polling with a filter on oldest date of interest >>>>> > * Webhook that gets updated at some specified interval >>>>> > * URL to which the Streams server would periodically POST >>>>> > new activities (in case I don't have webhooks set up) >>>>> > >>>>> > Separately, looking at DataSift is a reminder we will want to be >>>>>able to >>>>> > filter on words inside an activity stream value like "subject" or >>>>> > "content", not just on the entire value. >>>>> > >>>>> > Craig >>>>> > >>>>> > On Thu, Jan 31, 2013 at 4:29 PM, Jason Letourneau >>>>> > wrote: >>>>> > >>>>> >> Hi Steve - thanks for the input and congrats on your first post - I >>>>> >> think what you are describing is where Craig and I are circling >>>>>around >>>>> >> (or something similar anyways) - the details on that POST request >>>>>are >>>>> >> really helpful in particular. I'll try and put something together >>>>> >> tomorrow that would be a start for the "setup" request (and >>>>>subsequent >>>>> >> additional configuration after the subscription is initialized) and >>>>> >> post back to the group. >>>>> >> >>>>> >> Jason >>>>> >> >>>>> >> On Thu, Jan 31, 2013 at 7:00 PM, Steve Blackmon [W2O Digital] >>>>> >> wrote: >>>>> >> > First post from me (btw I am Steve, stoked about this project and >>>>> >> > meeting >>>>> >> > everyone eventually.) >>>>> >> > >>>>> >> > Sorry if I missed the point of the thread, but I think this is >>>>> >> > related >>>>> >> and >>>>> >> > might be educational for some in the group. >>>>> >> > >>>>> >> > I like the way DataSift's API lets you establish streams - you >>>>>POST a >>>>> >> > definition, it returns a hash, and thereafter their service >>>>>follows >>>>> >> > the >>>>> >> > instructions you gave it as new messages meet the filter you >>>>>defined. >>>>> >> > In >>>>> >> > addition, once a stream exists, then you can set up listeners on >>>>>that >>>>> >> > specific hash via web sockets with the hash. >>>>> >> > >>>>> >> > For example, here is how you instruct DataSift to push new >>>>>messages >>>>> >> > meeting your criteria to a WebHooks end-point. >>>>> >> > >>>>> >> > curl -X POST 'https://api.datasift.com/push/create' \ >>>>> >> > -d 'name=connectorhttp' \ >>>>> >> > -d 'hash=dce320ce31a8919784e6e85aecbd040e' \ >>>>> >> > -d 'output_type=http' \ >>>>> >> > -d 'output_params.method=post' \ >>>>> >> > -d 'output_params.url=http.example.com:8888' \ >>>>> >> > -d 'output_params.use_gzip' \ >>>>> >> > -d 'output_params.delivery_frequency=60' \ >>>>> >> > -d 'output_params.max_size=10485760' \ >>>>> >> > -d 'output_params.verify_ssl=false' \ >>>>> >> > -d 'output_params.auth.type=none' \ >>>>> >> > -d 'output_params.auth.username=YourHTTPServerUsername' \ >>>>> >> > -d 'output_params.auth.password=YourHTTPServerPassword' \ >>>>> >> > -H 'Auth: datasift-user:your-datasift-api-key >>>>> >> > >>>>> >> > >>>>> >> > Now new messages get pushed to me every 60 seconds, and I can >>>>>get the >>>>> >> feed >>>>> >> > in real-time like this: >>>>> >> > >>>>> >> > var websocketsUser = 'datasift-user'; >>>>> >> > var websocketsHost = 'websocket.datasift.com'; >>>>> >> > var streamHash = 'dce320ce31a8919784e6e85aecbd040e'; >>>>> >> > var apiKey = 'your-datasift-api-key'; >>>>> >> > >>>>> >> > >>>>> >> > var ws = new >>>>> >> > >>>>> >> >>>>> >> >>>>>WebSocket('ws://'+websocketsHost+'/'+streamHash+'?username='+websockets >>>>>User >>>>> >> > +'&api_key='+apiKey); >>>>> >> > >>>>> >> > ws.onopen = function(evt) { >>>>> >> > // connection event >>>>> >> > $("#stream").append('open: '+evt.data+'
'); >>>>> >> > } >>>>> >> > >>>>> >> > ws.onmessage = function(evt) { >>>>> >> > // parse received message >>>>> >> > $("#stream").append('message: '+evt.data+'
'); >>>>> >> > } >>>>> >> > >>>>> >> > ws.onclose = function(evt) { >>>>> >> > // parse event >>>>> >> > $("#stream").append('close: '+evt.data+'
'); >>>>> >> > } >>>>> >> > >>>>> >> > // No event object is passed to the event callback, so no useful >>>>> >> debugging >>>>> >> > can be done >>>>> >> > ws.onerror = function() { >>>>> >> > // Some error occurred >>>>> >> > $("#stream").append('error: '+evt.data+'
'); >>>>> >> > } >>>>> >> > >>>>> >> > >>>>> >> > At W2OGroup we have built utility libraries for receiving and >>>>> >> > processing >>>>> >> > Json object streams from data sift in Storm/Kafka that I'm >>>>>interested >>>>> >> > in >>>>> >> > extending to work with Streams, and can probably commit to the >>>>> >> > project if >>>>> >> > the community would find them useful. >>>>> >> > >>>>> >> > >>>>> >> > Steve Blackmon >>>>> >> > Director, Data Sciences >>>>> >> > >>>>> >> > 101 W. 6th Street >>>>> >> > Austin, Texas 78701 >>>>> >> > cell 512.965.0451 | work 512.402.6366 >>>>> >> > twitter @steveblackmon >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > On 1/31/13 5:45 PM, "Craig McClanahan" >>>>>wrote: >>>>> >> > >>>>> >> >>We'll probably want some way to do the equivalent of ">", ">=", >>>>>"<", >>>>> >> "<=", >>>>> >> >>and "!=" in addition to the implicit "equal" that I assume you >>>>>mean >>>>> >> >> in >>>>> >> >>this >>>>> >> >>example. >>>>> >> >> >>>>> >> >>Craig >>>>> >> >> >>>>> >> >>On Thu, Jan 31, 2013 at 3:39 PM, Jason Letourneau >>>>> >> >>wrote: >>>>> >> >> >>>>> >> >>> I really like this - this is somewhat what I was getting at >>>>>with >>>>> >> >>> the >>>>> >> >>> JSON object i.e. POST: >>>>> >> >>> { >>>>> >> >>> "subscriptions": >>>>> >> >>> [{"activityField":"value"}, >>>>> >> >>> {"activityField":"value", >>>>> >> >>> "anotherActivityField":"value" } >>>>> >> >>> ] >>>>> >> >>> } >>>>> >> >>> >>>>> >> >>> On Thu, Jan 31, 2013 at 4:32 PM, Craig McClanahan >>>>> >> >>> >>>>> >> >>> wrote: >>>>> >> >>> > On Thu, Jan 31, 2013 at 12:00 PM, Jason Letourneau >>>>> >> >>> > wrote: >>>>> >> >>> > >>>>> >> >>> >> I am curious on the group's thinking about subscriptions to >>>>> >> >>> >> activity >>>>> >> >>> >> streams. As I am stubbing out the end-to-end heartbeat on >>>>>my >>>>> >> >>>proposed >>>>> >> >>> >> architecture, I've just been working with URL sources as the >>>>> >> >>> >> subscription mode. Obviously this is a way >>>>>over-simplification. >>>>> >> >>> >> >>>>> >> >>> >> I know for shindig the social graph can be used, but we >>>>>don't >>>>> >> >>> >> necessarily have that. Considering the mechanism for >>>>> >> >>> >> establishing a >>>>> >> >>> >> new subscription stream (defined as aggregated individual >>>>> >> >>> >> activities >>>>> >> >>> >> pulled from a varying array of sources) is POSTing to the >>>>> >> >>> >> Activity >>>>> >> >>> >> Streams server to establish the channel (currently just a >>>>> >> >>> >> subscriptions=url1,url2,url3 is the over simplified >>>>> >> mechanism)...what >>>>> >> >>> >> would people see as a reasonable way to establish >>>>>subscriptions? >>>>> >> >>>List >>>>> >> >>> >> of userIds? Subjects? How should these be represented? I >>>>>was >>>>> >> >>> >> thinking of a JSON object, but any one have other thoughts? >>>>> >> >>> >> >>>>> >> >>> >> Jason >>>>> >> >>> >> >>>>> >> >>> > >>>>> >> >>> > One idea would be take some inspiration from how JIRA lets >>>>>you >>>>> >> >>> > (in >>>>> >> >>> effect) >>>>> >> >>> > create a WHERE clause that looks at any fields (in all the >>>>> >> >>> > activities >>>>> >> >>> > flowing through the server) that you want. >>>>> >> >>> > >>>>> >> >>> > Example filter criteria >>>>> >> >>> > * provider.id = 'xxx' // Filter on a particular provider >>>>> >> >>> > * verb = 'yyy' >>>>> >> >>> > * object.type = 'blogpost' >>>>> >> >>> > and you'd want to accept more than one value (effectively >>>>> >> >>> > creating OR >>>>> >> >>>or >>>>> >> >>> IN >>>>> >> >>> > type clauses). >>>>> >> >>> > >>>>> >> >>> > For completeness, I'd want to be able to specify more than >>>>>one >>>>> >> >>> > filter >>>>> >> >>> > expression in the same subscription. >>>>> >> >>> > >>>>> >> >>> > Craig >>>>> >> >>> >>>>> >> > >>>>> >> >>>> >>>> >