Subject: Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data
From: Jerry Lam <chilinglam@gmail.com>
To: Henri Dubois-Ferriere <henridf@gmail.com>
Cc: Herman van Hövell tot Westerflier <hvanhovell@questtec.nl>, Spark dev list <dev@spark.apache.org>
Date: Tue, 1 Mar 2016 11:48:39 -0500

Hi Henri,

Finally, there is a good reason for me to use Flink! Thanks for sharing this information. This is exactly the solution I'm looking for, especially since the ticket references a paper I was reading a week ago. It would be nice if Flink added SQL support, because that would give business analysts (and traders) a way to express these queries.

Best Regards,

Jerry

On Tue, Mar 1, 2016 at 11:34 AM, Henri Dubois-Ferriere <henridf@gmail.com> wrote:
fwiw Apache Flink just added CEP. Queries are constructed programmatically rather than in SQL, but the underlying functionality is similar.

https://issues.apache.org/jira/browse/FLINK-3215

On 1 March 2016 at 08:19, Jerry Lam <chilinglam@gmail.com> wrote:
Hi Herman,

Thank you for your reply!
This functionality usually finds its place in financial services, which use CEP (complex event processing) for correlation and pattern matching. Many commercial products have it, including Oracle and Teradata Aster Data MR Analytics. I do agree the syntax is a bit awkward, but once you understand it, it is actually very compact for expressing something very complex. Esper has this feature partially implemented (http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html).

I found the Teradata Analytics documentation describes its usage best. For example (note nPath is similar to match_recognize):

SELECT last_pageid, MAX( count_page80 )
 FROM nPath(
   ON ( SELECT * FROM clicks WHERE category >= 0 )
   PARTITION BY sessionid
   ORDER BY ts
   PATTERN ( 'A.(B|C)*' )
   MODE ( OVERLAPPING )
   SYMBOLS ( pageid = 50 AS A,
             pageid = 80 AS B,
             pageid <> 80 AND category IN (9,10) AS C )
   RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
            COUNT ( * OF B ) AS count_page80,
            COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
 )
 WHERE count_any >= 5
 GROUP BY last_pageid
 ORDER BY MAX( count_page80 )

The above means:
Find user click-paths starting at pageid 50 and passing exclusively through either pageid 80 or pages in category 9 or category 10. Find the pageid of the last page in the path and count the number of times page 80 was visited. Report the maximum count for each last page, and sort the output by the latter. Restrict to paths containing at least 5 pages. Ignore pages in the sequence with category < 0.

If this query were written in pure SQL (if that is possible at all), it would require several self-joins. The interesting thing about this feature is that it integrates SQL+Streaming+ML in one (and perhaps graph too).
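
For comparison, here is a rough, untested sketch of how the same query might look in the MATCH_RECOGNIZE syntax from the row-pattern proposal. It reuses the clicks table and symbols from the nPath example; the exact spelling of the MEASURES and AFTER MATCH clauses is my best guess at the standard:

SELECT last_pageid, MAX( count_page80 )
 FROM ( SELECT * FROM clicks WHERE category >= 0 )
 MATCH_RECOGNIZE (
   PARTITION BY sessionid
   ORDER BY ts
   MEASURES LAST( pageid )    AS last_pageid,   -- last page of the matched path
            COUNT( B.pageid ) AS count_page80,  -- visits to page 80
            COUNT( pageid )   AS count_any      -- total pages in the path
   ONE ROW PER MATCH
   AFTER MATCH SKIP TO NEXT ROW                 -- roughly MODE ( OVERLAPPING )
   PATTERN ( A (B|C)* )
   DEFINE A AS A.pageid = 50,
          B AS B.pageid = 80,
          C AS C.pageid <> 80 AND C.category IN (9,10)
 )
 WHERE count_any >= 5
 GROUP BY last_pageid
 ORDER BY MAX( count_page80 )

No self-joins needed: the pattern, the per-match aggregates, and the overlap semantics are all declared in one clause.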

Best Regards,

Jerry


On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <hvanhovell@questtec.nl> wrote:
Hi Jerry,

This is not on any roadmap. I briefly browsed through this, and it looks like some sort of window function with very awkward syntax. I think Spark provides better constructs for this using DataFrames/Datasets/nested data...
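
For a fixed-length pattern that is certainly true. For instance, "pageid 50 immediately followed by pageid 80" is just a window function; an untested sketch in Spark SQL, assuming a clicks table with sessionid, ts, and pageid columns:

-- flag rows where page 50 is immediately followed by page 80 in a session
SELECT sessionid, ts, pageid
FROM (
  SELECT sessionid, ts, pageid,
         LEAD(pageid) OVER (PARTITION BY sessionid ORDER BY ts) AS next_pageid
  FROM clicks
) t
WHERE pageid = 50 AND next_pageid = 80

Variable-length patterns like (B|C)* are where this approach runs out, though.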

Feel free to submit a PR.

Kind regards,

Herman van Hövell

2016-03-01 15:16 GMT+01:00 Jerry Lam <chilinglam@gmail.com>:
Hi Spark developers,

Would you consider adding support for "Pattern matching in sequences of rows"? More specifically, I'm referring to this: http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf

This is a very cool/useful feature for pattern matching over live-stream or archived data. It is sort of related to machine learning, because it is usually used in clickstream or path analysis. It is also related to streaming, because of the nature of the processing (mostly time-series data). And it is SQL because there is a good way to express and optimize the query.
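
To make the pain point concrete: even a tiny fixed pattern (page 50 followed later by page 80 in the same session) already needs a self-join in plain SQL, and every extra step in the pattern needs another join. An untested sketch, assuming a clicks table with sessionid, ts, and pageid columns:

-- page 50 followed (not necessarily immediately) by page 80, per session
SELECT a.sessionid, a.ts AS ts_page50, b.ts AS ts_page80
FROM clicks a
JOIN clicks b
  ON  b.sessionid = a.sessionid
  AND b.ts > a.ts
WHERE a.pageid = 50
  AND b.pageid = 80

A dedicated pattern construct avoids this blow-up and gives the optimizer something to work with.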

Best Regards,

Jerry



