spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jerry Lam <chiling...@gmail.com>
Subject Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data
Date Tue, 01 Mar 2016 16:48:39 GMT
Hi Henri,

Finally, there is a good reason for me to use Flink! Thanks for sharing
this information. This is exactly the solution I'm looking for especially
the ticket references a paper I was reading a week ago. It would be nice if
Flink adds support SQL because this makes business analyst (traders as
well) a way to express it.

Best Regards,

Jerry

On Tue, Mar 1, 2016 at 11:34 AM, Henri Dubois-Ferriere <henridf@gmail.com>
wrote:

> fwiw Apache Flink just added CEP. Queries are constructed programmatically
> rather than in SQL, but the underlying functionality is similar.
>
> https://issues.apache.org/jira/browse/FLINK-3215
>
> On 1 March 2016 at 08:19, Jerry Lam <chilinglam@gmail.com> wrote:
>
>> Hi Herman,
>>
>> Thank you for your reply!
>> This functionality usually finds its place in financial services which
>> use CEP (complex event processing) for correlation and pattern matching.
>> Many commercial products have this including Oracle and Teradata Aster Data
>> MR Analytics. I do agree the syntax a bit awkward but after you understand
>> it, it is actually very compact for expressing something that is very
>> complex. Esper has this feature partially implemented (
>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>> ).
>>
>> I found the Teradata Analytics documentation best to describe the usage
>> of it. For example (note npath is similar to match_recognize):
>>
>> SELECT last_pageid, MAX( count_page80 )
>>  FROM nPath(
>>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>>  PARTITION BY sessionid
>>  ORDER BY ts
>>  PATTERN ( 'A.(B|C)*' )
>>  MODE ( OVERLAPPING )
>>  SYMBOLS ( pageid = 50 AS A,
>>            pageid = 80 AS B,
>>            pageid <> 80 AND category IN (9,10) AS C )
>>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>           COUNT ( * OF B ) AS count_page80,
>>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>>  )
>>  WHERE count_any >= 5
>>  GROUP BY last_pageid
>>  ORDER BY MAX( count_page80 )
>>
>> The above means:
>> Find user click-paths starting at pageid 50 and passing exclusively
>> through either pageid 80 or pages in category 9 or category 10. Find the
>> pageid of the last page in the path and count the number of times page 80
>> was visited. Report the maximum count for each last page, and sort the
>> output by the latter. Restrict to paths containing at least 5 pages. Ignore
>> pages in the sequence with category < 0.
>>
>> If this query is written in pure SQL (if possible at all), it requires
>> several self-joins. The interesting thing about this feature is that it
>> integrates SQL+Streaming+ML in one (perhaps potentially graph too).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>> hvanhovell@questtec.nl> wrote:
>>
>>> Hi Jerry,
>>>
>>> This is not on any roadmap. I (shortly) browsed through this; and this
>>> looks like some sort of a window function with very awkward syntax. I think
>>> spark provided better constructs for this using dataframes/datasets/nested
>>> data...
>>>
>>> Feel free to submit a PR.
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
>>>
>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chilinglam@gmail.com>:
>>>
>>>> Hi Spark developers,
>>>>
>>>> Will you consider to add support for implementing "Pattern matching in
>>>> sequences of rows"? More specifically, I'm referring to this:
>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>
>>>> This is a very cool/useful feature to pattern matching over live
>>>> stream/archived data. It is sorted of related to machine learning because
>>>> this is usually used in clickstream analysis or path analysis. Also it is
>>>> related to streaming because of the nature of the processing (time series
>>>> data mostly). It is SQL because there is a good way to express and optimize
>>>> the query.
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>

Mime
View raw message