samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Riccomini <criccom...@linkedin.com>
Subject Re: Using Samza to build a CEP (Complex Events Processing) system?
Date Sun, 25 Aug 2013 15:26:06 GMT
Hey Alex,

As I understand it, the CEP pattern you describing is, "look for a series
of events within some bounded time frame, and take an action based on the
combination of events." You use an example of three events arriving within
10 minutes of each other, consecutively. Wikipedia uses a similar example
(wedding bell event + man in suit event + woman in white dress event +
rice thrown event = wedding) on their CEP page.

This pattern can be implemented in Samza fairly easily using Samza's
key/value store (or some other StorageEngine, if you choose to implement
it). It's best to use a key/value store for this use case, since the
window might be quite long (10 minutes), and all events in the window
might not fit in memory. If you use Samza's key/value store, you can put
each message (and a timestamp) into the key/value store as the messages
arrive. You can then implement the WindowableTask interface along with the
StreamTask interface, and configure Samza to call window() on your task
every N seconds (say, task.window.ms=60000). The window method could then
do a range query on the key/value store, and check for message chains
(e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an
expected message was missing, you could then take some action (send an
alert, or whatever).

In general, when I think CEP, I think Esper (http://esper.codehaus.org/).
You should be able to implement a lot of CEP/SQL type commands (SELECT,
JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, etc)
using Samza's StreamTask interface, and is state management facilities.

Beyond state management, most features in Samza enable CEP processing, in
one way or another. From your perspective, you can look at Samza as the
underlying framework with which you might choose to implement a CEP type
system (think MapReduce is to Hive as Samza is to a CEP system). Specific
things that help are its WindowableTask interface, the partitioning model
(which lends itself to distributed joins and aggregation), and Samza's
state management features.

One thing to be aware of right now is Samza's "at least once" messaging
guarantee when failures occur (inherited from Kafka). You might receive
duplicate messages. This means you can potentially double count, if you're
doing aggregation. In the example you give (E1, E2, E3), this shouldn┬╣t be
a problem. We have plans to provide exactly once messaging, but we haven't
implemented the feature yet.

Cheers,
Chris

On 8/24/13 12:05 PM, "Alex The Rocker" <alex.m3tal@gmail.com> wrote:

>Hello,
>
>I just began to read about Samza, and I very excited about it (I was
>warned
>of its existence by Jay Kreps' post in Kafka users list, BTW).
>
>My first reaction is: are you guys using it at LinkedIn for applications
>which lies in the CEP (Complex Event Processing) system domain?
>
>To be more specific, would stateful Samza tasks be used in order to
>compute
>complex states such as "event E1 is followed by E2 then by E3 with less
>than 10 minutes interval between each event" ?
>
>I was looking at Storm for CEP, but as pointed out in Samza Storm page,
>Storm leaves state management to the bolts code, whereas Samza has
>"something".
>
>Beyond state management, what else would make Samza a good building block
>for a CEP?  Or a bad one?
>
>Thanks,
>Alex.


Mime
View raw message