predictionio-user mailing list archives

From Pat Ferrel <>
Subject Re: Kafka support as Event Store
Date Thu, 06 Jul 2017 14:32:48 GMT
The EventStore is queryable in flexible ways, and so needs to support random access and indexing
of events. In short, it needs to be backed by a DB.

There are ML algorithms that do not need stored events, like online learners in the Kappa style.
But existing Templates and the PIO architecture do not support Kappa yet. So unless you are
developing your own Template, you will find that Kafka will not fill all the requirements.

It is, however, an excellent way to manage streams, so I tend to see it as a source for events
and have seen it used as such.

To be more precise, the reason your method may not work is that not all events can be aged out.
The item property events in PIO ($set, $unset, and $delete) are embedded in the stream but are
only meaningful in aggregate, since they record a sequence of changes to an object. If you drop
one of these events, the current state of the object can no longer be calculated with certainty.
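To illustrate the point (this is a sketch in Python, not PIO code): an item's current state is a fold over its entire property-event history, so no single $set can be discarded without losing information.

```python
# Why property events cannot simply be aged out: each $set merges fields
# into an item's state, so the current state is an aggregate over the
# whole history of events, not just the most recent one.

def current_state(events):
    """Fold $set/$unset/$delete events into the latest item state."""
    state = {}
    for e in events:
        if e["event"] == "$set":
            state.update(e["properties"])      # merge new field values
        elif e["event"] == "$unset":
            for key in e["properties"]:
                state.pop(key, None)           # remove the listed fields
        elif e["event"] == "$delete":
            state = {}                         # drop the object entirely
    return state

events = [
    {"event": "$set", "properties": {"category": "phone"}},
    {"event": "$set", "properties": {"price": 99}},
]

# With the full history both fields survive; drop the first $set and
# "category" is unrecoverable.
print(current_state(events))        # {'category': 'phone', 'price': 99}
print(current_state(events[1:]))    # {'price': 99}
```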

To truly use a streaming model we would have to introduce the idea of watermarks that snapshot
state in the stream, so that any event before the watermark can be dropped. This is how Kappa
learners work, but it is often not possible for existing Lambda learners. Many of the Lambda
learners can be converted, but not easily, and not all of them.
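The watermark idea above can be sketched as follows (illustrative Python, with invented names; treating each event as a simple merge for brevity): once aggregate state is snapshotted at a watermark, everything before it can be discarded, and replaying the tail from the snapshot reproduces a full replay.

```python
# Hedged sketch of watermarking: snapshot the aggregate state at a point
# in the stream, then drop all events before that point without losing
# information.

def apply(state, e):
    merged = dict(state)
    merged.update(e)            # treat each event as a $set-style merge
    return merged

def fold(state, events):
    for e in events:
        state = apply(state, e)
    return state

stream = [{"a": 1}, {"b": 2}, {"c": 3}, {"d": 4}]

# Snapshot after the first two events (the watermark)...
snapshot = fold({}, stream[:2])
# ...then the older events can be discarded; replaying only the tail from
# the snapshot gives the same state as replaying the whole stream.
assert fold(snapshot, stream[2:]) == fold({}, stream)
```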

On Jul 6, 2017, at 1:59 AM, Thomas POCREAU <> wrote:

After some internal discussions, it turns out I had misunderstood our needs.
Indeed, we will use Kafka HDFS connector <>
to dump expired data.
So we will basically have Kafka for the fresh events and HDFS for the past events.

2017-07-06 7:06 GMT+02:00 Thomas POCREAU < <>>:

Thanks for your responses.

Our goal is to use Kafka as our main event store for event sourcing. I'm pretty sure that
Kafka can be used with an infinite retention time.

We could use KStream and the Java SDK, but I would like to try an implementation
of PStore on top of spark-streaming-kafka.

My main concern is the HTTP interface for pushing events. We will probably use KStream
or websockets to load events into the Kafka topic used by an app channel.

Are you planning on supporting websockets as an alternative to batch import? 

Thomas Pocreau 

On 5 Jul 2017 at 21:36, "Pat Ferrel" < <>>
wrote:
No, we try not to fork :-) But it would be nice, as you say. It can be done with a small intermediary
app that just reads from a Kafka topic and sends events to a localhost EventServer, which would
allow events to be custom extracted from, say, log files (typical contents of Kafka). We’ve
done this in non-PIO projects.

The intermediary app should use Spark Streaming. I may have a snippet of code around if you
need it, but it just saves to micro-batch files. You’d then use the PIO Java SDK to send
them to the EventServer. A relatively simple thing.
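The transform step of such an intermediary app might look like this (a Python sketch for illustration; the 'user,action,item' line format is an assumption, while the JSON field names follow the EventServer's events.json API; the actual send via HTTP or the Java SDK is omitted):

```python
import json
from datetime import datetime, timezone

def log_line_to_event(line):
    """Parse a hypothetical 'user,action,item' log line from a Kafka topic
    into an EventServer-style event dict, ready to POST to events.json."""
    user, action, item = line.strip().split(",")
    return {
        "event": action,
        "entityType": "user",
        "entityId": user,
        "targetEntityType": "item",
        "targetEntityId": item,
        "eventTime": datetime.now(timezone.utc).isoformat(),
    }

# Transform a micro-batch of raw lines into a batch of event payloads.
batch = [log_line_to_event(l) for l in ["u1,view,i9", "u2,buy,i3"]]
print(json.dumps(batch[0], indent=2))
```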

Donald, what did you have in mind for deeper integration? I guess we could cut out the intermediate
app and integrate into a new Kafka aware EventServer endpoint where the raw topic input is
stored in the EventStore. This would force any log filtering onto the Kafka source.

On Jul 5, 2017, at 10:20 AM, Donald Szeto < <>>

Hi Thomas,

Supporting Kafka is definitely interesting and desirable. Are you looking to sink your
Kafka messages to the event store for batch processing, or to stream-process directly from Kafka?
The latter would require more work because Apache PIO does not yet support streaming properly.

Folks from ActionML might have a flavor of PIO that works with Kafka.


On Tue, Jul 4, 2017 at 8:34 AM, Thomas POCREAU < <>>

Thanks a lot for this awesome project.

I have a question regarding Kafka and its possible integration as an Event Store.
Do you have any plans on this matter?
Are you aware of anyone working on a similar subject?

