flink-user mailing list archives

From Jamie Grier <ja...@data-artisans.com>
Subject Re: Flink for historical time series processing
Date Fri, 01 Jul 2016 22:54:16 GMT
Hi Mindis,

This does actually sound like a good use case for Flink.  Without knowing
more details it's a bit hard to say which of the options you mention would
be most efficient, but my gut feeling is that the "one big dataset"
approach would be the way to go.

I think there probably is a simplified workflow here where you could unify
both the historical and realtime processing into a single Flink job.
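For what it's worth, the "write the aggregation once" idea can be sketched
independently of Flink's API. The snippet below is a minimal Python
illustration only (the class, the window size, and the mean-of-clicks
feature are hypothetical, not anything from your actual pipeline): one
keyed, count-based sliding-window aggregator that is called identically
when replaying user histories for training and when consuming live events.

```python
from collections import defaultdict, deque

# Hypothetical sketch (not Flink API): a count-based sliding-window
# aggregator that can be reused unchanged for both historical replay
# (training features) and real-time events (serving features).
class SlidingCountAggregator:
    def __init__(self, window_size):
        self.window_size = window_size     # number of events per window
        self.windows = defaultdict(deque)  # user_id -> recent event values

    def add_event(self, user_id, value):
        """Add one event and return the user's current window aggregate."""
        window = self.windows[user_id]
        window.append(value)
        if len(window) > self.window_size:
            window.popleft()               # evict the oldest event
        # Example feature: mean over the last `window_size` click values.
        return sum(window) / len(window)

# Both the historical pass and the live pass call add_event, so the
# aggregation logic exists in exactly one place.
agg = SlidingCountAggregator(window_size=3)
history = [("u1", 1.0), ("u1", 2.0), ("u1", 3.0), ("u1", 4.0)]
features = [agg.add_event(u, v) for u, v in history]
# features == [1.0, 1.5, 2.0, 3.0]
```

In Flink itself the equivalent would be a keyed stream with a count
window, with the same window function applied to a bounded (historical)
source and an unbounded (live) one.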

-Jamie


On Tue, Jun 28, 2016 at 11:15 AM, Mindaugas Zickus <
mindaugas.zickus@yahoo.com> wrote:

> Hi All,
>
>
>
> I wonder if Flink is the right tool for processing historical time series
> data, e.g. many small files.
>
> Our use case: we have clickstream histories (time series) of many users.
> We would like to calculate user specific sliding count window aggregates
> over past periods for a sample of users to create features to train machine
> learning models.
>
> As I see it, Flink would load user histories from some nosql database
> (e.g. hbase), process them and publish aggregates for machine learning.
> Flink also would update user histories with new events.
>
> I wonder if it is equally efficient to load and process each user history
> in parallel, or whether it's better to create one big dataset with
> multiple user histories and run a single map-reduce task on it?
>
> The first approach is more attractive since we could use the same event
> aggregation code both for processing historical user data for training
> models and for aggregating real-time user events into features for model
> execution.
>
> thanks, Mindis
>


-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
jamie@data-artisans.com
