How large are the individual time series?
s
On 07.04.2015 12:42, Kostas Tzoumas wrote:
> Hi everyone,
>
> I'm forwarding a private conversation to the list with Mats' approval.
>
> The problem is how to compute correlation between time series in Flink.
> We have two time series, U and V, and need to compute 1000 correlation
> measures between the series, each measure shifts one series by one more
> item: corr(U[0:N], V[n:N+n]) for n=0 to n=1000.
>
> Any ideas on how one can do that without a Cartesian product?
>
> Best,
> Kostas
>
>  Forwarded message 
> From: *Mats Zachrison* <mats.zachrison@ericsson.com
> <mailto:mats.zachrison@ericsson.com>>
> Date: Tue, Mar 31, 2015 at 9:21 AM
> Subject:
> To: Kostas Tzoumas <kostas@dataartisans.com
> <mailto:kostas@dataartisans.com>>, Stefan Avesand
> <stefan.avesand@ericsson.com <mailto:stefan.avesand@ericsson.com>>
> Cc: "stephan@dataartisans.com <mailto:stephan@dataartisans.com>"
> <stephan@dataartisans.com <mailto:stephan@dataartisans.com>>
>
> As Stefan said, what I’m trying to achieve is basically a nice way to do
> a correlation between two large time series. Since I’m looking for an
> optimal delay between the two series, I’d like to delay one of the
> series x observations when doing the correlation, and step x from 1 to
> 1000.____
>
> __ __
>
> Some pseudo code:____
>
> __ __
>
> For (x = 1 to 1000)____
>
> Shift Series A ‘x1’ steps____
>
> Correlation[x] = Correlate(Series A and Series B)____
>
> End For____
>
> __ __
>
> In R, using cor() and apply(), this could look like:____
>
> __ __
>
> shift < as.array(c(1:1000))____
>
> corrAB < apply(shift, 1, function(x) cor(data[x:nrow(data),
> ]$ColumnA, data[1:(nrow(data)  (x  1)), ]$ColumnB))____
>
> __ __
>
> __ __
>
> Since this basically is 1000 independent correlation calculations, it is
> fairly easy to parallelize. Here is an R example using foreach() and
> package doParallel:____
>
> __ __
>
> cl < makeCluster(3)____
>
> registerDoParallel(cl)____
>
> corrAB < foreach(step = c(1:1000)) %dopar% {____
>
> corrAB < cor(data[step:nrow(data), ]$ColumnA,
> data[1:(nrow(data)  (step  1)), ]$ColumnB)____
>
> }____
>
> stopCluster(cl)____
>
> __ __
>
> So I guess the question is – how to do this in a Flink environment? Do
> we have to define how to parallelize the algorithm, or can the cluster
> take care of that for us?____
>
> __ __
>
> And of course this is most interesting on a generic level – given the
> environment of a multicore or –processor setup running Flink, how hard
> is it to take advantage of all the clock cycles? Do we have to split the
> algorithm, and data, and distribute the processing, or can the system do
> much of that for us?____
>
> __
>
>
> __ __
>
> __
>
>
