For some similarity/correlation measures, it is also possible to discard
candidate pairs early, if a threshold for the resulting correlation is
given. This could help to fight the quadratic nature of the problem.
Looking for papers on similarity search might help.
s
On 07.04.2015 15:19, Till Rohrmann wrote:
> I don't know whether my ideas are much better than the cartesian product
> solution. As a matter of fact at some point we have to replicate the
> data to be able to compute the correlations in parallel. There are
> basically 3 ideas I had:
>
> 1. Broadcast U and V and simply compute the correlation for different
> shifts in a mapper. This only works if the time series data is small
> enough to be kept in memory of a task manager.
> 2. Create for each shift and element a join key, join the elements and
> reduce them to obtain the final result. This has a communication
> complexity of (n^2+n)/2 which is asymptotically the same as the
> cartesian product solution. But this solution will probably run for
> arbitrarily large correlation intervals.
>
> So let's say we have (u1, u2, u3) and (v1, v2, v3): Then we would first
> create the join keys: (1, 1, u1), (2, 1, u1), (3, 1, u1), (1, 2, u2),
> (2, 2, u2), (1, 3, u3), (1, 1, v1), (1, 2, v2), (2, 1, v2), (1, 3, v3),
> (2, 2, v3), (3, 1, v3). Then join on the first and second field and
> compute u*v with the first field as key. Reducing on this field let's
> you then compute the correlation.
>
> 3. Group the elements of each subinterval with respect to their shift
> value and join both grouped subintervals. Then compute the correlation.
> This again only works if the grouped data can be kept on the heap of the
> task manager.
>
> On Tue, Apr 7, 2015 at 1:29 PM, Sebastian <ssc@apache.org
> <mailto:ssc@apache.org>> wrote:
>
> How large are the individual time series?
>
> s
>
> On 07.04.2015 12:42, Kostas Tzoumas wrote:
>
> Hi everyone,
>
> I'm forwarding a private conversation to the list with Mats'
> approval.
>
> The problem is how to compute correlation between time series in
> Flink.
> We have two time series, U and V, and need to compute 1000
> correlation
> measures between the series, each measure shifts one series by
> one more
> item: corr(U[0:N], V[n:N+n]) for n=0 to n=1000.
>
> Any ideas on how one can do that without a Cartesian product?
>
> Best,
> Kostas
>
>  Forwarded message 
> From: *Mats Zachrison* <mats.zachrison@ericsson.com
> <mailto:mats.zachrison@ericsson.com>
> <mailto:mats.zachrison@__ericsson.com
> <mailto:mats.zachrison@ericsson.com>>>
> Date: Tue, Mar 31, 2015 at 9:21 AM
> Subject:
> To: Kostas Tzoumas <kostas@dataartisans.com
> <mailto:kostas@dataartisans.com>
> <mailto:kostas@dataartisans.__com
> <mailto:kostas@dataartisans.com>>>, Stefan Avesand
> <stefan.avesand@ericsson.com
> <mailto:stefan.avesand@ericsson.com>
> <mailto:stefan.avesand@__ericsson.com
> <mailto:stefan.avesand@ericsson.com>>>
> Cc: "stephan@dataartisans.com
> <mailto:stephan@dataartisans.com>
> <mailto:stephan@dataartisans.__com
> <mailto:stephan@dataartisans.com>>"
> <stephan@dataartisans.com <mailto:stephan@dataartisans.com>
> <mailto:stephan@dataartisans.__com
> <mailto:stephan@dataartisans.com>>>
>
> As Stefan said, what I’m trying to achieve is basically a nice
> way to do
> a correlation between two large time series. Since I’m looking
> for an
> optimal delay between the two series, I’d like to delay one of the
> series x observations when doing the correlation, and step x
> from 1 to
> 1000.____
>
> __ __
>
> Some pseudo code:____
>
> __ __
>
> For (x = 1 to 1000)____
>
> Shift Series A ‘x1’ steps____
>
> Correlation[x] = Correlate(Series A and Series B)____
>
> End For____
>
> __ __
>
> In R, using cor() and apply(), this could look like:____
>
> __ __
>
> shift < as.array(c(1:1000))____
>
> corrAB < apply(shift, 1, function(x) cor(data[x:nrow(data),
> ]$ColumnA, data[1:(nrow(data)  (x  1)), ]$ColumnB))____
>
> __ __
>
> __ __
>
> Since this basically is 1000 independent correlation
> calculations, it is
> fairly easy to parallelize. Here is an R example using foreach() and
> package doParallel:____
>
> __ __
>
> cl < makeCluster(3)____
>
> registerDoParallel(cl)____
>
> corrAB < foreach(step = c(1:1000)) %dopar% {____
>
> corrAB < cor(data[step:nrow(data), ]$ColumnA,
> data[1:(nrow(data)  (step  1)), ]$ColumnB)____
>
> }____
>
> stopCluster(cl)____
>
> __ __
>
> So I guess the question is – how to do this in a Flink
> environment? Do
> we have to define how to parallelize the algorithm, or can the
> cluster
> take care of that for us?____
>
> __ __
>
> And of course this is most interesting on a generic level –
> given the
> environment of a multicore or –processor setup running Flink,
> how hard
> is it to take advantage of all the clock cycles? Do we have to
> split the
> algorithm, and data, and distribute the processing, or can the
> system do
> much of that for us?____
>
> __
>
>
> __ __
>
> __
>
>
>
