What is NW data? And each row is basically a double?
Cheers,
Till
On Thu, Apr 9, 2015 at 12:51 PM, MatsZ <mats.zachrison@ericsson.com> wrote:
> Hi,
> Thanks for the input! The data is some NW data, measured for about 1.5
> hours and aggregated per millisecond, i.e. 5.4 M rows. Sorry, I should
> have included that information to begin with.
> Br /Mats
>
>
>
> Sebastian Schelter wrote:
> > For some similarity/correlation measures, it is also possible to discard
> > candidate pairs early if a threshold for the resulting correlation is
> > given. This could help to fight the quadratic nature of the problem.
> > Looking for papers on similarity search might help.
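> >
> > A minimal standalone sketch of this kind of early pruning, assuming
> > Pearson correlation as the measure and a user-supplied threshold (the
> > object name, function name and bound are illustrative, not from any
> > library): suffix sums of squares are precomputed, and a candidate pair
> > is abandoned as soon as the Cauchy-Schwarz bound on the not-yet-seen
> > remainder proves the threshold is out of reach.
> >
> > object ThresholdPruning {
> >
> >   // Returns Some(correlation), or None as soon as the upper bound on
> >   // the final correlation falls below `threshold`.
> >   def corrIfAtLeast(u: Array[Double], v: Array[Double],
> >                     threshold: Double): Option[Double] = {
> >     val n = u.length
> >     require(v.length == n && n > 1)
> >     val mu = u.sum / n
> >     val mv = v.sum / n
> >     val du = u.map(_ - mu)
> >     val dv = v.map(_ - mv)
> >     val normU = math.sqrt(du.map(x => x * x).sum)
> >     val normV = math.sqrt(dv.map(x => x * x).sum)
> >     // suffix sums of squares: sufU(i) = sum of du(j)^2 for j >= i
> >     val sufU = du.scanRight(0.0)((x, acc) => acc + x * x)
> >     val sufV = dv.scanRight(0.0)((x, acc) => acc + x * x)
> >     var dot = 0.0
> >     var i = 0
> >     while (i < n) {
> >       dot += du(i) * dv(i)
> >       // Cauchy-Schwarz: the unseen terms add at most sqrt(remU * remV)
> >       val best = (dot + math.sqrt(sufU(i + 1) * sufV(i + 1))) / (normU * normV)
> >       if (best < threshold) return None  // prune this candidate early
> >       i += 1
> >     }
> >     Some(dot / (normU * normV))
> >   }
> > }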
> >
> > s
> >
> > On 07.04.2015 15:19, Till Rohrmann wrote:
> >> I don't know whether my ideas are much better than the Cartesian product
> >> solution. As a matter of fact, at some point we have to replicate the
> >> data to be able to compute the correlations in parallel. There are
> >> basically 3 ideas I had:
> >>
> >> 1. Broadcast U and V and simply compute the correlation for different
> >> shifts in a mapper. This only works if the time series data is small
> >> enough to be kept in the memory of a task manager (a sketch follows
> >> after this list).
> >> 2. Create for each shift and element a join key, join the elements and
> >> reduce them to obtain the final result. This has a communication
> >> complexity of (n^2+n)/2, which is asymptotically the same as the
> >> Cartesian product solution. But this solution will probably run even
> >> for arbitrarily large correlation intervals.
> >>
> >> So let's say we have (u1, u2, u3) and (v1, v2, v3): Then we would first
> >> create the join keys: (1, 1, u1), (2, 1, u1), (3, 1, u1), (1, 2, u2),
> >> (2, 2, u2), (1, 3, u3), (1, 1, v1), (1, 2, v2), (2, 1, v2), (1, 3, v3),
> >> (2, 2, v3), (3, 1, v3). Then join on the first and second field and
> >> compute u*v with the first field as key. Reducing on this field lets
> >> you then compute the correlation (see the second sketch below).
> >>
> >> 3. Group the elements of each subinterval with respect to their shift
> >> value and join both grouped subintervals. Then compute the correlation.
> >> This again only works if the grouped data can be kept on the heap of the
> >> task manager.
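> >>
> >> A minimal sketch of idea 1, assuming Flink's Scala DataSet API and toy
> >> (index, value) inputs (the object name, the toy data and the pearson()
> >> helper are illustrative): both series are broadcast to every task, and
> >> each mapper call handles one shift.
> >>
> >> import org.apache.flink.api.common.functions.RichMapFunction
> >> import org.apache.flink.api.scala._
> >> import org.apache.flink.configuration.Configuration
> >> import scala.collection.JavaConverters._
> >>
> >> object BroadcastShiftCorrelation {
> >>
> >>   // plain Pearson correlation of two equally long arrays
> >>   def pearson(a: Array[Double], b: Array[Double]): Double = {
> >>     val n = a.length
> >>     val ma = a.sum / n
> >>     val mb = b.sum / n
> >>     val cov = (a zip b).map { case (x, y) => (x - ma) * (y - mb) }.sum
> >>     val sa = math.sqrt(a.map(x => (x - ma) * (x - ma)).sum)
> >>     val sb = math.sqrt(b.map(y => (y - mb) * (y - mb)).sum)
> >>     cov / (sa * sb)
> >>   }
> >>
> >>   def main(args: Array[String]): Unit = {
> >>     val env = ExecutionEnvironment.getExecutionEnvironment
> >>
> >>     // toy data; in practice read the (index, value) pairs from files
> >>     val u = env.fromCollection((1 to 100).map(i => (i, i.toDouble)))
> >>     val v = env.fromCollection((1 to 100).map(i => (i, 2.0 * i)))
> >>     val shifts = env.fromCollection(0 until 50)  // delays to test
> >>
> >>     val correlations = shifts
> >>       .map(new RichMapFunction[Int, (Int, Double)] {
> >>         private var uArr: Array[Double] = _
> >>         private var vArr: Array[Double] = _
> >>
> >>         override def open(conf: Configuration): Unit = {
> >>           // DataSets are unordered, so restore the order by index
> >>           uArr = getRuntimeContext.getBroadcastVariable[(Int, Double)]("u")
> >>             .asScala.sortBy(_._1).map(_._2).toArray
> >>           vArr = getRuntimeContext.getBroadcastVariable[(Int, Double)]("v")
> >>             .asScala.sortBy(_._1).map(_._2).toArray
> >>         }
> >>
> >>         // correlate u[0 .. n-s) with v[s .. n)
> >>         override def map(s: Int): (Int, Double) =
> >>           (s, pearson(uArr.dropRight(s), vArr.drop(s)))
> >>       })
> >>       .withBroadcastSet(u, "u")
> >>       .withBroadcastSet(v, "v")
> >>
> >>     correlations.print()  // one (shift, correlation) pair per delay
> >>   }
> >> }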
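> >>
> >> And a comparable sketch of idea 2 under the same toy assumptions
> >> (names and the 0-based shift range are illustrative). It only
> >> aggregates the per-shift sum of products, which is the core of the
> >> computation; a full Pearson correlation would additionally aggregate
> >> the per-shift sums, sums of squares and counts.
> >>
> >> import org.apache.flink.api.scala._
> >>
> >> object JoinKeyShiftCorrelation {
> >>
> >>   def main(args: Array[String]): Unit = {
> >>     val env = ExecutionEnvironment.getExecutionEnvironment
> >>
> >>     val n = 100        // series length, assumed known up front
> >>     val maxShift = 50  // number of delays to test
> >>
> >>     val u = env.fromCollection((1 to n).map(i => (i, i.toDouble)))
> >>     val v = env.fromCollection((1 to n).map(i => (i, 2.0 * i)))
> >>
> >>     // u_i keeps position i and takes part in every shift s <= n - i
> >>     val uKeyed = u.flatMap { t =>
> >>       val (i, x) = t
> >>       for (s <- 0 to math.min(maxShift, n - i)) yield (s, i, x)
> >>     }
> >>     // v_j moves to position j - s and takes part in every shift s < j
> >>     val vKeyed = v.flatMap { t =>
> >>       val (j, y) = t
> >>       for (s <- 0 to math.min(maxShift, j - 1)) yield (s, j - s, y)
> >>     }
> >>
> >>     // join on (shift, position) and keep (shift, u*v)
> >>     val products = uKeyed.join(vKeyed)
> >>       .where(0, 1).equalTo(0, 1)
> >>       .map { pair => (pair._1._1, pair._1._3 * pair._2._3) }
> >>
> >>     // per-shift sum of the products
> >>     val sumsPerShift = products.groupBy(0).sum(1)
> >>
> >>     sumsPerShift.print()
> >>   }
> >> }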
> >>
> >> On Tue, Apr 7, 2015 at 1:29 PM, Sebastian <ssc@...> wrote:
> >>
> >> How large are the individual time series?
> >>
> >> s
> >>
> >> On 07.04.2015 12:42, Kostas Tzoumas wrote:
> >>
> >> Hi everyone,
> >>
> >> I'm forwarding a private conversation to the list with Mats'
> >> approval.
> >>
> >> The problem is how to compute the correlation between time series in
> >> Flink. We have two time series, U and V, and need to compute 1000
> >> correlation measures between the series, where each measure shifts
> >> one series by one more item: corr(U[0:N], V[n:N+n]) for n=0 to n=1000.
> >>
> >> Any ideas on how one can do that without a Cartesian product?
> >>
> >> Best,
> >> Kostas
> >>
> >> ---------- Forwarded message ----------
> >> From: Mats Zachrison <mats.zachrison@ericsson.com>
> >> Date: Tue, Mar 31, 2015 at 9:21 AM
> >> Subject:
> >> To: Kostas Tzoumas <kostas@dataartisans.com>,
> >>     Stefan Avesand <stefan.avesand@...>
> >> Cc: "stephan@dataartisans.com" <stephan@dataartisans.com>
> >>
> >> As Stefan said, what I’m trying to achieve is basically a nice way
> >> to do a correlation between two large time series. Since I’m looking
> >> for an optimal delay between the two series, I’d like to delay one of
> >> the series by x observations when doing the correlation, and step x
> >> from 1 to 1000.
> >>
> >> Some pseudo code:
> >>
> >> For (x = 1 to 1000)
> >>     Shift Series A ‘x-1’ steps
> >>     Correlation[x] = Correlate(Series A and Series B)
> >> End For
> >>
> >> In R, using cor() and apply(), this could look like:
> >>
> >> shift <- as.array(c(1:1000))
> >> corrAB <- apply(shift, 1, function(x)
> >>     cor(data[x:nrow(data), ]$ColumnA,
> >>         data[1:(nrow(data) - (x - 1)), ]$ColumnB))
> >>
> >> Since this basically is 1000 independent correlation calculations,
> >> it is fairly easy to parallelize. Here is an R example using
> >> foreach() and the doParallel package:
> >>
> >> cl <- makeCluster(3)
> >> registerDoParallel(cl)
> >>
> >> corrAB <- foreach(step = c(1:1000)) %dopar% {
> >>     corrAB <- cor(data[step:nrow(data), ]$ColumnA,
> >>                   data[1:(nrow(data) - (step - 1)), ]$ColumnB)
> >> }
> >>
> >> stopCluster(cl)
> >>
> >> So I guess the question is: how to do this in a Flink environment?
> >> Do we have to define how to parallelize the algorithm, or can the
> >> cluster take care of that for us?
> >>
> >> And of course this is most interesting on a generic level: given the
> >> environment of a multi-core or multi-processor setup running Flink,
> >> how hard is it to take advantage of all the clock cycles? Do we have
> >> to split the algorithm and the data, and distribute the processing,
> >> or can the system do much of that for us?