flink-dev mailing list archives

From Sachin Goel <sachingoel0...@gmail.com>
Subject Re: Multiple control flows in a program
Date Wed, 12 Aug 2015 07:54:27 GMT
Hi Till,
Thanks for the reply.
Thinking about it, though, several computational paths diverging from an
intermediate point will probably require re-computation anyway once the
number of paths exceeds the number of available slots.
Could that be an argument against a possible implementation?
Persisting the output of the non-deterministic step seems costly, however.
Is there any way to ensure that the data source is partitioned across the
different slots in exactly the same way every time?
For example, I am using a {{generateSequence}} call, and its internal
iterator, NumberSequenceIterator, seems deterministic in its operation, at
least in how the elements are grouped together. Surprisingly, though, I
observed different splits now and then.
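
One way to sidestep the ordering question, sketched here purely as an
assumption and not as anything Flink provides: derive each element's subset
from a seeded hash of the element itself instead of from a running random
number generator. Each subset's filter then gives the same answer no matter
how the source is split or in which order elements arrive. The class and
method names (HashSplit, mix, bucketOf, subset) are hypothetical, the sketch
uses plain Java collections instead of Flink data sets, and it assumes the
elements are distinct longs, as produced by a number sequence. Equal
elements would always land in the same subset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HashSplit {
    // SplitMix64-style finalizer: scrambles the bits of a long so that
    // consecutive inputs do not map to consecutive buckets.
    public static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Subset index in [0, numSplits) for one element. It depends only on
    // the element value and the seed, never on arrival order.
    public static int bucketOf(long element, long seed, int numSplits) {
        return (int) Math.floorMod(mix(element ^ mix(seed)), (long) numSplits);
    }

    // The filter that subset number `index` would apply to the input.
    public static List<Long> subset(List<Long> input, long seed, int numSplits, int index) {
        List<Long> out = new ArrayList<>();
        for (long x : input) {
            if (bucketOf(x, seed, numSplits) == index) {
                out.add(x);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> data = new ArrayList<>();
        for (long i = 1; i <= 1000; i++) {
            data.add(i);
        }
        // Simulate a recomputation that delivers the elements in another order.
        List<Long> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(1));

        int total = 0;
        for (int i = 0; i < 3; i++) {
            List<Long> a = subset(data, 42L, 3, i);
            List<Long> b = subset(shuffled, 42L, 3, i);
            Collections.sort(b);
            total += a.size();
            System.out.println("subset " + i + ": size " + a.size()
                    + ", same members after reordering: " + a.equals(b));
        }
        // Every element falls into exactly one subset, so the sizes add up.
        System.out.println("total: " + total);
    }
}
```

The trade-off is that subset sizes are only approximately equal, and the
split is tied to element values, so this would not suit data with many
duplicates.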

Regards
Sachin

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Aug 12, 2015 at 12:58 PM, Till Rohrmann <trohrmann@apache.org>
wrote:

> At the moment, Flink does not support materializing intermediate results
> from which you can continue your computation. When you execute jobs that
> share parts of their job graph, the shared parts are recomputed. When your
> job contains operators with non-deterministic output, there is no
> guarantee that the shared job graph parts produce the same results.
>
> What you can do is to execute the two jobs in parallel so that they share
> the output of the non-deterministic operator. Alternatively, you can persist
> the data set after your non-deterministic operator by writing it manually
> to disk and reading it from there.
>
> Cheers,
> Till
>
> On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel <sachingoel0101@gmail.com>
> wrote:
>
> > I'm writing a utility to split a data set randomly into several parts and
> > return an Array of data sets. However, whenever I operate on any of
> > these subsets, the program basically starts from the original data set,
> > and the split is performed again.
> >
> > To ensure that these subsets are mutually exclusive, we need not only to
> > generate the exact same sequence of random numbers, but also to ensure
> > that the elements arrive in a filter job in exactly the same order. How
> > do I achieve this?
> > Here's the code I've written:
> > https://github.com/apache/flink/pull/921/files
> >
> > Regards
> > Sachin
> >
> > -- Sachin Goel
> > Computer Science, IIT Delhi
> > m. +91-9871457685
> >
>
