flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paris Carbone <par...@kth.se>
Subject RE: Gather a distributed dataset
Date Mon, 12 Jan 2015 10:58:37 GMT
Hello Alexander,

Intermediate results are indeed looking promising, also for finally implementing a proper
flink-shell for exploratory data analysis.
We are also looking at the moment on how to implement a collect() for the flink-streaming
scala api that returns a Seq that can be consumed at the client side as a part of Flink-1344
[1]. It looks like intermediate results support will help, basically I would like to be able
to initiate a stream endpoint at the client side via the JobClient perhaps, referencing an
intermediate result id for example. For streaming this is a feature that Spark doesn't explicitly
have (one has to use foreach and collect on a dstream which is quite inefficient) so I guess
it would be nice to add. 


[1] https://issues.apache.org/jira/browse/FLINK-1344
From: Alexander Alexandrov [alexander.s.alexandrov@gmail.com]
Sent: Monday, January 12, 2015 11:42 AM
To: dev@flink.apache.org
Subject: Gather a distributed dataset

Hi there,

I wished for intermediate datasets, and Santa Ufuk made my wishes come true
(thank you, Santa)!

Now that FLINK-986 is in the mainline, I want to ask some practical

In Spark, there is a way to put a value from the local driver to the
distributed runtime via

val x = env.parallelize(...) // expose x to the distributed runtime
val y = dataflow(env, x) // y is produced by a dataflow which reads from x

and also to get a value from the distributed runtime back to the driver

val z = env.collect("y")

As far as I know, in Flink we have an equivalent for parallelize

val x = env.fromCollection(...)

but not for collect. Is this still the case?

If yes, how hard would it be to add this feature at the moment? Can you
give me some pointers?


View raw message