flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Palumbo <ap....@outlook.com>
Subject Re: Iteration Intermediate Output
Date Mon, 30 May 2016 21:05:22 GMT
Greg,

We ran into this Issue when implementing the Mahout bindings for Flink [1].  It ended up being
the major bottleneck for Mahout on Flink, and makes iterative algorithms basically unreasonable.
 While it is understook that that Flink's Delta-iterations are intended for use when iterating
over Flink DataSet, they are not always suitable to the task.  Eg. FlinkML's  ALS.scala [2][3]
which must flush intermediate results to the file system.

In-memory caching is a must have for any type of declarative ML DSL riding on top of fink.
 Eg, SystemML, Mahout, etc, or an internal Flink DSL.  The Mahout bindings are Linear Algebra
based.  In close collaboration with the Flink community [4] and based on a contribution from
an intern hired by Data Artisans  to complete this task and other members of the community,
we recently finished up work on the Mahout Distributed Linear Algebra bindings.

Unfortunately the lack of an in-memory cache made the outcome of this effort very sub-optimal.
 After the unfinished hand-off of the bindings from the Flink community to the Mahout community,
we were forced to find a workaround for the caching.  We used the template of flushing and
executing and flushing results to the File system [5], and had to release the bindings as
"Experimental".

Anyone building a declarative ML interface using the Flink Dataset API as a backend will run
into similar issues as has been reported on the stackoverflow thread u refer to and it would
be great to have this feature.


Its great to see this being talked about.

Andy

[1] http://mahout.apache.org/users/flinkbindings/flink-internals.html
[2] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala#L481
[3] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/common/FlinkMLTools.scala#L84
[4]https://issues.apache.org/jira/browse/MAHOUT-1570
[5]https://github.com/apache/mahout/blob/master/flink/src/main/scala/org/apache/mahout/flinkbindings/drm/CheckpointedFlinkDrm.scala#L139

________________________________________
From: Greg Hogan <code@greghogan.com>
Sent: Thursday, May 26, 2016 4:41:53 PM
To: dev@flink.apache.org
Subject: Iteration Intermediate Output

Hi y'all,

I think this is an oft-requested feature [0] and there are many graph
algorithms for which intermediate output is the desired result. I'd like to
take Stephan up on his offer [1] for pointers.

I have yet to get in deep, but I see that iteration tasks are treated
specially as IterationIntermediateTask for synchronization between
supersteps. Also, when OperatorTranslation and GraphCreatingVisitor are
walking the program DAG an iteration must be first reached through the tail.

Greg

[0]
http://stackoverflow.com/questions/37224140/possibility-of-saving-partial-outputs-from-bulk-iteration-in-flink-dataset
[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Intermediate-output-during-delta-iterations-td436.html

Mime
View raw message