flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2034) Add vision and roadmap for ML library to docs
Date Thu, 21 May 2015 09:38:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553976#comment-14553976 ]

ASF GitHub Bot commented on FLINK-2034:

Github user thvasilo commented on a diff in the pull request:

    --- Diff: docs/libs/ml/index.md ---
    @@ -20,8 +20,100 @@ specific language governing permissions and limitations
     under the License.
    +The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to the Flink
    +community. Our goal is to design and implement a system that is scalable and can deal with
    +problems of various sizes, whether your data size is measured in megabytes or terabytes and beyond.
    +We call this library FlinkML.
    +An important concern for developers of ML systems is the amount of glue code that developers are
    +forced to write [1] in the process of implementing an end-to-end ML system. Our goal with FlinkML
    +is to help developers keep glue code to a minimum. The Flink ecosystem provides a great setting to
    +tackle this problem, with its scalable ETL capabilities that can be easily combined inside the same
    +program with FlinkML, allowing the development of robust pipelines without the need to use yet
    +another technology for data ingestion and data munging.
    +Another goal for FlinkML is to make the library easy to use. To that end we will be providing
    +detailed documentation along with examples for every part of the system. Our aim is that users
    +will be able to get started with writing their ML pipelines quickly, using familiar programming
    +concepts and terminology.
    +Contrary to other data-processing systems, Flink exploits in-memory data streaming, and
    +executes iterative processing algorithms, which are common in ML. We plan to exploit the streaming
    +nature of Flink, and provide functionality designed specifically for data streams.
    +FlinkML will allow data scientists to test their models locally, using subsets of data, and then
    +use the same code to run their algorithms at a much larger scale in a cluster setting.
    +We are inspired by other open source efforts to provide ML systems, in particular
    +[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML pipelines, and Spark’s
    +[MLlib](https://spark.apache.org/mllib/) for providing ML algorithms that scale with problem and
    +cluster sizes.
    +We already have some of the building blocks for FlinkML in place, and will continue to extend the
    +library with more algorithms. An example of how simple it is to create a learning model in
    +FlinkML is given below:
    --- End diff --
    A separate example section is a good idea. Still, I would like to keep this very small example here, to make it clear that getting up and running with the library is just a few lines of code.

> Add vision and roadmap for ML library to docs
> ---------------------------------------------
>                 Key: FLINK-2034
>                 URL: https://issues.apache.org/jira/browse/FLINK-2034
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>              Labels: ML
>             Fix For: 0.9
> We should have a document describing the vision of the Machine Learning library in Flink and an up-to-date roadmap.

This message was sent by Atlassian JIRA
