flink-commits mailing list archives

From trohrm...@apache.org
Subject [2/7] flink git commit: [FLINK-2034] [ml] [docs] Adds FlinkML web documentation (introduction, vision, roadmap)
Date Fri, 22 May 2015 08:43:34 GMT
[FLINK-2034] [ml] [docs] Adds FlinkML web documentation (introduction, vision, roadmap)

Also added attribution for some of the LaTeX in the optimization framework documentation.

This closes #688.


Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/b602b2ee
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/b602b2ee
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/b602b2ee

Branch: refs/heads/master
Commit: b602b2ee1c9d130e97e844572f9827b29fbd9cf8
Parents: b3b6a9d
Author: Theodore Vasiloudis <tvas@sics.se>
Authored: Mon May 18 15:52:56 2015 +0200
Committer: Till Rohrmann <trohrmann@apache.org>
Committed: Fri May 22 09:41:00 2015 +0200

----------------------------------------------------------------------
 docs/libs/ml/contribution_guide.md | 26 +++++++++
 docs/libs/ml/index.md              | 58 +++++++++++++++++--
 docs/libs/ml/optimization.md       |  2 +
 docs/libs/ml/quickstart.md         | 26 +++++++++
 docs/libs/ml/vision_roadmap.md     | 98 +++++++++++++++++++++++++++++++++
 5 files changed, 204 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/contribution_guide.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/contribution_guide.md b/docs/libs/ml/contribution_guide.md
new file mode 100644
index 0000000..e0db10a
--- /dev/null
+++ b/docs/libs/ml/contribution_guide.md
@@ -0,0 +1,26 @@
+---
+title: "FlinkML - Contribution guide"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+Coming soon. In the meantime, check our list of [open issues on JIRA](https://issues.apache.org/jira/browse/FLINK-1748?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC)

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/index.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/index.md b/docs/libs/ml/index.md
index d36ce20..f774fcf 100644
--- a/docs/libs/ml/index.md
+++ b/docs/libs/ml/index.md
@@ -1,5 +1,5 @@
 ---
-title: "Machine Learning Library"
+title: "FlinkML - Machine Learning for Flink"
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
@@ -20,7 +20,18 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-## Link
+FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
+with a growing list of algorithms and contributors. With FlinkML we aim to provide
+scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML
+systems. You can see more details about our goals and where the library is headed in our [vision
+and roadmap here](vision_roadmap.html).
+
+* This will be replaced by the TOC
+{:toc}
+
+## Getting Started
+
+You can use FlinkML in your project by adding the following dependency to your `pom.xml`:
 
 {% highlight bash %}
 <dependency>
@@ -30,16 +41,51 @@ under the License.
 </dependency>
 {% endhighlight %}
 
-## Algorithms
+## Supported Algorithms
+
+### Supervised Learning
 
-* [Alternating Least Squares (ALS)](als.html)
 * [Communication efficient distributed dual coordinate ascent (CoCoA)](cocoa.html)
 * [Multiple linear regression](multiple_linear_regression.html)
+* [Optimization Framework](optimization.html)
+
+### Data Preprocessing
+
 * [Polynomial Base Feature Mapper](polynomial_base_feature_mapper.html)
 * [Standard Scaler](standard_scaler.html)
-* [Optimization Framework](optimization.html)
 
+### Recommendation
+
+* [Alternating Least Squares (ALS)](als.html)
 
-## Metrics
+### Utilities
 
 * [Distance Metrics](distance_metrics.html)
+
+## Example & Quickstart guide
+
+We already have some of the building blocks for FlinkML in place, and will continue to extend the
+library with more algorithms. An example of how simple it is to create a learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabeledVector is a feature vector with a label (class or real value)
+val data: DataSet[LabeledVector] = ...
+
+val learner = MultipleLinearRegression()
+  .setStepsize(1.0)
+  .setIterations(100)
+  .setConvergenceThreshold(0.001)
+
+learner.fit(data)
+
+// The learner can now be used to make predictions using learner.predict()
+{% endhighlight %}
+
+For a more comprehensive guide, you can check out our [quickstart guide](quickstart.html).
+
+## How to contribute
+
+Please check our [roadmap](vision_roadmap.html#roadmap) and [contribution guide](contribution_guide.html).
+You can also check out our list of
+[unresolved issues on JIRA](https://issues.apache.org/jira/browse/FLINK-1748?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC)
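
Note on the example added to index.md: the snippet configures a step size, an iteration count, and a convergence threshold. As a rough illustration of how such settings could drive a gradient-descent fit, here is a minimal plain-Scala sketch. It does not use Flink or FlinkML. `LabeledPoint`, `Model`, and `fit` are hypothetical names invented for this sketch, and the batch gradient descent shown is an assumption made for illustration, not FlinkML's actual implementation.

```scala
// Plain-Scala sketch; it does not depend on Flink or FlinkML.
object LinearRegressionSketch {
  // Hypothetical stand-in for a labeled feature vector.
  final case class LabeledPoint(label: Double, features: Vector[Double])

  // Fitted model: one weight per feature plus an intercept.
  final case class Model(weights: Vector[Double], intercept: Double) {
    def predict(features: Vector[Double]): Double =
      intercept + weights.zip(features).map { case (w, x) => w * x }.sum
  }

  // Batch gradient descent on half the mean squared error.
  // Stops after `iterations` steps, or earlier once the loss changes
  // by less than `convergenceThreshold` between two iterations.
  def fit(
      data: Seq[LabeledPoint],
      stepsize: Double,
      iterations: Int,
      convergenceThreshold: Double): Model = {
    require(data.nonEmpty, "training data must not be empty")
    val n = data.size.toDouble
    var model = Model(Vector.fill(data.head.features.size)(0.0), 0.0)
    var previousLoss = Double.MaxValue
    var i = 0
    var converged = false
    while (i < iterations && !converged) {
      val errors = data.map(p => model.predict(p.features) - p.label)
      val loss = errors.map(e => e * e).sum / n
      converged = math.abs(previousLoss - loss) < convergenceThreshold
      if (!converged) {
        // Gradient with respect to each weight and the intercept.
        val gradW = model.weights.indices.map { j =>
          data.zip(errors).map { case (p, e) => e * p.features(j) }.sum / n
        }
        val gradB = errors.sum / n
        model = Model(
          model.weights.zip(gradW).map { case (w, g) => w - stepsize * g },
          model.intercept - stepsize * gradB)
        previousLoss = loss
        i += 1
      }
    }
    model
  }

  def main(args: Array[String]): Unit = {
    // Noise-free samples of y = 2x + 1 for x in [0, 1].
    val data = (0 to 10).map(k => LabeledPoint(2.0 * k / 10 + 1.0, Vector(k / 10.0)))
    val model = fit(data, stepsize = 0.5, iterations = 1000, convergenceThreshold = 1e-12)
    println(model.predict(Vector(0.5)))
  }
}
```

The three arguments to `fit` play the same roles as the setters in the documented snippet; in a distributed setting the per-point gradient sums would be computed in parallel, which this single-machine sketch does not attempt.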

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/optimization.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/optimization.md b/docs/libs/ml/optimization.md
index 5d1f3a7..b30e0d0 100644
--- a/docs/libs/ml/optimization.md
+++ b/docs/libs/ml/optimization.md
@@ -231,3 +231,5 @@ val weightVector = weightDS
 // We can now use the weightVector to make predictions
 
 {% endhighlight %}
+
+Note: Some of the LaTeX math notation was adapted from Apache Spark MLlib's documentation.

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/quickstart.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md
new file mode 100644
index 0000000..43a3144
--- /dev/null
+++ b/docs/libs/ml/quickstart.md
@@ -0,0 +1,26 @@
+---
+title: "FlinkML - Quickstart guide"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+Coming soon.

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/vision_roadmap.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/vision_roadmap.md b/docs/libs/ml/vision_roadmap.md
new file mode 100644
index 0000000..1e319b6
--- /dev/null
+++ b/docs/libs/ml/vision_roadmap.md
@@ -0,0 +1,98 @@
+---
+title: "FlinkML - Vision and Roadmap"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Vision
+
+The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue code that developers are
+forced to write [1] in the process of implementing an end-to-end ML system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily combined inside the same
+program with FlinkML, allowing the development of robust pipelines without the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end we will be providing
+detailed documentation along with examples for every part of the system. Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data streams.
+
+FlinkML will allow data scientists to test their models locally and using subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML pipelines, and Spark’s
+[MLlib](https://spark.apache.org/mllib/) for providing ML algorithms that scale with problem and
+cluster sizes.
+
+## Roadmap
+
+The roadmap below can provide an indication of the algorithms we aim to implement in the coming
+months. If you are interested in helping out, please check our [contribution guide](contribution_guide.html).
+Items in **bold** have already been implemented:
+
+* Pipelines of transformers and learners
+* Data pre-processing
+  * **Feature scaling**
+  * **Polynomial feature base mapper**
+  * Feature hashing
+  * Feature extraction for text
+  * Dimensionality reduction
+* Model selection and performance evaluation
+  * Cross-validation for model selection and evaluation
+* Supervised learning
+  * Optimization framework
+    * **Stochastic Gradient Descent**
+    * L-BFGS
+  * Generalized Linear Models
+    * **Multiple linear regression**
+    * LASSO, Ridge regression
+    * Multi-class Logistic regression
+  * Random forests
+  * **Support Vector Machines**
+* Unsupervised learning
+  * Clustering
+    * K-means clustering
+  * PCA
+* Recommendation
+  * **ALS**
+* Text analytics
+  * LDA
+* Statistical estimation tools
+* Distributed linear algebra
+* Streaming ML
+
+**References:**
+
+[1] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary,
+and M. Young. _Machine learning: The high interest credit card of technical debt._ In SE4ML:
+Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.

