flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
Date Thu, 11 Jun 2015 08:04:01 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581642#comment-14581642 ]

ASF GitHub Bot commented on FLINK-2072:

Github user thvasilo commented on a diff in the pull request:

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
    -Coming soon.
    +## Introduction
    +FlinkML is designed to make learning from your data a straightforward process, abstracting away
    +the complexities that usually come with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +As defined by Murphy [1], ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +## Linking with FlinkML
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +## Loading data
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of patients who had
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and the last (4th) column is the class, which indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the
    +[UCI page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information
    +on the data.
    +We can load the data as a `DataSet` of `String` tuples first:
    +{% highlight scala %}
    +// The wildcard import also brings in the implicit TypeInformation that readCsvFile needs
    +import org.apache.flink.api.scala._
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +{% endhighlight %}
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the tuple
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +{% highlight scala %}
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +{% endhighlight %}
    +We can then use this data to train a learner. We will, however, use another dataset for
    +building a learner; that will allow us to show how we can import other dataset formats.
    +**LibSVM files**
    +A common format for ML datasets is the LibSVM format, and a number of datasets using that format can be
    +found on [the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
    +FlinkML provides utilities for loading datasets using the LibSVM format through the `readLibSVM`
    +function available through the `MLUtils` object.
    --- End diff --
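
For readers unfamiliar with the LibSVM format mentioned at the end of the diff: each line holds a label followed by sparse `index:value` pairs. The following is a minimal plain-Scala sketch of how one such line could be parsed. It is not FlinkML's `readLibSVM` implementation; the names `LibSVMSketch`, `SparseExample` and `parseLine` are hypothetical and exist only for illustration, and the sample line is made up.

```scala
// Hypothetical illustration only; this is NOT how FlinkML's readLibSVM is implemented.
object LibSVMSketch {
  // A label plus sparse features, keyed by the index exactly as written in the file.
  final case class SparseExample(label: Double, features: Map[Int, Double])

  // Parses one line of the form "<label> <index>:<value> <index>:<value> ...".
  def parseLine(line: String): SparseExample = {
    val tokens = line.trim.split("\\s+")
    require(tokens.head.nonEmpty, "cannot parse an empty line")
    val label = tokens.head.toDouble
    val features = tokens.tail.map { token =>
      // Split each pair at the first ':' into feature index and feature value.
      val Array(index, value) = token.split(":", 2)
      index.toInt -> value.toDouble
    }.toMap
    SparseExample(label, features)
  }
}

// Made-up sample line: label 1, feature 1 = 0.5, feature 3 = 2.0
val example = LibSVMSketch.parseLine("1 1:0.5 3:2.0")
```

Features that do not appear on a line are implicitly zero, which is why a map (or a sparse vector) is a natural representation here.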

> Add a quickstart guide for FlinkML
> ----------------------------------
>                 Key: FLINK-2072
>                 URL: https://issues.apache.org/jira/browse/FLINK-2072
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>             Fix For: 0.9
> We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly.

This message was sent by Atlassian JIRA
